Re: Operator checkpointing not working

2018-06-18 Thread Thomas Weise
The default checkpoint interval is 30s and the interval between aggregator
failures is approximately 10s? In that case, no state will ever get
checkpointed and the operator will reset to its initial state.

Thomas

--
sent from mobile
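
If that is the case, one option is to checkpoint more often than the failure
interval. A minimal sketch, assuming the attribute names below exist under
com.datatorrent.api.Context in your Apex version (verify against your release;
the values are only illustrative):

import com.datatorrent.api.Context;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import org.apache.hadoop.conf.Configuration;

public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // 500 ms streaming windows x 10 windows per checkpoint = ~5 s checkpoint interval,
    // which is shorter than the ~10 s failure cycle in the log below.
    dag.setAttribute(Context.DAGContext.STREAMING_WINDOW_SIZE_MILLIS, 500);
    dag.setAttribute(Context.DAGContext.CHECKPOINT_WINDOW_COUNT, 10);
    // ... add the operators and streams of the example application here
  }
}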

On Mon, Jun 18, 2018, 12:45 PM Mateusz Zakarczemny 
wrote:

> Hi Pramod,
> I removed transient but the result is the same -
> https://github.com/Matzz/apex-example/blob/master/src/main/java/aptest/Aggregator.java
>
> Creating aggregator 2018-06-18T10:42:50.582
> Failing aggregator! 2018-06-18T10:42:50.707
> Creating FileOutput 2018-06-18T10:42:50.848
> [1.0]
> [1.0, 2.0]
> [1.0, 2.0, 3.0]
> [1.0, 2.0, 3.0, 4.0]
> [1.0, 2.0, 3.0, 4.0, 5.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
> Creating aggregator 2018-06-18T10:42:59.683
> Failing aggregator! 2018-06-18T10:42:59.794
> Creating FileOutput 2018-06-18T10:42:59.926
> Creating aggregator 2018-06-18T10:43:08.810
> Failing aggregator! 2018-06-18T10:43:08.918
> Creating FileOutput 2018-06-18T10:43:08.988
> [1.0]
> [1.0, 2.0]
> [1.0, 2.0, 3.0]
> Creating FileOutput 2018-06-18T10:43:18.059
> Creating aggregator 2018-06-18T10:43:18.142
> Failing aggregator! 2018-06-18T10:43:18.227
> [1.0]
> [1.0, 2.0]
> [1.0, 2.0, 3.0]
> [1.0, 2.0, 3.0, 4.0]
> Creating FileOutput 2018-06-18T10:43:27.130
> Creating aggregator 2018-06-18T10:43:27.135
> Failing aggregator! 2018-06-18T10:43:27.228
> [1.0]
> [1.0, 2.0]
> [1.0, 2.0, 3.0]
> [1.0, 2.0, 3.0, 4.0]
> [1.0, 2.0, 3.0, 4.0, 5.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
>
>
>
> pon., 18 cze 2018 o 00:16 Pramod Immaneni 
> napisał(a):
>
>> Hi Matuesz,
>>
>> It is because you have defined the list as transient in the Aggregator.
>> Transient fields are not serialized and therefore not included when the
>> checkpoint is created.
>>
>> Thanks
>> On Sun, Jun 17, 2018 at 2:35 PM Mateusz Zakarczemny <
>> m.zakarcze...@gmail.com> wrote:
>>
>>> Hi all,
>>> I created a simple app to test Apex fault tolerance. It is built from
>>> three main operators:
>>> - sequence generator - generates increasing numbers, one per time window
>>> - aggregator - adds each incoming number to a list and emits the whole
>>> list downstream
>>> - file output - writes incoming messages to a file
>>> To make it faulty, the aggregator operator throws an exception for 10% of
>>> messages. Source code is here https://github.com/Matzz/apex-example
>>>
>>> I'm running it on the sandbox docker image. I thought that even if the
>>> aggregation operator is faulty, the application would checkpoint its state,
>>> so over time the output list should get longer and longer.
>>> Unfortunately, it looks like on each failure the app resets its state
>>> to the beginning. Sample output:
>>>
>>> *tail -f -n 100 /tmp/stream.out *
>>>
>>> *Creating FileOutput 2018-06-16T22:07:01.033*
>>> *Creating aggreagator 2018-06-16T22:07:01.040*
>>> *Creating FileOutput 2018-06-16T22:07:01.041*
>>> *Creating FileOutput 2018-06-16T22:07:02.719*
>>> *Creating aggreagator 2018-06-16T22:07:02.722*
>>> *Creating FileOutput 2018-06-16T22:07:02.723*
>>> *Creating FileOutput 2018-06-16T22:08:48.178*
>>> *Creating aggreagator 2018-06-16T22:08:48.185*
>>> *Creating FileOutput 2018-06-16T22:08:48.186*
>>> *Creating FileOutput 2018-06-16T22:08:49.847*
>>> *Creating aggreagator 2018-06-16T22:08:49.850*
>>> *Creating FileOutput 2018-06-16T22:08:49.852*
>>> *Creating FileOutput 2018-06-16T22:08:56.736*
>>> *Creating aggreagator 2018-06-16T22:08:56.740*
>>> *Creating FileOutput 2018-06-16T22:08:56.743*
>>> *Creating FileOutput 2018-06-16T22:08:57.899*
>>> *Creating aggreagator 2018-06-16T22:08:57.899*
>>> *Creating FileOutput 2018-06-16T22:08:57.899*
>>> *Creating FileOutput 2018-06-16T22:09:10.951*
>>> *Creating FileOutput 2018-06-16T22:09:10.986*
>>> *Creating aggreagator 2018-06-16T22:09:11.001*
>>> *Failing sequence generator!2018-06-16T22:09:11.029*
>>> *Creating FileOutput 2018-06-16T22:09:19.484*
>>> *Creating FileOutput 2018-06-16T22:09:19.506*
>>> *Creating aggreagator 2018-06-16T22:09:19.518*
>>> *Failing sequence generator!2018-06-16T22:09:19.542*
>>> *Creating FileOutput 2018-06-16T22:09:28.646*
>>> *Creating FileOutput 2018-06-16T22:09:28.668*
>>> *Creating aggreagator 2018-06-16T22:09:28.680*
>>> *Failing sequence generator!2018-06-16T22:09:28.704*
>>> *[1.0]*
>>> *Creating FileOutput 2018-06-16T22:09:37.864*
>>> *Creating FileOutput 2018-06-16T22:09:37.886*
>>> *Creating 

Re: Recommended way to query hive and capture output

2017-10-23 Thread Thomas Weise
Have you explored using the JDBC interface to read from Hive?

Apex has JDBC connectors and it should be possible to use them with Hive.

Thanks,
Thomas
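
Not the operator API itself, but as a minimal sketch of the plain JDBC access
such an input operator could wrap (the HiveServer2 URL, credentials, table, and
columns are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch
{
  public static void main(String[] args) throws Exception
  {
    // The Hive JDBC driver; older driver versions may need this explicit registration.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Placeholder connection URL; host, port and database depend on the HiveServer2 setup.
    String url = "jdbc:hive2://hiveserver2-host:10000/default";
    try (Connection con = DriverManager.getConnection(url, "user", "");
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT col1, col2 FROM some_table WHERE col1 > 0")) {
      while (rs.next()) {
        // In an Apex input operator, this is where tuples would be emitted on an output port.
        System.out.println(rs.getString(1) + "\t" + rs.getString(2));
      }
    }
  }
}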


On Mon, Oct 23, 2017 at 11:52 AM, Vivek Bhide 
wrote:

> Hi
>
> I would like to know if there is any recommended way to execute queries on
> Hive tables (not inserts), capture the query output and process it?
>
> Regards
> Vivek
>
>
>
> --
> Sent from: http://apache-apex-users-list.78494.x6.nabble.com/
>


Re: how to find application master URL ( non proxy )

2017-09-26 Thread Thomas Weise
The CLI is just another REST client (you can see the calls it makes in the
source code and repeat those same calls directly). That won't solve the
problem of not being able to access the AM port directly though. Since the
RM proxy does not support POST, you would need a separate proxy mechanism
that does. Perhaps setting up an SSH tunnel with dynamic forwarding (SOCKS
proxy) will help?
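
To illustrate, a rough sketch of the two REST calls involved, using only
java.net (the RM host, port values, and the POST payload are placeholders; the
exact payload should match what the Apex CLI sends when changing log levels):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class StramRestSketch
{
  public static void main(String[] args) throws Exception
  {
    // 1) Ask the RM for running apps; each app entry carries the AM address
    //    (e.g. the amHostHttpAddress field), which is what get-app-info reports
    //    as appMasterTrackingUrl.
    URL apps = new URL("http://rm-host:8088/ws/v1/cluster/apps?states=RUNNING");
    HttpURLConnection list = (HttpURLConnection)apps.openConnection();
    try (InputStream in = list.getInputStream()) {
      // parse the JSON here (e.g. with jackson) and pick the entry for your application id
    }

    // 2) POST directly to the STRAM web service on the AM, bypassing the RM proxy.
    //    "apex-sandbox:35463" is the appMasterTrackingUrl value from the example below.
    URL loggers = new URL("http://apex-sandbox:35463/ws/v2/stram/loggers");
    HttpURLConnection post = (HttpURLConnection)loggers.openConnection();
    post.setRequestMethod("POST");
    post.setDoOutput(true);
    post.setRequestProperty("Content-Type", "application/json");
    try (OutputStream out = post.getOutputStream()) {
      out.write("{ /* placeholder payload, mirror what the Apex CLI sends */ }"
          .getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("STRAM response code: " + post.getResponseCode());
  }
}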

On Tue, Sep 26, 2017 at 4:18 PM, Sunil Parmar <sunilosu...@gmail.com> wrote:

> Thanks! Is the CLI the only way to get it? Is there an API to find this? We
> have a homegrown GUI built on the STRAM API, and setting log levels is one of
> the desirable features. The limitation in the prod environment is that we don't
> have access to all the worker nodes where the AppMaster could potentially be
> running, and the GUI runs from an edge node.
>
> Sunil Parmar
>
> On Tue, Sep 26, 2017 at 1:23 PM, Thomas Weise <t...@apache.org> wrote:
>
>> Once the app is running you get the AM REST host/port from the app info,
>> such as:
>>
>> apex (application_1506456479682_0001) > get-app-info
>>
>> ...
>>
>>   "appMasterTrackingUrl": "apex-sandbox:35463",
>>
>> ...
>>
>> Have a look at the Apex CLI or here
>> <https://github.com/atrato/atrato-server> for details on how the REST
>> calls are done.
>>
>> Thomas
>>
>> On Tue, Sep 26, 2017 at 11:53 AM, Sunil Parmar <sunilosu...@gmail.com>
>> wrote:
>>
>>> We're using plain vanilla version of apex in our environment ( no
>>> datatorrent or dtgateway ). We're using STRAM API to access the application
>>> details. We are able to find the STRAM endpoints from YARN application API.
>>>
>>> http://:8088/ws/v1/cluster/apps?state=RUNNING
>>>
>>> The tracking URL property points to STRAM service but it's exposed by
>>> YARN web proxy. i.e.
>>>
>>> http://:8088/proxy/application_1505865155823_0039/ws/v2/stram/
>>> http://:8088/proxy/application_1505865155823_0039/ws/v2/
>>> stram/logicalPlan
>>> http://:8088/proxy/application_1505865155823_0039/ws/v2/
>>> stram/physicalPlan
>>> http://:8088/proxy/application_1505865155823_0039/ws/v2/
>>> stram/loggers
>>>
>>> It works for monitoring, but the limitation is that it only
>>> allows users to submit "GET" type requests to the STRAM API. So it doesn't
>>> allow running POST APIs, e.g. to change the log level.
>>>
>>> How do I find the direct application master tracking URL (no proxy)? Is
>>> there a YARN API / STRAM API to do so? So far I have found a brute force
>>> way to scrape the AM log to find it, but I wanted to check with community experts
>>> if there is a better way.
>>>
>>> Thanks,
>>> Sunil Parmar
>>>
>>
>>
>


Re: Malhar input operator for connecting to Teradata

2017-08-28 Thread Thomas Weise
The JDBC poll operator uses a query translator that does not support all
dialects in the open source version.

For a one-off with Oracle, the following override for post-processing did
the trick:

@Override
protected String buildRangeQuery(Object key, int offset, int limit)
{
  String query = super.buildRangeQuery(key, offset, limit);
  if (!store.getDatabaseDriver().contains("OracleDriver")) {
    return query;
  }

  // hack the query for Oracle 12c since jooq does not support it in the OS version
  // https://www.jooq.org/doc/3.9/manual/sql-building/sql-statements/select-statement/limit-clause/
  // https://www.jooq.org/javadoc/3.8.x/org/jooq/SQLDialect.html
  // example: LIMIT 1 OFFSET 2  =>  OFFSET 2 ROWS FETCH NEXT 1 ROWS ONLY
  String mysqlClause = String.format("limit %s offset %s", limit, offset);
  if (offset == 0) {
    mysqlClause = String.format("limit %s", limit);
  }
  String oracleClause = String.format("OFFSET %s ROWS FETCH NEXT %s ROWS ONLY", offset, limit);
  query = query.replace(mysqlClause, oracleClause);
  LOG.debug("Rewritten query: {}", query);
  return query;
}

Also, the operator in release 3.7 has a number of other bugs, which you will
find fixed in master. In addition, there is another PR open that is necessary to
make it work with tables where data is purged.

Thomas

On Mon, Aug 28, 2017 at 3:42 PM, Vivek Bhide  wrote:

> Hi,
>
> I would like to know if there is any input operator available to connect
> and
> read data to Teradata? I tried using the JdbcPOJOPollInputOperator but the
> queries it forms are not as per teradata syntax and hence it fails
>
> Regards
> Vivek
>
>
>
> --
> View this message in context: http://apache-apex-users-list.
> 78494.x6.nabble.com/Malhar-input-operator-for-connecting-
> to-Teradata-tp1843.html
> Sent from the Apache Apex Users list mailing list archive at Nabble.com.
>


Re: How the application recovery works when its started with -originalAppId

2017-08-10 Thread Thomas Weise
There are a couple of bugs that were recently identified that look related to
this:

https://issues.apache.org/jira/browse/APEXMALHAR-2526
https://issues.apache.org/jira/browse/APEXCORE-767

Perhaps the fix for the first item is what you need?

Thomas


On Thu, Aug 10, 2017 at 8:41 PM, Pramod Immaneni 
wrote:

> I would dig deeper into why serde of the linked hashmap is failing. There
> is additional logging you can enable in kryo to get more insight. You can
> even try a standalone kryo test to see if it is a problem with the
> linkedhashmap itself or because of some other object that was added to it.
> You could try a newer version of kryo to check if the serde works in a
> newer version because some bug was fixed. Once you get more insight into the
> cause, we will be in a better position to determine the best approach.
>
> Thanks
>
> On Thu, Aug 10, 2017 at 5:04 PM Vivek Bhide 
> wrote:
>
>> Thanks Pramod, this seems to have done the trick. I will check again when I
>> have some data to process to see if that goes well with it. I am quite
>> confident that it will.
>>
>> Just curious: is this the best way to handle this issue, or is there a
>> more elegant way it can be addressed?
>>
>> Regards
>> Vivek
>>
>>
>>
>> --
>> View this message in context: http://apache-apex-users-list.
>> 78494.x6.nabble.com/How-the-application-recovery-works-
>> when-its-started-with-originalAppId-tp1821p1829.html
>> Sent from the Apache Apex Users list mailing list archive at Nabble.com.
>>
>
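
A standalone round trip along the lines suggested above might look like this
minimal sketch (Kryo API from com.esotericsoftware.kryo; put the actual value
types held by the failing operator into the map):

import java.util.LinkedHashMap;
import java.util.Map;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class KryoRoundTripTest
{
  public static void main(String[] args)
  {
    Map<String, Object> map = new LinkedHashMap<>();
    map.put("k1", "v1");
    // put instances of the classes that are actually stored in the operator's map here

    Kryo kryo = new Kryo();
    Output output = new Output(4096, -1);   // growable buffer
    kryo.writeClassAndObject(output, map);
    byte[] bytes = output.toBytes();

    Input input = new Input(bytes);
    @SuppressWarnings("unchecked")
    Map<String, Object> copy = (Map<String, Object>)kryo.readClassAndObject(input);
    System.out.println("round trip ok: " + map.equals(copy));
  }
}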


Re: How to consume from two different topics in one apex application

2017-08-03 Thread Thomas Weise
I don't think you can use both the Kafka client 0.8.x and 0.9.x within the same
application. The dependencies overlap and will conflict. You can use the 0.8
client to talk to a 0.9 server, but since you want to use SSL that's not
possible (security was only added in 0.9).

I have not tried it, but you might be able to use the Maven shade
plugin to shade the Apex Kafka operator along with the Kafka client so both
versions can coexist within one application.

Thomas


On Tue, Aug 1, 2017 at 5:42 AM, rishi  wrote:

> Hi, thanks for the reply.
>
> I tried to consume from two different topics in the same app and I am getting
> an error (*java.lang.NoSuchMethodError:
> kafka.utils.ZkUtils.getAllBrokersInCluster(Lorg/I0Itec/zkclient/ZkClient;)
> Lscala/collection/Seq;*).
>
> When I tried consuming from Kafka 0.9 using this operator
> (KafkaSinglePortInputOperator), I was able to do it successfully, but when I
> add one more operator (KafkaSinglePortByteArrayInputOperator) to consume from
> 0.8 in the same DAG I get the error.
>
> For testing I am not merging the Kafka output into any operator; it is
> writing to two different locations in HDFS.
>
> Looks like there is some version issue coming up, which I am not able to
> identify. Any help is highly appreciated.
>
> My pom.xml looks like this=
>
> 
>
> 3.4.0
> 3.6.0
> lib/*.jar
> 2.7.1.2.3.4.0-3485
> 1.1.2.2.3.4.0-3485
> 0.9.0.1
>  0.9.0.1-cp1
> 2.0.1
>
> 1.7.7
> 1.1
> 2.9.1
> 0.38
> 4.10
>   
>
>   
> 
> HDPReleases
> HDP Releases
>
> http://repo.hortonworks.com/content/repositories/releases/
> default
> 
> 
> HDP Jetty Hadoop
> HDP Jetty Hadoop
>
> http://repo.hortonworks.com/content/repositories/jetty-hadoop/
> default
> 
> 
> confluent
> http://packages.confluent.io/maven
> 
> 
> 
> 
> org.apache.apex
> malhar-library
> ${malhar.version}
>
>
> 
> 
> org.apache.apex
> apex-common
> ${apex.version}
> provided
> 
> 
> junit
> junit
> ${junit.version}
> test
> 
> 
> org.apache.apex
> apex-engine
> ${apex.version}
> test
> 
>
> 
> org.apache.apex
> malhar-contrib
> ${malhar.version}
> 
>
> 
> org.apache.apex
> malhar-kafka
> ${malhar.version}
> 
>
> 
> org.apache.avro
> avro
> ${avro.version}
> 
>
>  
> org.apache.kafka
> kafka_2.11
> ${confluent.kafka.version}
> 
>   
> org.apache.kafka
> kafka_2.10
> 0.8.1.1
> 
>
> 
> io.confluent
> kafka-avro-serializer
> ${kafka.avro.srlzr.version}
> 
> 
> log4j
> log4j
> 
> 
> org.slf4j
> slf4j-log4j12
> 
> 
> 
>
> 
> com.googlecode.json-simple
> json-simple
> ${json.version}
> 
>
> 
> org.apache.hbase
> hbase-client
> ${hbase.version}
> provided
> 
>
> 
> joda-time
> joda-time
> ${jodatime.version}
> 
>
> 
> de.javakaffee
> kryo-serializers
> ${kyroserializer.version}
> 
> 
>
>
> My DAG looks like this=>
>
> public void populateDAG(DAG dag, Configuration conf)
> {
>   KafkaSinglePortInputOperator kafkaInTtce =
>       dag.addOperator("Kafka_Input_SSL", new KafkaSinglePortInputOperator());
>
>   kafkaInTtce.setInitialPartitionCount(Integer.parseInt(conf.get("kafka.partitioncount")));
>   kafkaInTtce.setTopics(conf.get("kafka.ssl.topic"));
>   kafkaInTtce.setInitialOffset(conf.get("kafka.offset"));
>   kafkaInTtce.setClusters(conf.get("kafka.cluster"));
>   kafkaInTtce.setConsumerProps(getKafkaProperties(conf.get("kafka.cluster"), conf));
>   kafkaInTtce.setStrategy(conf.get("kafka.strategy"));
>
>   AvroBytesConversionOperator avroConversion = dag.addOperator("Avro_Convert",
>       new AvroBytesConversionOperator(conf.get("kafka.schema.registry")));
>   ColumnsExtractOperator fieldExtract = dag.addOperator("Field_Extract",
>       new ColumnsExtractOperator());
>
>   WriteToHdfs hdfs = dag.addOperator("To_HDFS",
>       new WriteToHdfs(conf.get("hdfs.filename")));
>   hdfs.setMaxLength(268435456); // new file rotates after every 

Re: Apache Apex with Apache slider

2017-08-01 Thread Thomas Weise
Sounds like what you want is a Slider application that monitors the Apex
application?

Yes, that would be possible. The Slider app/script could use the RM to
locate the app and check the YARN status, and then poll the Apex metrics
through the Apex AM REST API.

Thomas


On Tue, Aug 1, 2017 at 11:22 AM, Vivek Bhide  wrote:

> Hi Pramod,
>
> The main reason we are looking at marrying Apex with Slider is to make use of
> Slider's ability to monitor and restart an application in case of
> failure. We have seen a couple of times that an Apex streaming application got
> killed because of some sporadic issues with the cluster and there was no
> monitoring in place to alert about the failure and restart it.
>
> To avoid these situations, and probably to have some script that will
> monitor the job and alert/restart it, we are thinking of using Slider
> instead.
>
> Regards
> Vivek
>
>
>
> --
> View this message in context: http://apache-apex-users-list.
> 78494.x6.nabble.com/Apache-Apex-with-Apache-slider-tp1802p1804.html
> Sent from the Apache Apex Users list mailing list archive at Nabble.com.
>


Kafka producer does not propagate error when unable to connect to broker

2017-07-16 Thread Thomas Weise
I noticed that when the Kafka output operator (0.9) is configured with a
wrong port or is missing the security setup required by the cluster, it will
be in the "ACTIVE" state but not do anything, essentially ignoring the error.

I see that part of the problem is that the producer is async. That makes me
wonder how we can actually make any assertions about the processing
guarantee of this operator.

Thanks,
Thomas
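
One way to surface such failures, sketched against the plain 0.9 kafka-clients
API (broker address, topic, and serializers are placeholders), is to pass a
Callback to send() and remember the exception instead of dropping it:

import java.util.Properties;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ProducerErrorSketch
{
  public static void main(String[] args)
  {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker-host:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      producer.send(new ProducerRecord<>("some-topic", "key", "value"), new Callback()
      {
        @Override
        public void onCompletion(RecordMetadata metadata, Exception exception)
        {
          if (exception != null) {
            // In an operator this is where the error should be propagated
            // (e.g. remembered and rethrown in the operator thread) instead of being ignored.
            exception.printStackTrace();
          }
        }
      });
      producer.flush();
    }
  }
}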


Apex HTTP Authentication

2017-07-12 Thread Thomas Weise
I'm working on a secure cluster that has authentication enabled for the
YARN services.

In my Apex setup, I have:

 

  <property>
    <name>apex.attr.STRAM_HTTP_AUTHENTICATION</name>
    <value>DISABLE</value>
  </property>

 

"DISABLE - Disable authentication for web services."

That's not what happens though; it rather follows the Hadoop setting and
fails because in this case Kerberos is enabled and the keytab is not
configured.

I think that if a DISABLE option is advertised, then it should turn off the
authentication that gets inherited from the node manager environment.

Configuration config = getConfig();
if (SecurityUtils.isStramWebSecurityEnabled()) {
  config = new Configuration(config);
  config.set("hadoop.http.filter.initializers", StramWSFilterInitializer.class.getCanonicalName());
} else {
  if (!"simple".equals(config.get(SecurityUtils.HADOOP_HTTP_AUTH_PROP))) {
    LOG.warn("Found http authentication {} but authentication was disabled in Apex.",
        config.get(SecurityUtils.HADOOP_HTTP_AUTH_PROP));
    config = new Configuration(config);
    // turn off authentication for Apex as specified by user
    config.set(SecurityUtils.HADOOP_HTTP_AUTH_PROP, "simple");
  }
}

It will also help tremendously when warnings from jetty are not swallowed
due to

org.mortbay.log.Log.setLog(null);

Otherwise there is just a "handler failed" message and the user has no way
of knowing what went wrong without hacking the Apex code.

Thanks,
Thomas


Re: How to consume from Kafka-V10 (SSL enabled) topic using Apex

2017-07-03 Thread Thomas Weise
You can use the operators that work with the 0.9 Kafka client to access a 0.10
Kafka cluster (Malhar release 3.7 only has operators that work with the 0.9
client; the next release will have 0.10 client support).

Thomas
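
For the SSL part, a sketch of the properties that would go into consumerProps
as shown in the snippet quoted below (standard Kafka client keys; all paths and
passwords are placeholders):

import java.util.Properties;

public class KafkaSslProps
{
  // Builds the SSL properties to pass to input.setConsumerProps(prop).
  public static Properties sslConsumerProps()
  {
    Properties prop = new Properties();
    prop.put("security.protocol", "SSL");
    prop.put("ssl.truststore.location", "/path/to/client.truststore.jks"); // placeholder
    prop.put("ssl.truststore.password", "truststore-password");            // placeholder
    // Only needed when the brokers require client authentication:
    prop.put("ssl.keystore.location", "/path/to/client.keystore.jks");     // placeholder
    prop.put("ssl.keystore.password", "keystore-password");                // placeholder
    prop.put("ssl.key.password", "key-password");                          // placeholder
    return prop;
  }
}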

On Mon, Jul 3, 2017 at 4:52 AM, Chaitanya Chebolu  wrote:

> Hi Rishi,
>
>   You can use KafkaSinglePortInputOperator which is under malhar-kafka010
> module. Set the SSL properties to the "consumerProps" parameter as below:
>
> KafkaSinglePortInputOperator input = dag.addOperator("InputFromKafka",
> new KafkaSinglePortInputOperator());
> Properties prop = new Properties();
> // Set the ssl properties to "prop"
> input.setConsumerProps(prop);
>
> Regards,
> Chaitanya
>
>
> On Mon, Jul 3, 2017 at 4:43 PM, rishi  wrote:
>
>> Hi,
>>
>> Is there any built-in operator in Apex which I can use to consume messages
>> from a Kafka topic [which is SSL enabled and version 10]?
>>
>> Thanks
>> Rishi
>>
>>
>>
>> --
>> View this message in context: http://apache-apex-users-list.
>> 78494.x6.nabble.com/How-to-consume-from-Kafka-V10-SSL-enable
>> d-topic-using-Apex-tp1773.html
>> Sent from the Apache Apex Users list mailing list archive at Nabble.com.
>>
>
>
>
> --
>
> *Chaitanya*
>
> Software Engineer
>
> E: chaita...@datatorrent.com | Twitter: @chaithu1403
>
> www.datatorrent.com  |  apex.apache.org
>
>
>


Re: JDBC poll operator performance

2017-06-27 Thread Thomas Weise
Records can be distributed between partitions based on key ranges; no
sorting is needed for that.

You may need sorting for repeatable reads within a partition. But even then
the query should filter so as not to fetch what was already loaded. Without a
WHERE clause, there is an unnecessary repeated full index scan.

The operator has other deficiencies: poor error handling in the
poll thread, and the tuple conversion does not work with all column
expressions. I'm going to submit tickets and fix some of these issues.

The documentation also fails to mention that it won't work with Oracle,
because the dialect is not supported in the jooq open source version.

Thomas
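
As a plain JDBC illustration of that key-range filter (table, columns, and
connection handling are placeholders; in the operator this would take the place
of the ORDER BY/OFFSET/LIMIT range query):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class IncrementalPollSketch
{
  // Fetch only rows added since the last emitted key value; no sort or count select needed.
  public static Timestamp pollSince(Connection con, Timestamp lastSeen) throws Exception
  {
    String sql = "SELECT ID, PAYLOAD, UPDATE_DATE FROM SOME_TABLE WHERE UPDATE_DATE > ?";
    try (PreparedStatement ps = con.prepareStatement(sql)) {
      ps.setTimestamp(1, lastSeen);
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          // emit the tuple here; track the largest key value seen so far
          Timestamp updated = rs.getTimestamp("UPDATE_DATE");
          if (updated.after(lastSeen)) {
            lastSeen = updated;
          }
        }
      }
    }
    return lastSeen;
  }
}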


On Tue, Jun 27, 2017 at 2:12 AM, Hitesh Kapoor <hit...@datatorrent.com>
wrote:

> I agree with Bhupesh; the DB does not guarantee that your data will be
> retrieved in a specific or sorted order if an 'order by' clause is not
> given in the query.
> IMO in the case of our poll operator we will have to sort the records for
> non-poller partitions to ensure all records are emitted and no two records
> are emitted by different partitions.
> I think we can avoid sorting for the poller partition with the idea
> that Thomas has suggested.
>
> --Hitesh
> On Tue, Jun 27, 2017 at 10:48 AM, Bhupesh Chawda <bhup...@datatorrent.com>
> wrote:
>
>> IMO we would need to sort since, even though the keys are monotonically
>> increasing, the DB may not return the data in the same order. It depends on
>> the implementation and file format of the given DB.
>>
>> ~ Bhupesh
>>
>>
>> ___
>>
>> Bhupesh Chawda
>>
>> E: bhup...@datatorrent.com | Twitter: @bhupeshsc
>>
>> www.datatorrent.com  |  apex.apache.org
>>
>>
>>
>> On Tue, Jun 27, 2017 at 9:16 AM, Thomas Weise <t...@apache.org> wrote:
>>
>>> Hi,
>>>
>>> It seems the poll operator performs unnecessary operations in the case
>>> where the "key" column values in the source table are monotonically increasing.
>>> There should be no need to sort or do count selects. Instead it should be
>>> sufficient to just filter with the key range.
>>>
>>> Let's say the key column is a timestamp that is set by a trigger, one
>>> could use:
>>>
>>> SELECT ... WHERE UPDATE_DATE > ""
>>>
>>> Instead of operating with ORDER BY, OFFSET and LIMIT.
>>>
>>> Thanks
>>>
>>>
>>>
>>
>


JDBC poll operator performance

2017-06-26 Thread Thomas Weise
Hi,

It seems the poll operator performs unnecessary operations in the case
where the "key" column values in the source table are monotonically increasing.
There should be no need to sort or do count selects. Instead it should be
sufficient to just filter with the key range.

Let's say the key column is a timestamp that is set by a trigger, one could
use:

SELECT ... WHERE UPDATE_DATE > ""

Instead of operating with ORDER BY, OFFSET and LIMIT.

Thanks


Re: What is recommended way to achieve exactly once tuple processing in case of operator failure scenario

2017-06-17 Thread Thomas Weise
There is no way to avoid reprocessing tuples if the goal is to achieve
exactly-once results. Please have a look at http://apex.apache.org/docs.html
- the presentation about fault tolerance and the blog "End-to-end exactly-once".

The platform alone cannot guarantee exactly-once results. Operators mutating
state in external systems need to play ball, and many of the frequently used
connectors in the Apex library do that.

The operator processing mode "EXACTLY_ONCE" should be deprecated. It leads
to unnecessary checkpointing overhead and cannot guarantee exactly-once
results.

Thomas
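
To illustrate what "playing ball" means for an output operator, a rough sketch
of the committed-window-id pattern (the class and the store access are made up
for illustration; the library connectors mentioned above implement variations
of this):

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.common.util.BaseOperator;

public class IdempotentOutputSketch extends BaseOperator
{
  private transient long currentWindowId;
  private transient boolean skip;        // true while replaying already-written windows
  private long committedWindowId;        // checkpointed; also persisted in the external store

  public final transient DefaultInputPort<Object> input = new DefaultInputPort<Object>()
  {
    @Override
    public void process(Object tuple)
    {
      if (!skip) {
        // buffer or write the tuple for the external system
      }
    }
  };

  @Override
  public void setup(OperatorContext context)
  {
    // read the window id last written to the external store (placeholder)
    committedWindowId = readCommittedWindowIdFromStore();
  }

  @Override
  public void beginWindow(long windowId)
  {
    currentWindowId = windowId;
    skip = windowId <= committedWindowId;  // replayed window: results were already written
  }

  @Override
  public void endWindow()
  {
    if (!skip) {
      // atomically write the window's output together with currentWindowId (placeholder)
      committedWindowId = currentWindowId;
    }
  }

  private long readCommittedWindowIdFromStore()
  {
    return -1;  // placeholder for a lookup in the external system
  }
}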


On Sat, Jun 17, 2017 at 2:21 PM, Vivek Bhide  wrote:

> After any of the operators fails during processing, it always recovers from
> the last checkpointed state. So it will reprocess all the tuples which were
> processed before the failure but not checkpointed. What is a recommended way
> to avoid this from happening? Is there any setting in Apex that enables
> checkpoint creation just before the operator completely gets killed or
> fails? If not, how can it be achieved?
>
> I tried tweaking the operator processing mode to EXACTLY_ONCE. I also checked
> details about CountStoreOperator, but to make sure I cover all the individual
> operator failures, I would have to put a CountStoreOperator after every
> operator. Not sure if this is a really scalable solution.
>
> What is the best recommended way to achieve this?
>
>
>
> --
> View this message in context: http://apache-apex-users-list.
> 78494.x6.nabble.com/What-is-recommended-way-to-achieve-
> exactly-once-tuple-processing-in-case-of-operator-failure-
> scenario-tp1740.html
> Sent from the Apache Apex Users list mailing list archive at Nabble.com.
>


Re: Windowing

2017-06-09 Thread Thomas Weise
Here is an example that uses the window operator to compute top words and
time series from Tweets (and optionally outputs the results to websocket):

https://github.com/tweise/apex-samples/tree/master/twitter



On Thu, Jun 8, 2017 at 11:11 AM, Tushar Gosavi 
wrote:

> You can find more apex example applications in malhar repository
> https://github.com/apache/apex-malhar/tree/master/examples
>
> For time based windows aggregation you can take a look at yahoo streaming
> benchmark implemented in Apex.
> https://github.com/yahoo/streaming-benchmarks/blob/
> master/apex-benchmarks/src/main/java/apex/benchmark/
> ApplicationDimensionComputation.java
>
> Regards,
> - Tushar.
>
>
> On Thu, Jun 8, 2017 at 10:05 PM, vikram patil 
> wrote:
>
>> Hi Rick,
>>
>> Hopefully this will help you, if you haven't already looked at it.
>> https://apex.apache.org/docs/malhar/operators/windowedOperator/
>>
>> Regards,
>> Vikram
>>
>> On Thu, Jun 8, 2017 at 10:03 PM, Rick.Magnuson 
>> wrote:
>>
>>> Does anyone have any documentation on using Windowing with Apache Apex?
>>> I’ve checked out the documentation on the Apache Apex website but was
>>> hoping to find some additional resources. Also, if anyone has any examples
>>> outside of the word count example provided by Apex, I would greatly
>>> appreciate that too. Thanks,
>>>
>>>
>>>
>>> Rick Magnuson | Lead Data Engineer | Target | 1000 Nicollet Mall |
>>> Minneapolis, MN 55403
>>>
>>
>>
>


Re: trying to run Apex on Hadoop cluster

2017-06-06 Thread Thomas Weise
Claire,

There shouldn't be a need to run the pipeline like this, since the Apex
runner already has the support to launch via hadoop with the required
dependencies.

Can you please confirm that you are able to run the basic word count
example as shown here:

https://beam.apache.org/documentation/runners/apex/

Thanks,
Thomas





On Tue, Jun 6, 2017 at 5:07 PM, Claire Yuan 
wrote:

> Hi all,
>   I am the one trying to run the Apache Beam example on a cluster:
>   I used the following command with my given input in a folder named
> "harrypotter":
> *#!/bin/bash*
>
> *HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/tmp/beam/jars/*" hadoop jar
> /tmp/beam/jars/beam-examples-java-2.1.0-SNAPSHOT.jar
> org.apache.beam.examples.complete.TfIdf --runner=ApexRunner
> --embeddedExecution=false --output=apexrunnertfidf
> --input=/tmp/beam/harrypotter/*
>
> *java -cp /homes/org.apache.beam.examples.complete.TfIdf*
> --
>
> However, some configuration seems to go wrong:
>
> *Exception in thread "main" java.lang.RuntimeException: Failed to launch
> the application on YARN.*
> * at org.apache.beam.runners.apex.ApexRunner.run(ApexRunner.java:204)*
> * at org.apache.beam.runners.apex.ApexRunner.run(ApexRunner.java:82)*
> * at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)*
> * at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283)*
> * at org.apache.beam.examples.complete.TfIdf.main(TfIdf.java:442)*
> * at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)*
> * at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)*
> * at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)*
> * at java.lang.reflect.Method.invoke(Method.java:498)*
> * at org.apache.hadoop.util.RunJar.run(RunJar.java:234)*
> * at org.apache.hadoop.util.RunJar.main(RunJar.java:148)*
> *Caused by: java.io.FileNotFoundException:hadoop/client/dfs.include (No
> such file or directory)*
> * at java.io.FileInputStream.open0(Native Method)*
> * at java.io.FileInputStream.open(FileInputStream.java:195)*
> * at java.io.FileInputStream.(FileInputStream.java:138)*
> * at org.apache.commons.io
> .FileUtils.copyFile(FileUtils.java:1112)*
> * at
> org.apache.beam.runners.apex.ApexYarnLauncher$2.visitFile(ApexYarnLauncher.java:277)*
> * at
> org.apache.beam.runners.apex.ApexYarnLauncher$2.visitFile(ApexYarnLauncher.java:253)*
> * at java.nio.file.Files.walkFileTree(Files.java:2670)*
> * at java.nio.file.Files.walkFileTree(Files.java:2742)*
> * at
> org.apache.beam.runners.apex.ApexYarnLauncher.createJar(ApexYarnLauncher.java:253)*
> * at
> org.apache.beam.runners.apex.ApexYarnLauncher.launchApp(ApexYarnLauncher.java:90)*
> * at org.apache.beam.runners.apex.ApexRunner.run(ApexRunner.java:201)*
>
> I checked the hadoop/client/ folder and found that dfs.include
> actually exists.
> Could any of you suggest a solution to this?
>
> Claire
>


Re: To run Apex runner of Apache Beam

2017-06-02 Thread Thomas Weise
It may also be helpful to enable logging to see if there are any exceptions
in the execution. The console log only contains output from Maven. Is
logging turned off (or a suitable slf4j backend missing)?


On Fri, Jun 2, 2017 at 5:00 PM, Thomas Weise <t...@apache.org> wrote:

> Hi Claire,
>
> The stack traces suggest that the application was launched and the Apex
> DAG is deployed and might be running (it runs in embedded mode).
>
> Do you see any output? In case it writes to files, anything in output or
> output staging directories?
>
> Thanks,
> Thomas
>
>
> On Thu, Jun 1, 2017 at 3:18 PM, Claire Yuan <clairey...@yahoo-inc.com>
> wrote:
>
>> Hi Vlad,
>>Thanks for replying! Here attached my log and stack trace
>>
>>
>> On Thursday, June 1, 2017 1:31 PM, Vlad Rozov <v.ro...@datatorrent.com>
>> wrote:
>>
>>
>> Hi Claire,
>>
>> Can you provide maven logs? It may also help to obtain stack trace. In a
>> separate terminal window, run "jps -lv" and look for the jvm process id
>> that executes org.apache.beam.examples.TfIdf. Use jstack  to
>> get the stack trace.
>>
>> Thank you,
>>
>> Vlad
>>
>> On 6/1/17 13:00, Claire Yuan wrote:
>>
>> Hi,
>>   I am using the Apex runner to execute Apache Beam example pipeline.
>> However my terminal always get frozen when running the command:
>>
>> mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.TfIdf \
>>  -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" 
>> -Papex-runner
>>
>>   Would anyone have solution for this?
>>
>> Claire
>>
>>
>>
>>
>>
>


Re: To run Apex runner of Apache Beam

2017-06-02 Thread Thomas Weise
Hi Claire,

The stack traces suggest that the application was launched and the Apex DAG
is deployed and might be running (it runs in embedded mode).

Do you see any output? In case it writes to files, anything in output or
output staging directories?

Thanks,
Thomas


On Thu, Jun 1, 2017 at 3:18 PM, Claire Yuan 
wrote:

> Hi Vlad,
>Thanks for replying! Here attached my log and stack trace
>
>
> On Thursday, June 1, 2017 1:31 PM, Vlad Rozov 
> wrote:
>
>
> Hi Claire,
>
> Can you provide maven logs? It may also help to obtain stack trace. In a
> separate terminal window, run "jps -lv" and look for the jvm process id
> that executes org.apache.beam.examples.TfIdf. Use jstack  to
> get the stack trace.
>
> Thank you,
>
> Vlad
>
> On 6/1/17 13:00, Claire Yuan wrote:
>
> Hi,
>   I am using the Apex runner to execute Apache Beam example pipeline.
> However my terminal always get frozen when running the command:
>
> mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.TfIdf \
>  -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" 
> -Papex-runner
>
>   Would anyone have solution for this?
>
> Claire
>
>
>
>
>


One Year Anniversary of Apache Apex

2017-04-25 Thread Thomas Weise
Hi,

It's been one year for Apache Apex as a top-level project, congratulations to
the community!

I wrote this blog to reflect and look ahead:

http://www.atrato.io/blog/2017/04/25/one-year-apex/

Your comments and suggestions are welcome.

Thanks,
Thomas


Re: set attribute of type Map in properties.xml

2017-04-14 Thread Thomas Weise
Ram,

Is there a reason why this is not reflected here:

http://apex.apache.org/docs/apex/application_packages/

Thanks,
Thomas




On Wed, Apr 12, 2017 at 5:35 PM, Munagala Ramanath 
wrote:

> http://docs.datatorrent.com/application_packages/
>
> Please take a look at the "Operator properties" section.
>
> Ram
>
> On Wed, Apr 12, 2017 at 4:17 PM, Sunil Parmar 
> wrote:
>
>> We have a use case where we want to set an operator property of type
>> Map<String, Long> in the properties xml. Is there a way to do this?
>>
>> Thanks,
>> Sunil
>>
>
>
>
> --
>
> ___
>
> Munagala V. Ramanath
>
> Software Engineer
>
> E: r...@datatorrent.com | M: (408) 331-5034 | Twitter: @UnknownRam
>
> www.datatorrent.com  |  apex.apache.org
>
>
>


[ANNOUNCE] Apache Apex Malhar 3.7.0 released

2017-04-05 Thread Thomas Weise
Dear Community,

The Apache Apex community is pleased to announce release 3.7.0 of the Apex
Malhar library.

Apache Apex is an enterprise grade big data-in-motion platform that unifies
stream and batch processing. Apex was built for scalability and low-latency
processing, high availability and operability. The Apex engine is
supplemented by Malhar, the library of pre-built operators, including
connectors that integrate with many existing technologies as sources and
destinations, like message buses, databases, files or social media feeds.

This release adds a number of new operators, including S3 line reader
(parallel block based), S3 output for tuples and file copy, fixed length
parser, Redshift output, and more accumulations for the windowed operator.
The release also adds several new user documentation sections.

See the full release notes for all changes in this release.

Apex provides features that similar platforms currently don’t offer, such
as fine grained, incremental recovery to only reset the portion of a
topology that is affected by a failure, support for elastic scaling based
on the ability to acquire (and release) resources as needed as well as the
ability to alter topology and operator properties on running applications.

Apex early on brought the combination of high throughput, low latency and
fault tolerance with strong processing guarantees to the stream data
processing space and gained maturity through important production use cases
at several organizations. See the powered by page and resources on the
project web site for more information:

http://apex.apache.org/powered-by-apex.html
http://apex.apache.org/docs.html

An easy way to get started with Apex is to pick one of the examples as
starting point. They cover many common and recurring tasks, such as data
consumption from different sources, output to various sinks, partitioning
and fault tolerance:

https://github.com/apache/apex-malhar/tree/master/examples/

Apex Malhar and Core (the engine) are separate repositories and releases.
We expect more frequent releases of Malhar to roll out new connectors and
other operators based on a stable engine API. This release 3.7.0 works on
existing Apex Core 3.4.0. Users only need to upgrade the library Maven
dependency in their application project.

The source release can be found at:

http://apex.apache.org/downloads.html

We welcome your help and feedback. For more information on the project and
how to get involved, visit our website at:

http://apex.apache.org/

Regards,
The Apache Apex community


Re: 3.5.0 apex core build failing with hadoop 2.7.3 dependency

2017-03-23 Thread Thomas Weise
Bummer. Thanks Ram, I was just going to look at it. The binaries work on
2.7.x though. Will the issue only show in secure mode?

Please create a JIRA in APEXCORE.

Chinmay, you may need to apply a patch as part of your build to be able to
use the 3.5.0 sources.

Thomas
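
For reference, a sketch of the kind of one-line patch involved, based on the
rename Ram describes below (the surrounding method is made up; only the
constant names come from YarnConfiguration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TokenLifetimeSketch
{
  // Hadoop <= 2.6: YarnConfiguration.DELEGATION_TOKEN_MAX_LIFETIME_KEY / _DEFAULT
  // Hadoop >= 2.7: YarnConfiguration.RM_DELEGATION_TOKEN_MAX_LIFETIME_KEY / _DEFAULT
  public static long maxLifetime(Configuration conf)
  {
    return conf.getLong(YarnConfiguration.RM_DELEGATION_TOKEN_MAX_LIFETIME_KEY,
        YarnConfiguration.RM_DELEGATION_TOKEN_MAX_LIFETIME_DEFAULT);
  }
}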


On Thu, Mar 23, 2017 at 7:15 AM, Munagala Ramanath 
wrote:

> Looks like this was a breaking change introduced in Hadoop 2.7.0:
>
> In https://hadoop.apache.org/docs/r2.6.5/api/org/apache/hadoop/yarn/conf/
> YarnConfiguration.html we have:
>
> static long DELEGATION_TOKEN_MAX_LIFETIME_DEFAULT
> static String DELEGATION_TOKEN_MAX_LIFETIME_KEY
>
> But in https://hadoop.apache.org/docs/r2.7.0/api/org/apache/
> hadoop/yarn/conf/YarnConfiguration.html we have:
>
> static long RM_DELEGATION_TOKEN_MAX_LIFETIME_DEFAULT
> static String RM_DELEGATION_TOKEN_MAX_LIFETIME_KEY
>
> Ram
>
> On Wed, Mar 22, 2017 at 11:59 PM, Chinmay Kolhatkar 
> wrote:
>
>> Hi All,
>>
>> I want to clarify whether my understanding is correct here.
>>
>> 1. I downloaded apache-apex-core-3.5.0-source-release.tar.gz
>> 2. Extracted the tar
>> 3. Ran following command to build apex core:
>> mvn clean package -DskipTests -Dhadoop.version=2.7.3
>> (NOTE: I have overridden hadoop version to 2.7.3)
>>
>> I get following compilation error:
>> [ERROR] Failed to execute goal 
>> org.apache.maven.plugins:maven-compiler-plugin:3.3:compile
>> (default-compile) on project apex-engine: Compilation failure: Compilation
>> failure:
>> [ERROR] /home/chinmay/files/apache-apex-core-3.5.0/engine/src/main/
>> java/com/datatorrent/stram/plan/logical/LogicalPlan.java:[159,87] cannot
>> find symbol
>> [ERROR] symbol:   variable DELEGATION_TOKEN_MAX_LIFETIME_DEFAULT
>> [ERROR] location: class org.apache.hadoop.yarn.conf.YarnConfiguration
>> [ERROR] /home/chinmay/files/apache-apex-core-3.5.0/engine/src/main/
>> java/com/datatorrent/stram/client/StramAppLauncher.java:[586,120] cannot
>> find symbol
>> [ERROR] symbol:   variable DELEGATION_TOKEN_MAX_LIFETIME_KEY
>> [ERROR] location: class org.apache.hadoop.yarn.conf.YarnConfiguration
>> [ERROR] /home/chinmay/files/apache-apex-core-3.5.0/engine/src/main/
>> java/com/datatorrent/stram/client/StramAppLauncher.java:[586,173] cannot
>> find symbol
>> [ERROR] symbol:   variable DELEGATION_TOKEN_MAX_LIFETIME_DEFAULT
>> [ERROR] location: class org.apache.hadoop.yarn.conf.YarnConfiguration
>> [ERROR] -> [Help 1]
>>
>> What should be the expected result while building apex-core when building
>> with different version of hadoop?
>> Is this error expected OR something wrong in my environment?
>>
>> Thanks,
>> Chinmay.
>>
>>
>
>
> --
>
> ___
>
> Munagala V. Ramanath
>
> Software Engineer
>
> E: r...@datatorrent.com | M: (408) 331-5034 | Twitter: @UnknownRam
>
> www.datatorrent.com  |  apex.apache.org
>
>
>


Re: Apex installation instructions

2017-03-22 Thread Thomas Weise
Hi,

Those instructions are indeed for running in a dev environment only. There
are a few other download options:

http://apex.apache.org/downloads.html

The Bigtop binaries can be used, but they are designed to also install
Hadoop.

I believe you are looking to just install the Apex CLI on an edge node?
There wasn't a binary package available for that purpose so far, but now
there is:

https://github.com/atrato/apex-cli-package/releases/tag/v3.5.0

Give it a try and let me know how it works. It can probably at some point
become part of the official release also.

Thanks,
Thomas



On Wed, Mar 22, 2017 at 5:51 PM, Mohammad Kargar  wrote:

> That works fine in a dev environment. However, in a test cluster
> environment (where the Apex script was copied manually) it's failing to find
> configurations.
>
> On Mar 22, 2017 4:54 PM, "Munagala Ramanath"  wrote:
>
>> Please take a look at: http://apex.apache.org/docs.html
>> The Beginner's Guide is a good place to start. Briefly stated, you'll
>> need to build your application package
>> using maven and deploy it using the commandline tool "apex" that is in
>> the apex-core repository.
>>
>> Ram
>>
>> On Wed, Mar 22, 2017 at 4:19 PM, Mohammad Kargar 
>> wrote:
>>
>>> Hi,
>>>
>>> Is there any instructions for installing open source Apex (no
>>> DataTorrent edition) in a production environment? Haven't had any luck
>>> searching online documents here 
>>> .
>>>
>>> Thanks,
>>>
>>
>>
>>
>> --
>>
>> ___
>>
>> Munagala V. Ramanath
>>
>> Software Engineer
>>
>> E: r...@datatorrent.com | M: (408) 331-5034 | Twitter: @UnknownRam
>>
>> www.datatorrent.com  |  apex.apache.org
>>
>>
>>


Re: Need suggestion about temporary file usage on container node

2017-03-18 Thread Thomas Weise
The localized files will be in the YARN containers working directory and
JVMs classpath, so as long as you know the original name, you will be able
to access via Class.getResourceAsStream or Class.getResource

Thomas
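
For example, with the original file name lookup.csv passed via the FILES
attribute (the name is a placeholder), something along these lines should work
inside the operator:

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class LocalizedFileSketch
{
  // "lookup.csv" is a placeholder for the original file name passed via FILES / -files.
  public static void readLocalizedFile() throws Exception
  {
    // Option 1: via the classpath, as suggested above.
    try (InputStream in = LocalizedFileSketch.class.getResourceAsStream("/lookup.csv");
         BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // process the line
      }
    }

    // Option 2: directly from the YARN container working directory.
    File localized = new File("lookup.csv");
    System.out.println("exists in working dir: " + localized.exists());
  }
}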

On Fri, Mar 17, 2017 at 8:54 AM, Vikram Patil <vik...@datatorrent.com>
wrote:

> Thanks Thomas. Could you suggest a way so that I can figure out the path
> to a particular localized file on the container node, preferably using
> the OperatorContext?
>
>
> On Fri, Mar 17, 2017 at 8:06 PM, Thomas Weise <t...@apache.org> wrote:
>
>> If you use LIB_JARS or FILES, then those files will be localized by YARN
>> on the container node, you don't need to manually copy them from HDFS or
>> write cleanup code for it.
>>
>> Thomas
>>
>> On Fri, Mar 17, 2017 at 12:49 AM, vikram patil <patilvik...@gmail.com>
>> wrote:
>>
>>> Hello All,
>>>
>>> I am working on an operator which would download a file from HDFS and
>>> store it on the local machine at run time. For storing files to HDFS
>>> from the client machine I will be using the LIB_JARS or FILES configuration. Once
>>> the operator fails/shuts down, I would like to clean up these files
>>> automatically if possible.
>>>
>>> Thanks & Regards,
>>> Vikram
>>>
>>
>>
>


Re: JavaDocs for Apex Core now available

2017-03-17 Thread Thomas Weise
Yay! Will update the web site soon.

--
sent from mobile
https://ci.apache.org/projects/apex-core/apex-core-
javadoc-release-3.4/index.html
https://ci.apache.org/projects/apex-core/apex-core-
javadoc-release-3.5/index.html

... at the above links.

Ram
-- 

___

Munagala V. Ramanath

Software Engineer

E: r...@datatorrent.com | M: (408) 331-5034 | Twitter: @UnknownRam

www.datatorrent.com  |  apex.apache.org


Re: Need suggestion about temporary file usage on container node

2017-03-17 Thread Thomas Weise
If you use LIB_JARS or FILES, then those files will be localized by YARN on
the container node, you don't need to manually copy them from HDFS or write
cleanup code for it.

Thomas

On Fri, Mar 17, 2017 at 12:49 AM, vikram patil 
wrote:

> Hello All,
>
> I am working on an operator which would download a file from HDFS and store
> it on the local machine at run time. For storing files to HDFS from the
> client machine I will be using the LIB_JARS or FILES configuration. Once
> the operator fails/shuts down, I would like to clean up these files
> automatically if possible.
>
> Thanks & Regards,
> Vikram
>


Re: One-time Initialization of in-memory data using a data file

2017-01-23 Thread Thomas Weise
First link without frame:

https://ci.apache.org/projects/apex-malhar/apex-malhar-javadoc-release-3.6/org/apache/apex/malhar/lib/state/managed/AbstractManagedStateImpl.html


On Mon, Jan 23, 2017 at 3:33 PM, Thomas Weise <t...@apache.org> wrote:
> Roger,
>
> An Apex operator typically holds state that it uses for processing and
> often that state is mutable. For large state: "Managed state" in
> Malhar (and its predecessor HDHT) were designed for large state that
> can be mutated efficiently under a specific write pattern (semi
> ordered keys). However, there is no benefit of using these for
> immutable data that is already in HDFS.
>
> In such case it would be best to store them (during migration/ingest)
> in HDFS a file format that allows for fast random reads (block
> structured files like HFile or TFile or any other indexed structure
> provide that).
>
> Also, depending on how the data, once in memory, would be used, an
> Apex operator may or may not be the right home. If the goal is to only
> lookup data without further processing with a synchronous
> request/response pattern, then an IMDG or similar system may be a more
> appropriate solution.
>
> Here are pointers for managed state:
>
> https://ci.apache.org/projects/apex-malhar/apex-malhar-javadoc-release-3.6/index.html
> https://github.com/apache/apex-malhar/blob/master/benchmark/src/main/java/com/datatorrent/benchmark/state/ManagedStateBenchmarkApp.java
>
> Thanks,
> Thomas
>
>
> On Sun, Jan 22, 2017 at 11:43 PM, Ashwin Chandra Putta
> <ashwinchand...@gmail.com> wrote:
>> Roger,
>>
>> Depending on certain requirements such as expected latency, size of data, etc.,
>> the operator's design will change.
>>
>> If latency needs to be lowest possible, meaning completely in-memory and not
>> hitting the disk for read I/O, there are two scenarios
>> 1. If the lookup data size is small --> just load to memory in the setup
>> call, switch off checkpointing to get rid of checkpoint I/O latency in
>> between. In case of operator restarts, the data should be reloaded in setup.
>> 2. If the lookup data is large --> have many partitions of this operator to
>> minimize the footprint of each partition. Still switch off checkpointing and
>> reload in setup in case of operator restart. Having many partitions will
>> ensure that the setup load is fast. The incoming query needs to be
>> partitioned based on the lookup key.
>>
>> You can use the PojoEnricher with FSLoader for above design.
>>
>> Code:
>> https://github.com/apache/apex-malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/enrich/POJOEnricher.java
>> Example:
>> https://github.com/DataTorrent/examples/tree/master/tutorials/enricher
>>
>> In case of large lookup dataset and latency caused by disk read I/O is fine,
>> then use HDHT or managed state as a backup mechanism for the in-memory data
>> to decrease the checkpoint footprint. I could not find example for managed
>> state but here are the links for HDHT..
>>
>> Code:
>> https://github.com/DataTorrent/Megh/tree/master/contrib/src/main/java/com/datatorrent/contrib/hdht
>> Example:
>> https://github.com/DataTorrent/examples/blob/master/tutorials/hdht/src/test/java/com/example/HDHTAppTest.java
>>
>> Regards,
>> Ashwin.
>>
>> On Sun, Jan 22, 2017 at 10:45 PM, Sanjay Pujare <san...@datatorrent.com>
>> wrote:
>>>
>>> You may want to take a look at com.datatorrent.lib.fileaccess.DTFileReader
>>> in the malhar-library – not sure whether it gives you reading the whole file
>>> into memory.
>>>
>>>
>>>
>>> Also there is a library called Megh at https://github.com/DataTorrent/Megh
>>> where you might find some useful operators like
>>> com.datatorrent.contrib.hdht.hfile.HFileImpl .
>>>
>>>
>>>
>>> From: Roger F <rf301...@gmail.com>
>>> Reply-To: <users@apex.apache.org>
>>> Date: Sunday, January 22, 2017 at 9:32 PM
>>> To: <users@apex.apache.org>
>>> Subject: One-time Initialization of in-memory data using a data file
>>>
>>>
>>>
>>> Hi,
>>>
>>> I have a use case where application business data needs migrated from a
>>> legacy system (such as mainframe) into HDFS and then loaded for use by an
>>> Apex application.
>>>
>>> To get this done, an approach that is being considered to perform one-time
>>> initialization of the data from the HDFS into application memory. This data
>>> will then be queried for various business logic functions of the
>>> application.
>>>
>>> Once the data is loaded, this operator/module (?) should no longer perform
>>> any further function except for acting as a master of this data and then
>>> supporting operations to query the data (via a key).
>>>
>>> Any pointers to how this can be done ? I was looking for an operator or
>>> any other entity which can load this data at startup (Activation or Setup)
>>> and then allow queries to be submitted to it via an input port.
>>>
>>>
>>>
>>> -R
>>
>>
>>
>>
>> --
>>
>> Regards,
>> Ashwin.


Re: One-time Initialization of in-memory data using a data file

2017-01-23 Thread Thomas Weise
Roger,

An Apex operator typically holds state that it uses for processing and
often that state is mutable. For large state: "Managed state" in
Malhar (and its predecessor HDHT) was designed for large state that
can be mutated efficiently under a specific write pattern (semi-ordered
keys). However, there is no benefit in using these for
immutable data that is already in HDFS.

In such a case it would be best to store the data (during migration/ingest)
in HDFS in a file format that allows for fast random reads (block-structured
files like HFile or TFile, or any other indexed structure, provide that).

Also, depending on how the data, once in memory, would be used, an
Apex operator may or may not be the right home. If the goal is to only
lookup data without further processing with a synchronous
request/response pattern, then an IMDG or similar system may be a more
appropriate solution.

Here are pointers for managed state:

https://ci.apache.org/projects/apex-malhar/apex-malhar-javadoc-release-3.6/index.html
https://github.com/apache/apex-malhar/blob/master/benchmark/src/main/java/com/datatorrent/benchmark/state/ManagedStateBenchmarkApp.java

Thanks,
Thomas
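
As a rough sketch of the "load once at setup, then serve lookups via a port"
operator described in the original question (the HDFS path, key/value parsing,
and port types are placeholders; for production use see the enricher and
managed state pointers in this thread):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

public class LookupSketchOperator extends BaseOperator
{
  private String hdfsPath = "/data/migrated/lookup.csv";   // placeholder
  private transient Map<String, String> lookup;

  public final transient DefaultOutputPort<String> result = new DefaultOutputPort<>();

  public final transient DefaultInputPort<String> query = new DefaultInputPort<String>()
  {
    @Override
    public void process(String key)
    {
      result.emit(lookup.get(key));
    }
  };

  @Override
  public void setup(OperatorContext context)
  {
    // one-time load of the migrated data into memory
    lookup = new HashMap<>();
    try (FileSystem fs = FileSystem.newInstance(new Configuration());
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path(hdfsPath)), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split(",", 2);
        lookup.put(parts[0], parts[1]);
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public void setHdfsPath(String hdfsPath)
  {
    this.hdfsPath = hdfsPath;
  }
}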


On Sun, Jan 22, 2017 at 11:43 PM, Ashwin Chandra Putta
 wrote:
> Roger,
>
> Depending on certain requirements such as expected latency, size of data, etc.,
> the operator's design will change.
>
> If latency needs to be lowest possible, meaning completely in-memory and not
> hitting the disk for read I/O, there are two scenarios
> 1. If the lookup data size is small --> just load to memory in the setup
> call, switch off checkpointing to get rid of checkpoint I/O latency in
> between. In case of operator restarts, the data should be reloaded in setup.
> 2. If the lookup data is large --> have many partitions of this operator to
> minimize the footprint of each partition. Still switch off checkpointing and
> reload in setup in case of operator restart. Having many partitions will
> ensure that the setup load is fast. The incoming query needs to be
> partitioned based on the lookup key.
>
> You can use the PojoEnricher with FSLoader for above design.
>
> Code:
> https://github.com/apache/apex-malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/enrich/POJOEnricher.java
> Example:
> https://github.com/DataTorrent/examples/tree/master/tutorials/enricher
>
> In case of large lookup dataset and latency caused by disk read I/O is fine,
> then use HDHT or managed state as a backup mechanism for the in-memory data
> to decrease the checkpoint footprint. I could not find example for managed
> state but here are the links for HDHT..
>
> Code:
> https://github.com/DataTorrent/Megh/tree/master/contrib/src/main/java/com/datatorrent/contrib/hdht
> Example:
> https://github.com/DataTorrent/examples/blob/master/tutorials/hdht/src/test/java/com/example/HDHTAppTest.java
>
> Regards,
> Ashwin.
>
> On Sun, Jan 22, 2017 at 10:45 PM, Sanjay Pujare 
> wrote:
>>
>> You may want to take a look at com.datatorrent.lib.fileaccess.DTFileReader
>> in the malhar-library – not sure whether it gives you reading the whole file
>> into memory.
>>
>>
>>
>> Also there is a library called Megh at https://github.com/DataTorrent/Megh
>> where you might find some useful operators like
>> com.datatorrent.contrib.hdht.hfile.HFileImpl .
>>
>>
>>
>> From: Roger F 
>> Reply-To: 
>> Date: Sunday, January 22, 2017 at 9:32 PM
>> To: 
>> Subject: One-time Initialization of in-memory data using a data file
>>
>>
>>
>> Hi,
>>
>> I have a use case where application business data needs migrated from a
>> legacy system (such as mainframe) into HDFS and then loaded for use by an
>> Apex application.
>>
>> To get this done, an approach that is being considered to perform one-time
>> initialization of the data from the HDFS into application memory. This data
>> will then be queried for various business logic functions of the
>> application.
>>
>> Once the data is loaded, this operator/module (?) should no longer perform
>> any further function except for acting as a master of this data and then
>> supporting operations to query the data (via a key).
>>
>> Any pointers to how this can be done ? I was looking for an operator or
>> any other entity which can load this data at startup (Activation or Setup)
>> and then allow queries to be submitted to it via an input port.
>>
>>
>>
>> -R
>
>
>
>
> --
>
> Regards,
> Ashwin.


[ANNOUNCE] Apache Apex Core 3.5.0 released

2016-12-19 Thread Thomas Weise
Dear Community,

The Apache Apex community is pleased to announce release 3.5.0 of Apex Core
(the engine).

Apache Apex is an enterprise grade big data-in-motion platform that unifies
stream and batch processing. Apex was built for scalability and low-latency
processing, high availability and operability.

This release upgrades the Apache Hadoop YARN dependency from 2.2 to 2.6.
The community determined that current users run on versions equal or higher
than 2.6 and Apex can now take advantage of more recent capabilities of
YARN. The release contains a number of important bug fixes and operability
improvements.

See the full release notes for all changes in this release.

Apex provides features that similar platforms currently don’t offer, such
as fine grained, incremental recovery to only reset the portion of a
topology that is affected by a failure, support for elastic scaling based
on the ability to acquire (and release) resources as needed as well as the
ability to alter topology and operator properties on running applications.

Apex has been developed since 2012 and became an ASF top-level project earlier
this year, following 8 months of incubation. Apex early on brought the
combination of high throughput, low latency and fault tolerance with strong
processing guarantees to the stream data processing space and gained
maturity through important production use cases at several organizations.
See the powered by page and resources on the project web site for more
information:

http://apex.apache.org/powered-by-apex.html
http://apex.apache.org/docs.html

The Apex engine is supplemented by Malhar, the library of pre-built
operators, including adapters that integrate with many existing
technologies as sources and destinations, like message buses, databases,
files or social media feeds.

An easy way to get started with Apex is to pick one of the examples as
starting point. They cover many common and recurring tasks, such as data
consumption from different sources, output to various sinks, partitioning
and fault tolerance:

https://github.com/DataTorrent/examples/tree/master/tutorials

Apex Malhar and Core (the engine) are separate repositories and releases.
We expect more frequent releases of Malhar to roll out new connectors and
other operators based on a stable engine API. Users only need to upgrade the
Apex Core Maven dependency in their project to move to this release.

The source release can be found at:

http://apex.apache.org/downloads.html

We welcome your help and feedback. For more information on the project and
how to get involved, visit our website at:

http://apex.apache.org/

Regards,
The Apache Apex community


[ANNOUNCE] Apache Apex Malhar 3.6.0 released

2016-12-08 Thread Thomas Weise
Dear Community,

The Apache Apex community is pleased to announce release 3.6.0 of the
Malhar library. The release resolved 70 JIRAs.

The release adds the first iteration of SQL support via Apache Calcite.
Features include SELECT, INSERT, INNER JOIN with non-empty equi-join
condition, WHERE clause, scalar functions that are implemented in Calcite,
and custom scalar functions. The endpoint can be a file, Kafka, or an internal
streaming port for both input and output. CSV format is implemented for both
input and output. See the examples for usage of the new API.

The windowed state management has been improved (WindowedOperator). There
is now an option to use spillable data structures for the state storage.
This enables the operator to store large states and perform efficient
checkpointing.

We also benchmarked the WindowedOperator with the spillable data
structures. Based on our findings, we greatly improved how
objects are serialized and considerably reduced garbage collection in the
Managed State layer. Work is still in progress on purging state that is
no longer needed and on further improving the performance of Managed State,
which the spillable data structures depend on. More information about the
windowing support can be found here.

This release also adds a new, alternative Cassandra output operator
(non-transactional, upsert based) and support for fixed length file format
to the enrichment operator.

The user documentation has been
expanded to cover more operators. See https://s.apache.org/9b0t for other
enhancements and fixes in this release.

Apache Apex is an enterprise grade native YARN big data-in-motion platform
that unifies stream and batch processing. Apex was built for scalability
and low-latency processing, high availability and operability.

Apex provides features that similar platforms currently don’t offer, such
as fine-grained, incremental recovery that resets only the portion of a
topology affected by a failure, support for elastic scaling based on the
ability to acquire (and release) resources as needed, and the ability to
alter the topology and operator properties of running applications.

Apex has been developed since 2012 and became an ASF top-level project
earlier this year, following 8 months of incubation. Early on, Apex brought
the combination of high throughput, low latency, and fault tolerance with
strong processing guarantees to the stream data processing space, and it
gained maturity through important production use cases at several
organizations.
See the powered by page and resources on the project web site for more
information:

http://apex.apache.org/powered-by-apex.html
http://apex.apache.org/docs.html

The Apex engine is supplemented by Malhar, the library of pre-built
operators, including adapters that integrate with many existing
technologies as sources and destinations, like message buses, databases,
files or social media feeds.

An easy way to get started with Apex is to pick one of the examples as a
starting point. They cover many common and recurring tasks, such as data
consumption from different sources, output to various sinks, partitioning
and fault tolerance:

https://github.com/DataTorrent/examples/tree/master/tutorials

Apex Malhar and Core (the engine) are separate repositories and releases.
We expect more frequent releases of Malhar to roll out new connectors and
other operators based on a stable engine API. This release 3.6.0 works on
existing Apex Core 3.4.0. Users only need to upgrade the Maven dependency
in their project.
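
For example, for an application that uses the Malhar library module, the
upgrade is just a version bump in the project's pom.xml (coordinates shown
for malhar-library; malhar-contrib and malhar-kafka follow the same
pattern):

{code}
<dependency>
  <groupId>org.apache.apex</groupId>
  <artifactId>malhar-library</artifactId>
  <version>3.6.0</version>
</dependency>
{code}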

The source release can be found at:

http://apex.apache.org/downloads.html

We welcome your help and feedback. For more information on the project and
how to get involved, visit our website at:

http://apex.apache.org/

Regards,
The Apache Apex community


Re: [EXTERNAL] Re: KafkaSinglePortInputOperator

2016-12-08 Thread Thomas Weise
If this was fixed, why is the JIRA still open?

On Wed, Dec 7, 2016 at 10:58 PM, Chaitanya Chebolu <
chaita...@datatorrent.com> wrote:

> There was a bug in Malhar-3.4.0 and it is fixed in Malhar-3.6.0.
> JIRA details of this issue are here.
>
> Regards,
> Chaitanya
>
> On Wed, Dec 7, 2016 at 11:24 PM, Sanjay Pujare 
> wrote:
>
>> Just out of curiosity – what was the problem? Why were the ssl.* project
>> properties not seen by the Kafka consumer?
>>
>>
>>
>> *From: *"Raja.Aravapalli" 
>> *Reply-To: *
>> *Date: *Wednesday, December 7, 2016 at 7:56 AM
>>
>> *To: *"users@apex.apache.org" 
>> *Subject: *Re: [EXTERNAL] Re: KafkaSinglePortInputOperator
>>
>>
>>
>>
>>
>> Thanks a tonnn for the support today Chaitanya!!
>>
>>
>>
>> We were able to successfully download messages from Kafka SSL Secured
>> Topics!!
>>
>>
>>
>> Thanks you very much!!
>>
>>
>>
>>
>>
>> Regards,
>>
>> Raja.
>>
>>
>>
>> *From: *Chaitanya Chebolu 
>> *Reply-To: *"users@apex.apache.org" 
>> *Date: *Wednesday, December 7, 2016 at 11:28 AM
>> *To: *"users@apex.apache.org" 
>> *Subject: *Re: [EXTERNAL] Re: KafkaSinglePortInputOperator
>>
>>
>>
>> Raja,
>>
>>
>>
>>   Issue is the SSL properties(ssl.*.*) are not reflected to Kafka
>> consumer.
>>
>>   Could you please share the complete project ?
>>
>>
>>
>> Thanks,
>>
>> Chaitanya
>>
>>
>>
>>
>>
>> On Wed, Dec 7, 2016 at 7:39 AM, Raja.Aravapalli <
>> raja.aravapa...@target.com> wrote:
>>
>> Hi Chaitanya,
>>
>>
>>
>> Any other thoughts on how I can fix this ??
>>
>>
>>
>> Does Apex not yet support SSL-secured topics?
>>
>>
>>
>>
>>
>> Thanks a lot.
>>
>>
>>
>> Regards,
>>
>> Raja.
>>
>>
>>
>> *From: *"Raja.Aravapalli" 
>> *Reply-To: *"users@apex.apache.org" 
>> *Date: *Tuesday, December 6, 2016 at 5:32 PM
>>
>>
>> *To: *"users@apex.apache.org" 
>> *Subject: *Re: [EXTERNAL] Re: KafkaSinglePortInputOperator
>>
>>
>>
>>
>>
>>
>>
>> I added the below line as said…  I cannot see any exceptions also
>>
>>
>>
>> Still nothing is happening :(
>>
>>
>>
>> I am not sure why the properties below are always showing as null… even though I
>> set them in my Application.java class!! Any help on how to set these
>> properties?
>>
>>
>>
>> ssl.keystore.location = null
>>
>> ssl.truststore.location = null
>>
>> ssl.keystore.password = null
>>
>>
>>
>>
>>
>> Thanks a lot in advance.
>>
>>
>>
>> Regards,
>>
>> Raja.
>>
>>
>>
>> *From: *Chaitanya Chebolu 
>> *Reply-To: *"users@apex.apache.org" 
>> *Date: *Tuesday, December 6, 2016 at 5:17 PM
>> *To: *"users@apex.apache.org" 
>> *Subject: *Re: [EXTERNAL] Re: KafkaSinglePortInputOperator
>>
>>
>>
>> Raja,
>>
>>
>>
>>Please set the consumerProps to the KafkaSinglePortInputOperator.
>>
>>Add the below line in your application:
>>
>>   KafkaSinglePortInputOperator in = dag.addOperator("in", new
>> KafkaSinglePortInputOperator());
>>
>>   --
>>
>>in.setConsumerProps(props);
>>
>>
>>
>>  Please let me know, if you are still facing issues.
>>
>>
>>
>> Regards,
>>
>> Chaitanya
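
For readers landing on this thread: with Malhar 3.6.0 (which carries the
fix mentioned above), the ssl.* settings are passed to the operator as
consumer properties, along the lines of the following sketch (placed inside
populateDAG; the keystore/truststore paths and passwords are placeholders):

{code}
Properties props = new Properties();
props.put("security.protocol", "SSL");
props.put("ssl.keystore.location", "/path/to/client.keystore.jks");      // placeholder
props.put("ssl.keystore.password", "keystore-password");                 // placeholder
props.put("ssl.truststore.location", "/path/to/client.truststore.jks");  // placeholder
props.put("ssl.truststore.password", "truststore-password");             // placeholder

KafkaSinglePortInputOperator in =
    dag.addOperator("kafkaInput", new KafkaSinglePortInputOperator());
in.setConsumerProps(props);
{code}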
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Dec 6, 2016 at 5:00 PM, Raja.Aravapalli <
>> raja.aravapa...@target.com> wrote:
>>
>>
>>
>> Find below the log I am observing:
>>
>>
>>
>> 2016-12-06 05:17:37,264 INFO  kafka.AbstractKafkaInputOperator
>> (AbstractKafkaInputOperator.java:initPartitioner(311)) - Initialize
>> Partitioner
>>
>> 2016-12-06 05:17:37,265 INFO  kafka.AbstractKafkaInputOperator
>> (AbstractKafkaInputOperator.java:initPartitioner(324)) - Actual
>> Partitioner is class org.apache.apex.malhar.kafka.OneToOnePartitioner
>>
>> 2016-12-06 05:17:37,280 INFO  consumer.ConsumerConfig
>> (AbstractConfig.java:logAll(165)) - ConsumerConfig values:
>>
>> metric.reporters = []
>>
>> metadata.max.age.ms = 30
>>
>> value.deserializer = class org.apache.kafka.common.serial
>> ization.ByteArrayDeserializer
>>
>> group.id = org.apache.apex.malhar.kafka.A
>> bstractKafkaInputOperatorMETA_GROUP
>>
>> partition.assignment.strategy =
>> [org.apache.kafka.clients.consumer.RangeAssignor]
>>
>> reconnect.backoff.ms = 50
>>
>> sasl.kerberos.ticket.renew.window.factor = 0.8
>>
>> max.partition.fetch.bytes = 1048576
>>
>> bootstrap.servers = [10.66.137.116:9093]
>>
>> retry.backoff.ms = 100
>>
>> sasl.kerberos.kinit.cmd = /usr/bin/kinit
>>
>> sasl.kerberos.service.name = null
>>
>> sasl.kerberos.ticket.renew.jitter = 0.05
>>
>> ssl.keystore.type = JKS
>>
>> ssl.trustmanager.algorithm = PKIX
>>
>> 

Re: Connection refused exception

2016-11-29 Thread Thomas Weise
There was a bug in the 3.5 release; this could be it. We fixed it and put
it into the 3.5.1 snapshot. The details are in JIRA, I don't have access to it
right now.

--
sent from mobile
On Nov 29, 2016 6:02 AM, "Feldkamp, Brandon (CONT)" <
brandon.feldk...@capitalone.com> wrote:

> This pops up a little while later. I’ll have to check to see if I can see
> anything before!
>
>
>
> 2016-11-24 21:51:02,443 [main] ERROR stram.StreamingAppMaster main -
> Exiting Application Master
>
> java.lang.NullPointerException
>
> at com.datatorrent.contrib.kafka.
> AbstractKafkaInputOperator.isPartitionRequired(AbstractKafkaInputOperator.
> java:804)
>
> at com.datatorrent.contrib.kafka.
> AbstractKafkaInputOperator.processStats(AbstractKafkaInputOperator.
> java:715)
>
> at com.datatorrent.stram.plan.physical.PhysicalPlan$
> StatsListenerProxy.processStats(PhysicalPlan.java:206)
>
> at com.datatorrent.stram.plan.physical.PhysicalPlan.
> onStatusUpdate(PhysicalPlan.java:1797)
>
> at com.datatorrent.stram.StreamingContainerManager.
> processEvents(StreamingContainerManager.java:1035)
>
> at com.datatorrent.stram.StreamingContainerManager.
> monitorHeartbeat(StreamingContainerManager.java:798)
>
> at com.datatorrent.stram.StreamingAppMasterService.
> execute(StreamingAppMasterService.java:1025)
>
> at com.datatorrent.stram.StreamingAppMasterService.run(
> StreamingAppMasterService.java:647)
>
> at com.datatorrent.stram.StreamingAppMaster.main(
> StreamingAppMaster.java:104)
>
>
>
> *From: *Pramod Immaneni 
> *Reply-To: *"users@apex.apache.org" 
> *Date: *Tuesday, November 29, 2016 at 8:57 AM
> *To: *"users@apex.apache.org" 
> *Subject: *Re: Connection refused exception
>
>
>
> Brandon,
>
>
>
> This is not related to kafkaInputOperator talking to Kafka. Do you see any
> other exceptions in the logs around this exception before or after?
>
>
>
> On Tue, Nov 29, 2016 at 6:54 AM, Feldkamp, Brandon (CONT) <
> brandon.feldk...@capitalone.com> wrote:
>
> Hello,
>
>
>
> I’m wondering if anyone has seen a similar stack trace before as there
> isn’t a lot of info provided. I’m wondering if the connection refused is
> from one operator to another or the kafkaInputOperator being unable to
> connect to kafka.
>
>
>
> Any ideas? Here’s the stacktrace:
>
>
>
> 2016-11-24 21:00:51,998 [ProcessWideEventLoop] ERROR netlet.AbstractClient
> handleException - Exception in event loop {id=ProcessWideEventLoop,
> head=7418, tail=7416, capacity=1024}
>
> java.net.ConnectException: Connection refused
>
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native
> Method)
>
> at sun.nio.ch.SocketChannelImpl.finishConnect(
> SocketChannelImpl.java:717)
>
> at com.datatorrent.netlet.DefaultEventLoop.
> handleSelectedKey(DefaultEventLoop.java:371)
>
> at com.datatorrent.netlet.OptimizedEventLoop$
> SelectedSelectionKeySet.forEach(OptimizedEventLoop.java:59)
>
> at com.datatorrent.netlet.OptimizedEventLoop.runEventLoop(
> OptimizedEventLoop.java:192)
>
> at com.datatorrent.netlet.OptimizedEventLoop.runEventLoop(
> OptimizedEventLoop.java:157)
>
> at com.datatorrent.netlet.DefaultEventLoop.run(
> DefaultEventLoop.java:156)
>
> at java.lang.Thread.run(Thread.java:745)
>
> 2016-11-24 21:01:01,839 [IPC Server handler 4 on 44453] ERROR 
> stram.StreamingContainerManager
> processOperatorFailure - Initiating container restart after operator
> failure PTOperator[id=1,name=kafkaInputOperator]
>
> 2016-11-24 21:01:01,898 [IPC Server handler 29 on 44453] ERROR 
> stram.StreamingContainerManager
> processOperatorFailure - Initiating container restart after operator
> failure PTOperator[id=7,name=kafkaInputOperator]
>
> 2016-11-24 21:01:32,991 [IPC Server handler 24 on 44453] ERROR 
> stram.StreamingContainerManager
> processOperatorFailure - Initiating container restart after operator
> failure PTOperator[id=13,name=kafkaInputOperator]
>
> 2016-11-24 21:01:44,189 [IPC Server handler 22 on 44453] ERROR 
> stram.StreamingContainerManager
> processOperatorFailure - Initiating container restart after operator
> failure PTOperator[id=10,name=kafkaInputOperator]
>
> 2016-11-24 21:01:44,604 [IPC Server handler 5 on 44453] ERROR 
> stram.StreamingContainerManager
> processOperatorFailure - Initiating container restart after operator
> failure PTOperator[id=3,name=kafkaInputOperator]
>
> 2016-11-24 21:01:44,744 [IPC Server handler 16 on 44453] ERROR 
> stram.StreamingContainerManager
> processOperatorFailure - Initiating container restart after operator
> failure PTOperator[id=12,name=kafkaInputOperator]
>
> Thanks!
>
> Brandon
>
>

Re: Example using Streaming SQL ?

2016-11-18 Thread Thomas Weise
Hi,

This is a new feature and the first cut will go into release 3.6. There are
examples here:

https://github.com/apache/apex-malhar/tree/master/demos/sql

Documentation isn't in place yet. Chinmay, do you have some more
information to share?

Thanks,
Thomas


On Fri, Nov 18, 2016 at 5:58 PM, Shazz zzahS  wrote:

> Hi,
>
> I understood that Calcite was integrated to Apex, is there any example
> which shows how to use Calcite Streaming SQL queries with Apex ?
> I took a look into https://github.com/DataTorrent/examples but without
> any luck
>
> Thanks !
>
> shazz
>


Re: [EXTERNAL] kafka

2016-11-01 Thread Thomas Weise
IMO this should be added here:

http://apex.apache.org/docs/malhar/operators/kafkaInputOperator/


On Tue, Nov 1, 2016 at 5:24 PM, hsy...@gmail.com  wrote:

> Hey Raja,
>
> The setup for secure kafka input operator is not easy. You can follow
> these steps.
> 1. Follow kafka document to setup your brokers properly (
> https://kafka.apache.org/090/documentation.html#security_overview)
> 2. You have to manually create the client JAAS file (
> https://kafka.apache.org/090/documentation.html#security_overview)
>a sample file would look like this:
>  KafkaClient {
>
> com.sun.security.auth.module.Krb5LoginModule required
> useKeyTab=true
> storeKey=true
> keyTab="/etc/security/keytabs/kafka_client.keytab"
> principal="kafka-clien...@example.com";
> };
>
> 3. On the operator you have to set the attribute JVM_OPTS
>  -Djava.security.auth.login.config=/etc/kafka/kafka_client_jaas.conf
>
> 4. On the operator you have to set the consumerProperties, for example
>
> set dt.operator.$your_operator_name.{consumerProps.security.protocol}  to
>  SASL or other security function you use
> set dt.operator.$your_operator_name.{consumerProps.sasl.
> kerberos.service.name} to kafka
>
>
> Hope this would help you!
>
>
> Regards,
> Siyuan
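
In application code, steps 3 and 4 above boil down to something like the
following (a hedged sketch; the security protocol value and the JAAS file
path are placeholders that depend on the broker setup, and imports are
omitted):

{code}
// Step 4: pass the security-related consumer properties to the operator.
Properties props = new Properties();
props.put("security.protocol", "SASL_PLAINTEXT");   // or SASL_SSL, per the broker configuration
props.put("sasl.kerberos.service.name", "kafka");

KafkaSinglePortInputOperator in =
    dag.addOperator("kafkaInput", new KafkaSinglePortInputOperator());
in.setConsumerProps(props);

// Step 3: the JAAS login configuration is supplied through the operator's JVM options
// attribute, e.g. -Djava.security.auth.login.config=/etc/kafka/kafka_client_jaas.conf
{code}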
>
>
>
> On Sun, Oct 30, 2016 at 10:56 PM, Raja.Aravapalli <
> raja.aravapa...@target.com> wrote:
>
>>
>>
>> Hi Team,
>>
>>
>>
>> Can someone pls help me with below requested information ?
>>
>>
>>
>> Does apache apex have any inbuilt kafka input operator to read from Kafka
>> 0.9 secure kafka topics?
>>
>>
>>
>> Thanks a lot.
>>
>>
>>
>> Regards,
>>
>> Raja.
>>
>>
>>
>> *From: *"Raja.Aravapalli" 
>> *Reply-To: *"users@apex.apache.org" 
>> *Date: *Monday, October 24, 2016 at 2:29 PM
>> *To: *"users@apex.apache.org" 
>> *Subject: *[EXTERNAL] kafka
>>
>>
>>
>>
>>
>> Hi,
>>
>>
>>
>> Do we have any Kafka operator readily available to consume messages from
>> secure kafka topics in kafka 0.9 clusters?
>>
>>
>>
>> Thanks a lot.
>>
>>
>>
>>
>>
>> Regards,
>>
>> Raja.
>>
>
>


Re: Datatorrent operator for Hbase

2016-10-20 Thread Thomas Weise
This may also help:

http://docs.datatorrent.com/troubleshooting/#hadoop-dependencies-conflicts


On Thu, Oct 20, 2016 at 11:39 AM, Thomas Weise <t...@apache.org> wrote:

> Please see the HBase dependency and its exclusions here:
>
> https://github.com/apache/apex-malhar/blob/master/contrib/pom.xml#L342
>
> Thanks,
> Thomas
>
> On Thu, Oct 20, 2016 at 9:07 AM, Jaspal Singh <jaspal.singh1...@gmail.com>
> wrote:
>
>> Team,
>>
>> While using the Hbase operator with Datatorrent application, we have
>> added hbase client dependency in pom.xml. Do we need to exclude transitive
>> hadoop dependencies using 'exclusion *' ?
>>
>> If we do that then HbaseConfiguration and Bytes methods are also getting
>> excluded and giving error in the application. Is there a way to fix it ??
>>
>>
>> Thanks!!
>>
>> On Thu, Oct 20, 2016 at 1:38 AM, Tushar Gosavi <tus...@datatorrent.com>
>> wrote:
>>
>>> Hi Jaspal,
>>>
>>> You can pass the store name through property file, like
>>>
>>> <property>
>>>   <name>dt.operator.HbaseOperatorName.store.tableName</name>
>>>   <value>{name of the table}</value>
>>> </property>
>>>
>>> In the code,  you can set the table name and other properties in
>>> constructor.
>>> {code}
>>> public static class Status2Hbase extends AbstractHBasePutOutputOperator<Status>
>>> {
>>>   public Status2Hbase()
>>>   {
>>>     super();
>>>     // store is initialized to HBaseStore
>>>     store.setTableName("nameofTable");
>>>   }
>>>
>>>   @Override
>>>   public Put operationPut(Status t)
>>>   {
>>>     Put put = new Put(ByteBuffer.allocate(8).putLong(t.getCreatedAt().getTime()).array());
>>>     put.add("cf".getBytes(), "text".getBytes(), t.getText().getBytes());
>>>     put.add("cf".getBytes(), "userid".getBytes(), t.getText().getBytes());
>>>     return put;
>>>   }
>>> }
>>> {code}
>>>
>>> - Tushar.
>>>
>>>
>>> On Thu, Oct 20, 2016 at 11:59 AM, Jaspal Singh
>>> <jaspal.singh1...@gmail.com> wrote:
>>> > Hi Thomas, Thanks for sharing this example code.
>>> >  Still I couldn't see where the hbase tablename is configured, it says
>>> in
>>> > description that it can be configured.
>>> >
>>> > Can you please highlight where it is specified ?
>>> >
>>> > Thanks!!
>>> >
>>> >
>>> > On Wednesday, October 19, 2016, Thomas Weise <t...@apache.org> wrote:
>>> >>
>>> >> Here is an example that uses HBase that may be helpful:
>>> >>
>>> >>
>>> >> https://github.com/apache/apex-malhar/blob/master/demos/twit
>>> ter/src/main/java/com/datatorrent/demos/twitter/TwitterDumpH
>>> BaseApplication.java
>>> >>
>>> >> Thomas
>>> >>
>>> >> On Wed, Oct 19, 2016 at 6:36 PM, Jaspal Singh <
>>> jaspal.singh1...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Where I need to set the table name. In property file or the
>>> application
>>> >>> code ?
>>> >>>
>>> >>>
>>> >>> Thanks!!
>>> >>>
>>> >>>
>>> >>> On Wednesday, October 19, 2016, Sanjay Pujare <
>>> san...@datatorrent.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Take a look at
>>> >>>> https://github.com/apache/apex-malhar/tree/master/contrib/sr
>>> c/main/java/com/datatorrent/contrib/hbase
>>> >>>> . There are multiple output operators there.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> You specify the table name using HBaseStore.setTableName
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> From: "Bandaru, Srinivas" <srinivas.band...@optum.com>
>>> >>>> Reply-To: <users@apex.apache.org>
>>> >>>> Date: Wednesday, October 19, 2016 at 3:09 PM
>>> >>>> To: "users@apex.apache.org" <users@apex.apache.org>
>>> >>>> Subject: Datatorrent operator for Hbase
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> I want to write the data from an operator to a hbase table.  Which
>>> >>>> operator I can use to write to  Hbase table?
>>> >>>>
>>> >>>> Also how to specify the Hbase table name?
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> Thanks,
>>> >>>>
>>> >>>> Srinivas Bandaru
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> This e-mail, including attachments, may include confidential and/or
>>> >>>> proprietary information, and may be used only by the person or
>>> entity
>>> >>>> to which it is addressed. If the reader of this e-mail is not the
>>> >>>> intended
>>> >>>> recipient or his or her authorized agent, the reader is hereby
>>> notified
>>> >>>> that any dissemination, distribution or copying of this e-mail is
>>> >>>> prohibited. If you have received this e-mail in error, please
>>> notify the
>>> >>>> sender by replying to this message and delete this e-mail
>>> immediately.
>>> >>
>>> >>
>>> >
>>>
>>
>>
>


Re: Datatorrent operator for Hbase

2016-10-20 Thread Thomas Weise
Please see the HBase dependency and its exclusions here:

https://github.com/apache/apex-malhar/blob/master/contrib/pom.xml#L342

Thanks,
Thomas

On Thu, Oct 20, 2016 at 9:07 AM, Jaspal Singh <jaspal.singh1...@gmail.com>
wrote:

> Team,
>
> While using the Hbase operator with Datatorrent application, we have added
> hbase client dependency in pom.xml. Do we need to exclude transitive hadoop
> dependencies using 'exclusion *' ?
>
> If we do that then HbaseConfiguration and Bytes methods are also getting
> excluded and giving error in the application. Is there a way to fix it ??
>
>
> Thanks!!
>
> On Thu, Oct 20, 2016 at 1:38 AM, Tushar Gosavi <tus...@datatorrent.com>
> wrote:
>
>> Hi Jaspal,
>>
>> You can pass the store name through property file, like
>>
>> <property>
>>   <name>dt.operator.HbaseOperatorName.store.tableName</name>
>>   <value>{name of the table}</value>
>> </property>
>>
>> In the code,  you can set the table name and other properties in
>> constructor.
>> {code}
>> public static class Status2Hbase extends AbstractHBasePutOutputOperator<Status>
>> {
>>   public Status2Hbase()
>>   {
>>     super();
>>     // store is initialized to HBaseStore
>>     store.setTableName("nameofTable");
>>   }
>>
>>   @Override
>>   public Put operationPut(Status t)
>>   {
>>     Put put = new Put(ByteBuffer.allocate(8).putLong(t.getCreatedAt().getTime()).array());
>>     put.add("cf".getBytes(), "text".getBytes(), t.getText().getBytes());
>>     put.add("cf".getBytes(), "userid".getBytes(), t.getText().getBytes());
>>     return put;
>>   }
>> }
>> {code}
>>
>> - Tushar.
>>
>>
>> On Thu, Oct 20, 2016 at 11:59 AM, Jaspal Singh
>> <jaspal.singh1...@gmail.com> wrote:
>> > Hi Thomas, Thanks for sharing this example code.
>> >  Still I couldn't see where the hbase tablename is configured, it says
>> in
>> > description that it can be configured.
>> >
>> > Can you please highlight where it is specified ?
>> >
>> > Thanks!!
>> >
>> >
>> > On Wednesday, October 19, 2016, Thomas Weise <t...@apache.org> wrote:
>> >>
>> >> Here is an example that uses HBase that may be helpful:
>> >>
>> >>
>> >> https://github.com/apache/apex-malhar/blob/master/demos/twit
>> ter/src/main/java/com/datatorrent/demos/twitter/TwitterDumpH
>> BaseApplication.java
>> >>
>> >> Thomas
>> >>
>> >> On Wed, Oct 19, 2016 at 6:36 PM, Jaspal Singh <
>> jaspal.singh1...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Where I need to set the table name. In property file or the
>> application
>> >>> code ?
>> >>>
>> >>>
>> >>> Thanks!!
>> >>>
>> >>>
>> >>> On Wednesday, October 19, 2016, Sanjay Pujare <san...@datatorrent.com
>> >
>> >>> wrote:
>> >>>>
>> >>>> Take a look at
>> >>>> https://github.com/apache/apex-malhar/tree/master/contrib/
>> src/main/java/com/datatorrent/contrib/hbase
>> >>>> . There are multiple output operators there.
>> >>>>
>> >>>>
>> >>>>
>> >>>> You specify the table name using HBaseStore.setTableName
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> From: "Bandaru, Srinivas" <srinivas.band...@optum.com>
>> >>>> Reply-To: <users@apex.apache.org>
>> >>>> Date: Wednesday, October 19, 2016 at 3:09 PM
>> >>>> To: "users@apex.apache.org" <users@apex.apache.org>
>> >>>> Subject: Datatorrent operator for Hbase
>> >>>>
>> >>>>
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I want to write the data from an operator to a hbase table.  Which
>> >>>> operator I can use to write to  Hbase table?
>> >>>>
>> >>>> Also how to specify the Hbase table name?
>> >>>>
>> >>>>
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> Srinivas Bandaru
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> This e-mail, including attachments, may include confidential and/or
>> >>>> proprietary information, and may be used only by the person or entity
>> >>>> to which it is addressed. If the reader of this e-mail is not the
>> >>>> intended
>> >>>> recipient or his or her authorized agent, the reader is hereby
>> notified
>> >>>> that any dissemination, distribution or copying of this e-mail is
>> >>>> prohibited. If you have received this e-mail in error, please notify
>> the
>> >>>> sender by replying to this message and delete this e-mail
>> immediately.
>> >>
>> >>
>> >
>>
>
>


Re: balanced of Stream Codec

2016-10-15 Thread Thomas Weise
Without knowing the operations that follow the non-deterministic
partitioning, assume that you cannot have exactly-once results because
processing won't be idempotent. If there are only stateless operations, then
it should be OK. If there are stateful operations (windowing with any form of
aggregation etc.), then key-based partitioning is needed.

Thomas


On Sat, Oct 15, 2016 at 8:39 PM, Amol Kekre  wrote:

>
> Sunil,
> Round robin in an internal operator could be used in exactly once writes
> to an external system for certain operations. I do not know what your business
> logic is, but in case it can be split into partitions and then unified (for
> example aggregates), you have a situation where you can use round robin and
> then unify the result. The result can then be fed into an output operator
> that handles exactly once semantics with external system like Cassandra
> etc. Apex engine will guarantee that all the tuples in a window are same
> for the logical operator, so for the operator that you partition by
> round-robin, the result post-unifier should be identical (aka effectively
> idempotent) even if each partition is not idempotent. Often "each partition
> be idempotent" is not useful for internal operators. The case where this
> would not work is if order of tuples based on category_name is important.
>
> Thks
> Amol
>
>
> On Sat, Oct 15, 2016 at 6:03 PM, Sandesh Hegde 
> wrote:
>
>> Round robin is not idempotent, so you can't have exactly once.
>>
>> On Sat, Oct 15, 2016 at 4:49 PM Munagala Ramanath 
>> wrote:
>>
>>> If you want round-robin distribution which will give you uniform load
>>> across all partitions you can use
>>> a StreamCodec like this (provided the number of partitions is known and
>>> static):
>>>
>>> public class CatagoryStreamCodec extends KryoSerializableStreamCodec {
>>>   private int n = 0;
>>>   @Override
>>>   public int getPartition(Object in) {
>>>     return n++ % nPartitions;  // nPartitions is the number of partitions
>>>   }
>>> }
>>>
>>> If you want certain category names to go to certain partitions, you can
>>> create that mapping
>>> within the StreamCodec (map category names to integers in the range
>>> *0..nPartitions-1*), and, for each tuple, lookup the category name in
>>> the map and return the corresponding value.
>>>
>>> Ram
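
A sketch of that second variant, mirroring the codec from the thread (the
category list and partition count are placeholders supplied by the
application, InputTuple is the application's tuple type, and java.util
imports are omitted):

{code}
public class CategoryMappingStreamCodec extends KryoSerializableStreamCodec<InputTuple>
{
  // Fixed category -> value mapping in the range 0..nPartitions-1, as suggested above.
  private final Map<String, Integer> categoryToPartition = new HashMap<>();

  public CategoryMappingStreamCodec(List<String> categories, int nPartitions)
  {
    for (int i = 0; i < categories.size(); i++) {
      categoryToPartition.put(categories.get(i), i % nPartitions);
    }
  }

  @Override
  public int getPartition(InputTuple in)
  {
    Integer p = categoryToPartition.get(in.getName());
    return p != null ? p : 0;  // unknown categories fall back to one partition
  }
}
{code}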
>>> 
>>>
>>> On Fri, Oct 14, 2016 at 1:17 PM, Sunil Parmar 
>>> wrote:
>>>
>>> We’re using a stream codec for consistent, parallel processing of the
>>> data across the operator partitions. Our requirement is to serialize
>>> processing of the data based on a particular tuple attribute, let’s call it
>>> ‘catagory_name’. In order to achieve parallel processing of different
>>> category names, we’ve written our stream codec as follows.
>>>
>>>public class CatagoryStreamCodec extends
>>> KryoSerializableStreamCodec {
>>>
>>> private static final long serialVersionUID = -687991492884005033L;
>>>
>>>
>>>
>>> @Override
>>>
>>> public int getPartition(Object in) {
>>>
>>> try {
>>>
>>> InputTuple tuple = (InputTuple) in;
>>>
>>> String partitionKehy = tuple.getName();
>>>
>>> if(partitionKehy != null) {
>>>
>>> return partitionKehy.hashCode();
>>>
>>> }
>>>
>>> }
>>>}
>>>
>>> It’s working as expected, but we observed inconsistent partitioning when
>>> we ran this in a production environment with 20 partitions of the operator
>>> following the codec in the DAG.
>>>
>>>- Some operator instances didn’t process any data
>>>- Some operator instances processed as many tuples as everybody
>>>else combined
>>>
>>>
>>> Questions:
>>>
>>>- Is the getPartition method supposed to return the actual partition or
>>>just some lower bits used for deciding the partition?
>>>- The number of partitions is known in the application properties and can
>>>vary between deployments or environments. Is it best practice to use that
>>>property in the stream codec?
>>>- Is there any recommended hash function for getting consistent variation in
>>>the lower bits with a limited variety of data? We have ~100+ categories and I’m
>>>thinking of having 10+ operator partitions.
>>>
>>>
>>> Thanks,
>>> Sunil
>>>
>>>
>


Re: DT Fault Tolerance UHG

2016-10-13 Thread Thomas Weise
If the tuples processed by the output operator are not of type String, then
the recovery code may fail because it attempts to interpret the messages
that were already stored as String. That's a bug in the operator. The
workaround is to convert the object to String in the upstream operator and
then pass the String to the Kafka output operator.

Thanks,
Thomas
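
A minimal converter for that workaround could look like this (a sketch; how
the payload is rendered as a String is application-specific, toString() here
is just a stand-in, and imports are omitted):

{code}
public class ToStringConverter extends BaseOperator
{
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<String>();

  public final transient DefaultInputPort<Object> input = new DefaultInputPort<Object>()
  {
    @Override
    public void process(Object tuple)
    {
      // Emit the String form so the Kafka output operator only ever sees String tuples.
      output.emit(tuple.toString());
    }
  };
}
{code}

The converter is then wired between the existing upstream operator and the
Kafka output operator with two dag.addStream calls.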


On Thu, Oct 13, 2016 at 10:46 AM, Bandaru, Srinivas <
srinivas.band...@optum.com> wrote:

> Hi,
>
> Need some help. While running a DT application with Kafka, we are running
> into issues with the application. When monitoring the application, we are
> observing that the operator becomes inactive and restarts continuously.
> Could you please review the log below and let us know if any configuration
> needs to be changed?
>
>
>
>
>
> 2016-10-12 16:40:16,835 INFO 
> org.apache.kafka.common.utils.AppInfoParser:
> Kafka version : 0.9.0.1
>
> 2016-10-12 16:40:16,835 INFO org.apache.kafka.common.utils.AppInfoParser:
> Kafka commitId : 23c69d62a0cabf06
>
> 16:40:17,153 INFO org.apache.apex.malhar.kafka.
> KafkaSinglePortExactlyOnceOutputOperator: Rebuild the partial window
> after 6340695403456888950
>
> 2016-10-12 16:40:18,827 ERROR com.datatorrent.stram.engine.StreamingContainer:
> Operator set 
> [OperatorDeployInfo[id=4,name=kafkaOut1,type=GENERIC,checkpoint={57feace0003b,
> 0, 0},inputs=[OperatorDeployInfo.InputDeployInfo[portName=
> inputPort,streamId=updateTopic1,sourceNodeId=3,sourcePortName=outTopic1,
> locality=,partitionMask=0,partitionKeys=]],outputs=[]]]
> stopped running due to an exception.
>
> java.lang.RuntimeException: Violates Exactly once. Not all the tuples
> received after operator reset.
>
> at org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutp
> utOperator.endWindow(KafkaSinglePortExactlyOnceOutputOperator.java:174)
>
> at com.datatorrent.stram.engine.GenericNode.processEndWindow(
> GenericNode.java:146)
>
> at com.datatorrent.stram.engine.GenericNode.run(GenericNode.
> java:357)
>
> at com.datatorrent.stram.engine.StreamingContainer$2.run(
> StreamingContainer.java:1407)
>
> 2016-10-12 16:40:18,838 INFO org.apache.kafka.clients.producer.KafkaProducer:
> Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.
>
> 2016-10-12 16:40:18,862 INFO com.datatorrent.stram.engine.StreamingContainer:
> Undeploy request: [4]
>
> 2016-10-12 16:40:18,863 INFO com.datatorrent.stram.engine.StreamingContainer:
> Undeploy complete.
>
> __stderr__0__stdout__0,_*container_e31_1476212051326_
> 0045_01_87¨Oø__dt.log__204352016-10-12 16:51:42,333 INFO
> com.datatorrent.stram.engine.StreamingContainer: Child starting with
> classpath: ./commons-beanutils-1.8.3.jar:./apex-api-3.4.0.jar:./apex-
> bufferserver-3.4.0.jar:./commons-lang3-3.1.jar:./
> httpcore-4.3.2.jar:./snappy-java-1.1.1.7.jar:./zkclient-0.
> 7.jar:./jctools-core-1.1.jar:./jopt-simple-3.2.jar:./apex-
> shaded-ning19-1.0.0.jar:./malhar-library-3.5.0.jar:./
> Kafka2Datatorrent-1.0-SNAPSHOT.jar:./bval-jsr303-0.
> 5.jar:./httpclient-4.3.5.jar:./jackson-mapper-asl-1.9.2.jar:
> ./kafka_2.10-0.9.0.1.jar:./bval-core-0.5.jar:./minlog-1.
> 2.jar:./jersey-apache-client4-1.9.jar:./malhar-contrib-3.4.
> 0.jar:./metrics-core-2.2.0.jar:./jackson-core-asl-1.9.2.
> jar:./validation-api-1.1.0.Final.jar:./gson-2.0.jar:./
> kryo-2.24.0.jar:./netlet-1.2.1.jar:./lz4-1.2.0.jar:./
> mbassador-1.1.9.jar:./slf4j-api-1.7.5.jar:./kafka-clients-
> 0.9.0.1.jar:./scala-library-2.10.5.jar:./kafka-unit-0.4.jar:
> ./apex-common-3.4.0.jar:./xbean-asm5-shaded-4.3.jar:./
> jersey-client-1.9.jar:./apex-engine-4.jar:./malhar-kafka-3.
> 5.0.jar:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/
> mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-
> nfs-2.7.0-mapr-1602.jar:/opt/mapr/hadoop/hadoop-2.7.0/
> share/hadoop/common/hadoop-common-2.7.0-mapr-1602.jar:/
> opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-
> common-2.7.0-mapr-1602-tests.jar:/opt/mapr/hadoop/hadoop-2.
> 7.0/share/hadoop/common/lib/stax-api-1.0-2.jar:/opt/mapr/
> hadoop/hadoop-2.7.0/share/hadoop/common/lib/maprfs-
> diagnostic-tools-5.1.0-mapr.jar:/opt/mapr/hadoop/hadoop-2.
> 7.0/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/
> opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/
> commons-compress-1.4.1.jar:/opt/mapr/hadoop/hadoop-2.7.0/
> share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/
> opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/
> curator-client-2.7.1.jar:/opt/mapr/hadoop/hadoop-2.7.0/
> share/hadoop/common/lib/maprdb-5.1.0-mapr-tests.jar:/
> opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/
> jets3t-0.9.0.jar:/opt/mapr/hadoop/hadoop-2.7.0/share/
> hadoop/common/lib/jsp-api-2.1.jar:/opt/mapr/hadoop/hadoop-2.
> 7.0/share/hadoop/common/lib/curator-framework-2.7.1.jar:/
> opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/mapr-
> hbase-5.1.0-mapr.jar:/opt/mapr/hadoop/hadoop-2.7.0/
> share/hadoop/common/lib/hamcrest-core-1.3.jar:/opt/
> 

Re: KafkaSinglePortExactlyOnceOutputOperator

2016-10-12 Thread Thomas Weise
Yes, it may be better to pull out the topic selection into an upstream
operator that emits the messages on separate ports per topic and then you
can use the exactly-once output operator without customization.
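
A sketch of such a router (names, the routing rule, and the String payload
type are placeholders; each output port is then connected to its own
exactly-once Kafka output operator configured with its own topic):

{code}
public class TopicRouter extends BaseOperator
{
  public final transient DefaultOutputPort<String> groupA = new DefaultOutputPort<String>();
  public final transient DefaultOutputPort<String> groupB = new DefaultOutputPort<String>();

  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String msg)
    {
      // Placeholder for the application's validation/grouping logic.
      if (msg.contains("groupA")) {
        groupA.emit(msg);
      } else {
        groupB.emit(msg);
      }
    }
  };
}
{code}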


On Wed, Oct 12, 2016 at 11:29 AM, Bandaru, Srinivas <
srinivas.band...@optum.com> wrote:

> Just a correction for use case.
>
>
>
> Use case: a Kafka producer emits messages, which are processed using a
> DataTorrent application. Based on the DT application logic, the messages
> are divided into two different groups and written to two different Kafka
> topics.
>
>
>
> Thanks,
>
> Srinivas
>
>
>
>
>
> *From:* Bandaru, Srinivas [mailto:srinivas.band...@optum.com]
> *Sent:* Wednesday, October 12, 2016 1:25 PM
> *To:* users@apex.apache.org
> *Cc:* Singh, Jaspal
> *Subject:* KafkaSinglePortExactlyOnceOutputOperator
>
>
>
> Hi, we need some help with the errors we are having with
> KafkaSinglePortExactlyOnceOutputOperator. When monitoring the application,
> we are observing that the operator becomes inactive and restarts
> continuously. Could anyone help us identify the issue?
>
>
>
> Use case: a Kafka producer emits messages, which are processed using a
> DataTorrent application. Based on the DT application logic, the messages
> are divided into two different groups and written to two different MapR
> Streams topics.
>
>
>
>
>
>
>
> Resource manager log snippet.
>
> *2016-10-12 10:42:35,134 INFO
> org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator:
> Rebuild the partial window after 6340600660773306680*
>
> *2016-10-12 10:40:29,603 INFO
> com.example.datatorrent.TenantUpdateValidator: error*
>
> *2016-10-12 10:40:29,605 ERROR
> com.datatorrent.stram.engine.StreamingContainer: Operator set
> [OperatorDeployInfo[id=3,name=topicUpdate,type=GENERIC,checkpoint={57fe56b5012b,
> 0,
> 0},inputs=[OperatorDeployInfo.InputDeployInfo[portName=inputPort,streamId=jsonObject,sourceNodeId=2,sourcePortName=out,locality=,partitionMask=0,partitionKeys=]],outputs=[]]]
> stopped running due to an exception.*
>
> *java.lang.RuntimeException: Violates Exactly once. Not all the tuples
> received after operator reset.*
>
> *at
> org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator.endWindow(KafkaSinglePortExactlyOnceOutputOperator.java:174)*
>
> *at
> com.datatorrent.stram.engine.GenericNode.processEndWindow(GenericNode.java:146)*
>
> *at
> com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:357)*
>
> *at
> com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1407)*
>
> *2016-10-12 10:40:29,617 INFO
> org.apache.kafka.clients.producer.KafkaProducer: Closing the Kafka producer
> with timeoutMillis = 9223372036854775807 ms.*
>
> *2016-10-12 10:40:29,681 INFO
> com.datatorrent.stram.engine.StreamingContainer: Undeploy request: [3]*
>
> *2016-10-12 10:40:29,682 INFO
> com.datatorrent.stram.engine.StreamingContainer: Undeploy complete.*
>
>
>
>
>
> *2016-10-12 10:36:13,513 INFO com.datatorrent.bufferserver.server.Server:
> Received subscriber request: SubscribeRequestTuple{version=1.0,
> identifier=tcp://dbslt0080:60777/2.out.1, windowId=57fe57bf,
> type=jsonObject/3.inputPort, upstreamIdentifier=2.out.1, mask=0,
> partitions=null, bufferSize=0}*
>
> *2016-10-12 10:36:15,534 ERROR
> com.datatorrent.netlet.AbstractLengthPrependerClient: Disconnecting
> Server.Subscriber{type=jsonObject/3.inputPort, mask=0, partitions=null}
> because of an exception.*
>
> *java.io.IOException: Connection reset by peer*
>
> *at sun.nio.ch.FileDispatcherImpl.read0(Native Method)*
>
> *at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)*
>
> *at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)*
>
> *at sun.nio.ch.IOUtil.read(IOUtil.java:197)*
>
> *at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)*
>
> *at
> com.datatorrent.netlet.AbstractClient.read(AbstractClient.java:166)*
>
> *at
> com.datatorrent.netlet.DefaultEventLoop.handleSelectedKey(DefaultEventLoop.java:356)*
>
> *at
> com.datatorrent.netlet.OptimizedEventLoop$SelectedSelectionKeySet.forEach(OptimizedEventLoop.java:59)*
>
> *at
> com.datatorrent.netlet.OptimizedEventLoop.runEventLoop(OptimizedEventLoop.java:192)*
>
> *at
> com.datatorrent.netlet.OptimizedEventLoop.runEventLoop(OptimizedEventLoop.java:157)*
>
> *at
> com.datatorrent.netlet.DefaultEventLoop.run(DefaultEventLoop.java:156)*
>
> *at java.lang.Thread.run(Thread.java:745)*
>
>
>
>
>
> Thanks,
>
> *Srinivas*
>
>
>
>

Re: DataTorrent application Parllel processing

2016-10-11 Thread Thomas Weise
There is some draft documentation for 0.9 here:

https://github.com/chaithu14/incubator-apex-malhar/blob/APEXMALHAR-2242-kafka0.9doc/docs/operators/kafkaInputOperator.md#09-version-of-kafka-input-operator


On Tue, Oct 11, 2016 at 3:21 PM, Thomas Weise <thomas.we...@gmail.com>
wrote:

> Hi,
>
> There is general information on partitioning here:
>
> http://apex.apache.org/docs/apex/application_development/#partitioning
>
> and a more recent presentation here:
>
> http://www.slideshare.net/ApacheApex/smart-partitioning-
> with-apache-apex-webinar
>
> For MapR streams you are using the 0.9 operator, and you can set the
> partitioning strategy on it:
>
> https://github.com/apache/apex-malhar/blob/master/kafka/
> src/main/java/org/apache/apex/malhar/kafka/PartitionStrategy.java
>
> Thanks,
> Thomas
>
>
>
> On Tue, Oct 11, 2016 at 11:30 AM, Bandaru, Srinivas <
> srinivas.band...@optum.com> wrote:
>
>> Hi, we need some help. How can we achieve parallelism for a DataTorrent
>> application (running with MapR Streams) for benchmarking?
>>
>>
>>
>> Thanks,
>>
>> *Srinivas Bandaru*
>>
>>
>>
>>
>> This e-mail, including attachments, may include confidential and/or
>> proprietary information, and may be used only by the person or entity
>> to which it is addressed. If the reader of this e-mail is not the intended
>> recipient or his or her authorized agent, the reader is hereby notified
>> that any dissemination, distribution or copying of this e-mail is
>> prohibited. If you have received this e-mail in error, please notify the
>> sender by replying to this message and delete this e-mail immediately.
>>
>
>


Re: DataTorrent application Parllel processing

2016-10-11 Thread Thomas Weise
Hi,

There is general information on partitioning here:

http://apex.apache.org/docs/apex/application_development/#partitioning

and a more recent presentation here:

http://www.slideshare.net/ApacheApex/smart-partitioning-with-apache-apex-webinar

For MapR streams you are using the 0.9 operator, and you can set the
partitioning strategy on it:

https://github.com/apache/apex-malhar/blob/master/kafka/src/main/java/org/apache/apex/malhar/kafka/PartitionStrategy.java

Thanks,
Thomas



On Tue, Oct 11, 2016 at 11:30 AM, Bandaru, Srinivas <
srinivas.band...@optum.com> wrote:

> Hi, we need some help. How can we achieve parallelism for a DataTorrent
> application (running with MapR Streams) for benchmarking?
>
>
>
> Thanks,
>
> *Srinivas Bandaru*
>
>
>
>
> This e-mail, including attachments, may include confidential and/or
> proprietary information, and may be used only by the person or entity
> to which it is addressed. If the reader of this e-mail is not the intended
> recipient or his or her authorized agent, the reader is hereby notified
> that any dissemination, distribution or copying of this e-mail is
> prohibited. If you have received this e-mail in error, please notify the
> sender by replying to this message and delete this e-mail immediately.
>


Re: Datatorrent fault tolerance

2016-10-06 Thread Thomas Weise
For recovery you need to set the window data manager like so:

https://github.com/DataTorrent/examples/blob/master/tutorials/exactly-once/src/main/java/com/example/myapexapp/Application.java#L33

That will also apply to stateful restart of the entire application
(relaunch from previous instance's checkpointed state).

For cold restart, you would need to consider the property you mention and
decide what is applicable to your use case.

Thomas
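
In code, the relevant line from that example amounts to the following on
the Kafka input operator (a sketch; FSWindowDataManager is what the linked
example uses, verify the exact class and package for your Malhar version):

{code}
KafkaSinglePortInputOperator in =
    dag.addOperator("kafkaInput", new KafkaSinglePortInputOperator());
// Persist per-window metadata (such as the consumed offset ranges) so the same
// windows are replayed identically after failure recovery or relaunch.
in.setWindowDataManager(new FSWindowDataManager());
{code}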


On Thu, Oct 6, 2016 at 4:16 PM, Jaspal Singh <jaspal.singh1...@gmail.com>
wrote:

> Ok now I get it. Thanks for the nice explanation!!
>
> One more thing: you mentioned checkpointing the offset ranges
> to replay in the same order from Kafka.
>
> Is there any property we need to configure to do that? like initialOffset
> set to APPLICATION_OR_LATEST.
>
>
> Thanks
> Jaspal
>
>
> On Thursday, October 6, 2016, Thomas Weise <thomas.we...@gmail.com> wrote:
>
>> What you want is the effect of exactly-once output (that's why we call it
>> also end-to-end exactly-once). There is no such thing as exactly-once
>> processing in a distributed system. In this case it would be rather
>> "produce exactly-once. Upstream operators, on failure, will recover to
>> checkpointed state and re-process the stream from there. This is
>> at-least-once, the default behavior. Because in the input operator you have
>> configured to replay in the same order from Kafka (this is done by
>> checkpointing the offset ranges), the computation in the DAG is idempotent
>> and the output operator can discard the results that were already published
>> instead of producing duplicates.
>>
>> On Thu, Oct 6, 2016 at 3:57 PM, Jaspal Singh <jaspal.singh1...@gmail.com>
>> wrote:
>>
>>> I think this is something called a customized operator implementation
>>> that is taking care of exactly once processing at output.
>>>
>>> What if any previous operators fail ? How we can make sure they also
>>> recover using EXACTLY_ONCE processing mode ?
>>>
>>>
>>> Thanks
>>> Jaspal
>>>
>>>
>>> On Thursday, October 6, 2016, Thomas Weise <thomas.we...@gmail.com>
>>> wrote:
>>>
>>>> In that case please have a look at:
>>>>
>>>> https://github.com/apache/apex-malhar/blob/master/kafka/src/
>>>> main/java/org/apache/apex/malhar/kafka/KafkaSinglePortExactl
>>>> yOnceOutputOperator.java
>>>>
>>>> The operator will ensure that messages are not duplicated, under the
>>>> stated assumptions.
>>>>
>>>>
>>>> On Thu, Oct 6, 2016 at 3:37 PM, Jaspal Singh <
>>>> jaspal.singh1...@gmail.com> wrote:
>>>>
>>>>> Hi Thomas,
>>>>>
>>>>> In our case we are writing the results back to maprstreams topic based
>>>>> on some validations.
>>>>>
>>>>>
>>>>> Thanks
>>>>> Jaspal
>>>>>
>>>>>
>>>>> On Thursday, October 6, 2016, Thomas Weise <t...@apache.org> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> which operators in your application are writing to external systems?
>>>>>>
>>>>>> When you look at the example from the blog (
>>>>>> https://github.com/DataTorrent/examples/tree/master/tutoria
>>>>>> ls/exactly-once), there is Kafka input, which is configured to be
>>>>>> idempotent. The results are written to JDBC. That operator by itself
>>>>>> supports exactly-once through transactions (in conjunction with 
>>>>>> idempotent
>>>>>> input), hence there is no need to configure the processing mode at all.
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>


Re: Datatorrent fault tolerance

2016-10-06 Thread Thomas Weise
What you want is the effect of exactly-once output (that's why we also call
it end-to-end exactly-once). There is no such thing as exactly-once
processing in a distributed system. In this case it would rather be
"produce exactly-once". Upstream operators, on failure, will recover to
checkpointed state and re-process the stream from there. This is
at-least-once, the default behavior. Because in the input operator you have
configured to replay in the same order from Kafka (this is done by
checkpointing the offset ranges), the computation in the DAG is idempotent
and the output operator can discard the results that were already published
instead of producing duplicates.

On Thu, Oct 6, 2016 at 3:57 PM, Jaspal Singh <jaspal.singh1...@gmail.com>
wrote:

> I think this is something called a customized operator implementation that
> is taking care of exactly once processing at output.
>
> What if any previous operators fail ? How we can make sure they also
> recover using EXACTLY_ONCE processing mode ?
>
>
> Thanks
> Jaspal
>
>
> On Thursday, October 6, 2016, Thomas Weise <thomas.we...@gmail.com> wrote:
>
>> In that case please have a look at:
>>
>> https://github.com/apache/apex-malhar/blob/master/kafka/src/
>> main/java/org/apache/apex/malhar/kafka/KafkaSinglePortExactl
>> yOnceOutputOperator.java
>>
>> The operator will ensure that messages are not duplicated, under the
>> stated assumptions.
>>
>>
>> On Thu, Oct 6, 2016 at 3:37 PM, Jaspal Singh <jaspal.singh1...@gmail.com>
>> wrote:
>>
>>> Hi Thomas,
>>>
>>> In our case we are writing the results back to maprstreams topic based
>>> on some validations.
>>>
>>>
>>> Thanks
>>> Jaspal
>>>
>>>
>>> On Thursday, October 6, 2016, Thomas Weise <t...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> which operators in your application are writing to external systems?
>>>>
>>>> When you look at the example from the blog (
>>>> https://github.com/DataTorrent/examples/tree/master/tutoria
>>>> ls/exactly-once), there is Kafka input, which is configured to be
>>>> idempotent. The results are written to JDBC. That operator by itself
>>>> supports exactly-once through transactions (in conjunction with idempotent
>>>> input), hence there is no need to configure the processing mode at all.
>>>>
>>>> Thomas
>>>>
>>>>


Re: Datatorrent fault tolerance

2016-10-06 Thread Thomas Weise
Hi,

It would be necessary to know a bit more about your application for
specific recommendations, but from what I see above, a few things don't
look right.

It appears that you are setting the processing mode on the input operator,
which only reads from Kafka. Exactly-once is important for operators that
affect the state of external systems, for example when writing to a
database, or, producing messages to Kafka.

There is also a problem with the documentation that may contribute to the
confusion. The processing mode you are trying to use isn't really
exactly-once and probably will be deprecated in a future release. Here are
some better resources on fault tolerance and processing semantics:

http://apex.apache.org/docs.html
https://www.youtube.com/watch?v=FCMY6Ii89Nw
http://www.slideshare.net/ApacheApexOrganizer/webinar-fault-toleranceandprocessingsemantics
https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/

HTH,
Thomas


On Thu, Oct 6, 2016 at 11:26 AM, Bandaru, Srinivas <
srinivas.band...@optum.com> wrote:

> Hi, Need some help!
>
> We need your advice on implementing the fault tolerance mechanism using
> DataTorrent. Currently we are trying to implement the EXACTLY_ONCE
> scenario.
>
> We are referring to the documentation below. The highlighted line says that
> if we use the EXACTLY_ONCE processing mode, the downstream operators
> should have AT_MOST_ONCE. We added the operator-level attributes in
> properties.xml (attached below), and when we launch the application, no
> messages come through the last operator topicUpdate, but we are
> able to see the messages when we comment those attributes out.
>
> *Please have a look at it, either we are specifying the attributes
> incorrectly or is there any additional step to be done*.
>
>
>
>
>
> · PROCESSING_MODE
>
> static final Attribute<Operator.ProcessingMode> PROCESSING_MODE
>
> The payload processing mode for this operator - at most once, exactly
> once, or default at least once. If the processing mode for an operator is
> specified as AT_MOST_ONCE and no processing mode is specified for the
> downstream operators if any, the processing mode of the downstream
> operators is automatically set to AT_MOST_ONCE. If a different processing
> mode is specified for the downstream operators it will result in an error. If
> the processing mode for an operator is specified as EXACTLY_ONCE then the
> processing mode for all downstream operators should be specified as
> AT_MOST_ONCE otherwise it will result in an error.
>
>
>
>
>
>
>
>
>
>
>
>
>
> Thanks,
>
> Srinivas
>
>
> This e-mail, including attachments, may include confidential and/or
> proprietary information, and may be used only by the person or entity
> to which it is addressed. If the reader of this e-mail is not the intended
> recipient or his or her authorized agent, the reader is hereby notified
> that any dissemination, distribution or copying of this e-mail is
> prohibited. If you have received this e-mail in error, please notify the
> sender by replying to this message and delete this e-mail immediately.
>


Re: Apex and Malhar Java 8 Certified

2016-10-03 Thread Thomas Weise
Apex is built against Java 7 and expected to work as is on Java 8 (Hadoop
distros still support 1.7 as well). Are you running into specific issues?

Thanks,
Thomas

On Mon, Oct 3, 2016 at 12:06 PM, McCullough, Alex <
alex.mccullo...@capitalone.com> wrote:

> Hey All,
>
>
>
> I know there were talks about this at some point but is Apex and/or Malhar
> Java 8 certified? If not, is there a current plan and date to be so?
>
>
>
> Thanks,
>
> Alex
>
> --
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>


Re: datatorrent gateway application package problem

2016-09-29 Thread Thomas Weise
Glad you have it working now. And yes, the gateway (like the apex CLI) only
depends on the hadoop script. Different distributions of Hadoop have
different ways of managing their bits and we want to be agnostic. So if you
would like to use a different conf dir, then you would make that change
inside your Hadoop setup and verify it works with just the hadoop script.

Thomas


On Thu, Sep 29, 2016 at 2:48 PM,  wrote:

> David,
>
> I double-checked and I was wrong in my previous email.
> The fs.defaultFS setting is in core-site.xml, not hdfs-site, so that
> wasn't the problem.
>
> But you were essentially right about the cause.
>
> I have my hadoop xml files in a directory pointed to by the
> $HADOOP_CONF_DIR environment variable, but this directory is separate from
> the hadoop installation tree (ie, it's not
> a subdir of HADOOP_HOME or HADOOP_PREFIX).
>
> I copied all my xml files into $HADOOP_HOME/etc/hadoop/, restarted
> dtgateway, and ran the
> instalation wizard again, and this time it didn't get the error.  Also, I
> can import
> packages from dtHub successfully, which didn't work before either.
>
> So that fixes the problem.  My guess is that dtgateway doesn't use
> HADOOP_CONF_DIR to find the hadoop xml files.
>
> Thanks very much for your help!
>
> -david
>
>
> On Wed, Sep 28, 2016 at 9:15 PM,  wrote:
>>
>> On 2016-09-28 17:25, David Yan wrote:
>>>
>>> FYI, I'm getting the exact same error after I set fs.defaultFS to
 file:/// in core-site.xml in the hadoop installation and restarted
 yarn.

 Wrong FS: hdfs://localhost:9000/user/dtadmin/datatorrent,
 expected:
 file:///

 For now, dtgateway only works if fs.defaultFS is the same FS as
 the
 "DFS location" in the dtgateway configuration. I'll log this as a
 bug.

 Please double check the correct value is picked up for
 fs.defaultFS in
 your installation. I think that value should be in core-site.xml,
 not
 hdfs-site.xml, and if not specified in core-site.xml, it would
 default
 to file:///

 David

>>>
>>> I'll check the next time I have access to the hardware.  Thanks
>>> very much.
>>> There's no way I would have found this by myself.  I really
>>> appreciate
>>> you spending the time on this.
>>>
>>> -david
>>>
>>>
>>> On Wed, Sep 28, 2016 at 4:02 PM,  wrote:
>>>
>>> On 2016-09-28 14:56, Vlad Rozov wrote:
>>>
>>> Check namenode logs. Are there any exceptions/errors raised at the
>>> same time as the gateway exception?
>>>
>>> Nope.  Nothing interesting there.
>>>
>>> On 9/28/16 14:43, d...@softiron.co.uk wrote:
>>> On 2016-09-28 14:09, David Yan wrote:
>>> dtgateway does not use the native libraries.
>>>
>>> I just tried a clean install of RTS 3.5.0 community edition on plain
>>> vanilla hadoop 2.7.3 on ubuntu 14.04 and was able to launch the pi
>>> demo from the web interface. I was able to do the same even with the
>>> native libraries removed.
>>>
>>> Thanks.  I thought it was a long shot.
>>>
>>> What is your fs.defaultFS property in hadoop? Can you verify that
>>> it
>>> points to your hdfs (hdfs://namenode:9000/) ?
>>>
>>> In $HADOOP_CONF_DIR/hdfs-site.xml, fs.defaultFS is
>>> hdfs://namenode:9000
>>>
>>> Note also that if I try to give a URL to the dtgateway Installation
>>> Wizard
>>> that isn't the namenode host/port, it complains immediately, instead
>>> of
>>> proceeding to the next page.
>>>
>>> Unfortunately I don't have access to an ARM64 so any finding you
>>> post
>>> here would be helpful.
>>>
>>> I'm not surprised.  A classic chicken-and-egg problem.  There aren't
>>> a lot of machines around, so the software doesn't get a lot of
>>> attention, so people don't buy machines.
>>>
>>> David
>>>
>>> On Wed, Sep 28, 2016 at 11:04 AM,  wrote:
>>>
>>> I had a thought of what might be the problem:
>>> the hadoop native library doesn't build on SUSE/SLES ARM64,
>>> so I'm running with the pure Java library.
>>>
>>> Is it possible that the dtgateway is trying to use the native
>>> library
>>> directly?
>>>
>>> Thanks,
>>> -david
>>>
>>> On 2016-09-26 14:45, David Yan wrote:
>>>
>>> Do you see any exception stacktrace in the log when this error
>>> occurred:
>>>
>>> Wrong FS: hdfs://namenode:9000/user/dtadmin/datatorrent, expected:
>>> file:///
>>>
>>> David
>>>
>>> On Mon, Sep 26, 2016 at 2:39 PM,  wrote:
>>>
>>> FromDavid Yan 
>>> DateSat 00:12
>>>
>>> Can you please provide any exception stacktrace in the
>>> dtgateway.log file when that happens?
>>>
>>> I reran the dtgateway installation wizard, which failed (same as
>>> before) with error msg:
>>>
>>> | DFS directory cannot be written to with error message "Mkdirs
>>> failed to create /user/dtadmin/datatorrent (exists=false,
>>> cwd=file:/opt/datatorrent/releases/3.4.0)"
>>>
>>> The corresponding exception in dtgateway.log is:
>>>
>>> | 2016-09-26 13:43:54,767 

Re: datatorrent gateway application package problem

2016-09-26 Thread Thomas Weise
Hi,

Can you try to launch the application using apex CLI instead of the UI?
That might help to determine if it is a problem with the Hadoop install or
the gateway:

http://apex.apache.org/docs/apex/apex_cli/

Thanks,
Thomas


On Mon, Sep 26, 2016 at 5:04 PM, David Yan  wrote:

> Added back users@ to the thread.
>
> On Mon, Sep 26, 2016 at 5:03 PM, David Yan  wrote:
>
>> Can you try one of the released hadoop versions, like 2.7.3 or 2.7.2 or
>> 2.6.4?
>> We will check that sha commit on our side as well.
>>
>> David
>>
>> On Mon, Sep 26, 2016 at 4:40 PM,  wrote:
>>
>>> On 2016-09-26 16:02, David Yan wrote:
>>>
 From your first email, you said you're using hadoop 2.7.4, but that
 hadoop version has not been released.

>>>
>>> Did you build it yourself? If so, can you provide a github tag or
 something similar so we can build that ourselves so that we can
 reproduce the issue?

>>>
>>> Yes.  Built from apache git repo, branch branch-2.7, date Aug29, sha is:
>>>
>>> 32a86f199cd8e7f32c264af55e3459e4b4751963
>>>
>>> I changed 1 file:
>>>
>>> diff --git a/hadoop-project/pom.xml b/hadoop-project/pom.xml
>>> index 1765363..c35df02 100644
>>> --- a/hadoop-project/pom.xml
>>> +++ b/hadoop-project/pom.xml
>>> @@ -74,3 +74,3 @@
>>>  
>>> -2.5.0
>>> +2.6.1
>>>  ${env.HADOOP_PROTOC_PATH}
>>> @@ -95,3 +95,3 @@
>>>  
>>> --Xmx4096m -XX:MaxPermSize=768m
>>> -XX:+HeapDumpOnOutOfMemoryError
>>> +-Xmx4096m
>>> -XX:+HeapDumpOnOutOfMemoryError
>>>  2.17
>>>
>>>
 Also, what Apex or DT RTS version are you using?

>>>
>>> 3.4.0, also built from Apache source, using tar file:
>>> apex-3.4.0-source-release.tar.gz
>>>
>>> Likewise for malhar, using apache-apex-malhar-3.4.0-sourc
>>> e-release.tar.gz
>>>
>>> I had to change one thing to get it to build.
>>> This disables the maven check against deploying from a snapshot.
>>>
>>> apache-apex-malhar-3.4.0> diff -p -C1 contrib/pom.xml
>>> contrib/pom.xml.orig
>>> *** contrib/pom.xml Tue Sep 13 12:02:27 2016
>>> --- contrib/pom.xml.origFri May 20 00:19:42 2016
>>> ***
>>> *** 199,201 
>>>
>>> - >> ***
>>> *** 214,216 
>>>
>>> ! -->
>>> 
>>> --- 213,215 
>>>
>>> !
>>> 
>>>
>>>
>>> Note: I also tried malhar v3.5 on top of apex3.4, and got the same
>>> results.
>>>
>>> I think that covers everything.
>>>
>>> thanks again,
>>> -david
>>>
>>>
>>>
>>>
 Thanks!

 David

 On Mon, Sep 26, 2016 at 3:15 PM,  wrote:

 On 2016-09-26 14:52, David Yan wrote:
>
> Also, one of the first steps in the installation wizard is to
>> enter
>> the hadoop location (the same screen as the DFS directory).
>> Can you please double check the hadoop location points to the
>> correct
>> hadoop binary in your system?
>>
>
> That, I've already done.  It's definitely correct
> (/opt/hadoop/bin/hadoop)
>
> -dbs
>
>
> David
>
> On Mon, Sep 26, 2016 at 2:45 PM, David Yan 
> wrote:
>
> Do you see any exception stacktrace in the log when this error
> occurred:
>
> Wrong FS: hdfs://namenode:9000/user/dtadmin/datatorrent, expected:
> file:///
>
> David
>
> On Mon, Sep 26, 2016 at 2:39 PM,  wrote:
> From: David Yan
> Date: Sat 00:12
>
> Can you please provide any exception stacktrace in the dtgateway.log
> file when that happens?
>
> I reran the dtgateway installation wizard, which failed (same as
> before) with error msg:
>
> | DFS directory cannot be written to with error message "Mkdirs
> failed to create /user/dtadmin/datatorrent (exists=false,
> cwd=file:/opt/datatorrent/releases/3.4.0)"
>
> The corresponding exception in dtgateway.log is:
>
> | 2016-09-26 13:43:54,767 ERROR com.datatorrent.gateway.I: DFS
> Directory cannot be written to with exception:
> | java.io.IOException: Mkdirs failed to create
> /user/dtadmin/datatorrent (exists=false,
> cwd=file:/opt/datatorrent/releases/3.4.0)
> | at
>
>
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileS
 ystem.java:455)

> | at
>
>
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileS
 ystem.java:440)

> | at
> org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
> | at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
> | at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
> | at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:778)
> | at
> com.datatorrent.stram.client.FSAgent.createFile(FSAgent.java:77)
> | at com.datatorrent.gateway.I.h(gc:324)
> | at 

Re: Can Apex launch command be submitted without using Apex CLI interactively?

2016-09-13 Thread Thomas Weise
Hi,

The apex CLI can also be used in batch mode:

-e 

you can also use a script file like so:

apex < mycmds.txt

Other apex CLI documentation is here:

http://apex.apache.org/docs/apex/apex_cli/

We will add the above to it.

HTH,
Thomas



On Tue, Sep 13, 2016 at 1:02 PM, Meemee Bradley 
wrote:

> Hi,
>
> I am new to Apache Apex. I have a script to monitor submitted Apex
> application using yarn utilities, when application fails, if application
> can not relaunch/recover by itself, I would like to launch the application
> again. Currently, I can do that by using Apex CLI, and launch the
> application with original ID interactively. I would prefer to launch the
> application again from a script without using Apex CLI interactively. Any
> ideas regarding how do I accomplish that?
>
> Thanks!
>
>


Re: Optional output ports.

2016-09-13 Thread Thomas Weise
Thanks!

On Tue, Sep 13, 2016 at 9:20 AM, McCullough, Alex <
alex.mccullo...@capitalone.com> wrote:

> Created…. https://issues.apache.org/jira/browse/APEXCORE-528
>
>
>
> Not sure if I was or wasn’t supposed to fill out some of the options, but
> took a stab at it.
>
>
>
> Thanks,
>
> Alex
>
>
>
> *From: *Thomas Weise <thomas.we...@gmail.com>
> *Reply-To: *"users@apex.apache.org" <users@apex.apache.org>
> *Date: *Tuesday, September 13, 2016 at 11:58 AM
>
> *To: *"users@apex.apache.org" <users@apex.apache.org>
> *Subject: *Re: Optional output ports.
>
>
>
> Alex, can you create a JIRA (https://issues.apache.org/
> jira/browse/APEXCORE/)?
>
>
>
> On Tue, Sep 13, 2016 at 8:52 AM, McCullough, Alex <
> alex.mccullo...@capitalone.com> wrote:
>
> I think it’s a little confusing too that the default value for the
> optional attribute on the output ports is true, but there is a requirement
> implemented by the validator that negates this unless you explicitly set it
> to on all your output ports (to the default no less).
>
>
>
>
>
> Thanks,
>
> Alex
>
>
>
> *From: *Thomas Weise <t...@apache.org>
> *Reply-To: *"users@apex.apache.org" <users@apex.apache.org>
> *Date: *Tuesday, September 13, 2016 at 11:10 AM
> *To: *"users@apex.apache.org" <users@apex.apache.org>
> *Subject: *Re: Optional output ports.
>
>
>
> That's right, if there are multiple output ports, the validation demands
> that at least one is connected.
>
>
>
> I actually think this validation is incorrect. It should be up to the
> application developer to decide how the output of an operator is consumed.
>
>
>
> It is similar to the return value of a function, you don't force the user
> to assign or use it?
>
>
>
> Thomas
>
>
>
>
>
> On Tue, Sep 13, 2016 at 7:29 AM, Munagala Ramanath <r...@datatorrent.com>
> wrote:
>
> Yes, if you have ports at least one must be connected if there are no
> annotations on them.
>
>
>
> The code is in LogicalPlan.validate() -- checkout the allPortsOptional
> variable.
>
>
>
> Ram
>
>
>
> On Tue, Sep 13, 2016 at 3:17 AM, Tushar Gosavi <tus...@datatorrent.com>
> wrote:
>
> Hi All,
>
> I have an input operator with one output port without any annotation.
> When I launch the application using just this operator I get
> ValidationException "At least one output port must be connected".  By
> default connecting output ports are optional, or is it mandatory to
> connect at least one output port of an operator, if there are no
> annotation on them.
>
> Regards.
> - Tushar.
>
>
>
>
>
>
>


Re: Optional output ports.

2016-09-13 Thread Thomas Weise
Alex, can you create a JIRA (https://issues.apache.org/jira/browse/APEXCORE/
)?

On Tue, Sep 13, 2016 at 8:52 AM, McCullough, Alex <
alex.mccullo...@capitalone.com> wrote:

> I think it’s a little confusing too that the default value for the
> optional attribute on the output ports is true, but there is a requirement
> implemented by the validator that negates this unless you explicitly set it
> to on all your output ports (to the default no less).
>
>
>
>
>
> Thanks,
>
> Alex
>
>
>
> *From: *Thomas Weise <t...@apache.org>
> *Reply-To: *"users@apex.apache.org" <users@apex.apache.org>
> *Date: *Tuesday, September 13, 2016 at 11:10 AM
> *To: *"users@apex.apache.org" <users@apex.apache.org>
> *Subject: *Re: Optional output ports.
>
>
>
> That's right, if there are multiple output ports, the validation demands
> that at least one is connected.
>
>
>
> I actually think this validation is incorrect. It should be up to the
> application developer to decide how the output of an operator is consumed.
>
>
>
> It is similar to the return value of a function, you don't force the user
> to assign or use it?
>
>
>
> Thomas
>
>
>
>
>
> On Tue, Sep 13, 2016 at 7:29 AM, Munagala Ramanath <r...@datatorrent.com>
> wrote:
>
> Yes, if you have ports at least one must be connected if there are no
> annotations on them.
>
>
>
> The code is in LogicalPlan.validate() -- checkout the allPortsOptional
> variable.
>
>
>
> Ram
>
>
>
> On Tue, Sep 13, 2016 at 3:17 AM, Tushar Gosavi <tus...@datatorrent.com>
> wrote:
>
> Hi All,
>
> I have an input operator with one output port without any annotation.
> When I launch the application using just this operator I get
> ValidationException "At least one output port must be connected".  By
> default connecting output ports are optional, or is it mandatory to
> connect at least one output port of an operator, if there are no
> annotation on them.
>
> Regards.
> - Tushar.
>
>
>
>
>
>


Re: Optional output ports.

2016-09-13 Thread Thomas Weise
That's right, if there are multiple output ports, the validation demands
that at least one is connected.

I actually think this validation is incorrect. It should be up to the
application developer to decide how the output of an operator is consumed.

It is similar to the return value of a function, you don't force the user
to assign or use it?

Thomas
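
For illustration, a minimal sketch of how a port can be marked optional so the
validation described above does not require it to be connected; the operator
and tuple type here are hypothetical, only the annotation usage is the point:

import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.annotation.OutputPortFieldAnnotation;
import com.datatorrent.common.util.BaseOperator;

public class SideOutputOperator extends BaseOperator
{
  // Explicitly marking the port optional avoids the
  // "At least one output port must be connected" validation error.
  @OutputPortFieldAnnotation(optional = true)
  public final transient DefaultOutputPort<String> debugOutput = new DefaultOutputPort<>();
}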


On Tue, Sep 13, 2016 at 7:29 AM, Munagala Ramanath 
wrote:

> Yes, if you have ports at least one must be connected if there are no
> annotations on them.
>
> The code is in LogicalPlan.validate() -- checkout the allPortsOptional
> variable.
>
> Ram
>
> On Tue, Sep 13, 2016 at 3:17 AM, Tushar Gosavi 
> wrote:
>
>> Hi All,
>>
>> I have an input operator with one output port without any annotation.
>> When I launch the application using just this operator I get
>> ValidationException "At least one output port must be connected".  By
>> default connecting output ports are optional, or is it mandatory to
>> connect at least one output port of an operator, if there are no
>> annotation on them.
>>
>> Regards.
>> - Tushar.
>>
>
>


Re: HDHT operator

2016-09-12 Thread Thomas Weise
That documentation is not for Malhar. You find the user documentation for
Apex here:

http://apex.apache.org/docs.html

See links on top.

So you want to enrich the data after it was emitted from the window?




On Mon, Sep 12, 2016 at 6:41 PM, Naresh Guntupalli <
naresh.guntupa...@gmail.com> wrote:

> Thanks Ashwin and Thomas for your quick help.
>
> @Thomas:  http://docs.datatorrent.com/operators/hdht/ - This is the doc I
> was referring to earlier.
>
> The use case is store and enrich certain user events in a (bigger) rolling
> window.
>
> -Naresh
>
> On Mon, Sep 12, 2016 at 5:43 PM, Thomas Weise <thomas.we...@gmail.com>
> wrote:
>
>> Here is an example for the equivalent functionality in Malhar:
>>
>> https://github.com/apache/apex-malhar/blob/master/benchmark/
>> src/main/java/com/datatorrent/benchmark/state/StoreOperator.java
>>
>> https://issues.apache.org/jira/browse/APEXMALHAR-2205
>>
>> It will still be useful to understand the use case as there are also a
>> couple ready to use operators that use this state management facility.
>>
>> Thomas
>>
>>
>> On Mon, Sep 12, 2016 at 5:34 PM, Thomas Weise <thomas.we...@gmail.com>
>> wrote:
>>
>>> Naresh,
>>>
>>> Which doc references this operator? Malhar has a replacement for this.
>>> Can you share some more info about your use case so that we can point you
>>> to the appropriate operator to start from?
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> On Mon, Sep 12, 2016 at 5:13 PM, Naresh Guntupalli <
>>> naresh.guntupa...@gmail.com> wrote:
>>>
>>>> The AbstractSinglePortHDHTWriter (HDHT) operator is missing from
>>>> Apex-Malhar 3.x releases, but the user docs still references it. Was the
>>>> operator removed or moved somewhere else?
>>>>
>>>
>>>
>>
>


[ANNOUNCE] Apache Apex Malhar 3.5.0 released

2016-09-05 Thread Thomas Weise
Dear Community,

The Apache Apex community is pleased to announce release 3.5.0 of the
Malhar library.

The release resolved 63 JIRAs and comes with exciting new features and
enhancements, including:

 - High level Java stream API now supports stateful transformations with
Apache Beam style windowing semantics. The demo package (
https://github.com/apache/apex-malhar/tree/v3.5.0/demos/highlevelapi) has
examples for usage of the API and an earlier presentation is at
http://www.slideshare.net/ApacheApex/java-high-level-stream-api

 - Improvements to underlying state management components for windowing,
ease of use in custom operators and incremental state saving. Work is
underway to take this further in the next release, along with benchmarking
for key cardinality and throughput.

 - Deduper solves a frequent task in processing stream data, to decide
whether a given record is a duplicate or not. The documentation explains
this in detail: http://apex.apache.org/docs/malhar/operators/deduper

 - JDBC poll operator, another operator frequently required in Apex use
cases. The operator can function as bounded or unbounded source, is
idempotent for exactly-once processing and partitionable. An example can be
found at https://github.com/DT-Priyanka/examples/tree/SPOI-
8548-jdbc-poller-example/tutorials/jdbcIngest

 - Enricher that essentially joins a stream with a lookup source can
operate on any POJO object. The user can solve this through configuration
and does not need to write code for the operator. See documentation and the
example linked from there: http://apex.apache.org/docs/
malhar/operators/enricher/

There are more features, enhancements and fixes in this release, see
https://s.apache.org/5vQi for full changes.

Apache Apex is an enterprise grade native YARN big data-in-motion platform
that unifies stream and batch processing. Apex was built for scalability
and low-latency processing, high availability and operability.

Apex provides features that similar platforms currently don’t offer, such
as fine grained, incremental recovery to only reset the portion of a
topology that is affected by a failure, support for elastic scaling based
on the ability to acquire (and release) resources as needed as well as the
ability to alter topology and operator properties on running applications.

Apex has been developed since 2012 and became an ASF top-level project earlier
this year, following 8 months of incubation. Apex early on brought the
combination of high throughput, low latency and fault tolerance with strong
processing guarantees to the stream data processing space and gained
maturity through important production use cases at several organizations.
See the powered by page and resources on the project web site for more
information:

http://apex.apache.org/powered-by-apex.html
http://apex.apache.org/docs.html

The Apex engine is supplemented by Malhar, the library of pre-built
operators, including adapters that integrate with many existing
technologies as sources and destinations, like message buses, databases,
files or social media feeds.

An easy way to get started with Apex is to pick one of the examples as
starting point. They cover many common and recurring tasks, such as data
consumption from different sources, output to various sinks, partitioning
and fault tolerance:

https://github.com/DataTorrent/examples/tree/master/tutorials

Apex Malhar and Core (the engine) are separate repositories and releases.
We expect more frequent releases of Malhar to roll out new connectors and
other operators based on a stable engine API. This release 3.5.0 works on
existing Apex Core 3.4.0. Users only need to upgrade the Maven dependency
in their project.

The source release can be found at:

http://www.apache.org/dyn/closer.lua/apex/apache-apex-malhar-3.5.0/

or visit:

http://apex.apache.org/downloads.html

We welcome your help and feedback. For more information on the project and
how to get involved, visit our website at:

http://apex.apache.org/

Regards,
The Apache Apex community


Re: Apex Application failing to run.

2016-09-04 Thread Thomas Weise
Since it is the virtual memory, please see:

http://docs.datatorrent.com/configuration/#hadoop-tuning

--
sent from mobile
On Sep 4, 2016 12:10 PM, "Munagala Ramanath"  wrote:

> It looks like there is not enough memory for all the containers. If you're
> running in the sandbox, how much memory is allocated to the sandbox ?
> Can you try increasing it ?
>
> There is also a detailed discussion of how to manage memory using
> configuration
> parameters in the properties files at: http://docs.datatorrent.
> com/tutorials/topnwords-c7/
>
> As a first step, can you simply build the archetype generated project
> (without copying
> over the source files for the word count application) and try to run the
> resulting app ?
> It has only 2 operators and so minimal memory requirements. If it runs, it
> will confirm
> that your setup is OK but availability of adequate memory is the issue.
>
> Ram
>
> On Sun, Sep 4, 2016 at 11:47 AM, Ambarish Pande <
> ambarish.pande2...@gmail.com> wrote:
>
>> Hello Guys,
>>
>> I have followed the steps shown in this video https://www.youtube.com/
>> watch?v=LwRWBudOjg4=youtu.be to build my first apex application.
>> My application is failing and the following diagnostics report is shown in
>> the web admin interface on master:8088
>>
>> --
>>
>> Application application_1473011986197_0001 failed 2 times due to AM
>> Container for appattempt_1473011986197_0001_02 exited with exitCode:
>> -103
>> For more detailed output, check application tracking page:
>> http://master:8088/cluster/app/application_1473011986197_0001Then, click
>> on links to logs of each attempt.
>> Diagnostics: Container [pid=8339,containerID=containe
>> r_1473011986197_0001_02_01] is running beyond virtual memory limits.
>> Current usage: 254.7 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB
>> virtual memory used. Killing container.
>> Dump of the process-tree for container_1473011986197_0001_02_01 :
>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
>> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>> |- 8389 8339 8339 8339 (java) 976 19 2575908864 64519
>> /usr/lib/jvm/java-8-oracle/bin/java -Djava.io.tmpdir=/home/hadoopu
>> ser/tmp/nm-local-dir/usercache/hadoopuser/appcache/applicati
>> on_1473011986197_0001/container_1473011986197_0001_02_01/tmp
>> -Xmx768m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dt-heap-1.bin
>> -Dhadoop.root.logger=INFO,RFA -Dhadoop.log.dir=/home/hadoopu
>> ser/hadoop/logs/userlogs/application_1473011986197_
>> 0001/container_1473011986197_0001_02_01
>> -Ddt.attr.APPLICATION_PATH=hdfs://master:54310/user/hadoopus
>> er/datatorrent/apps/application_1473011986197_0001
>> com.datatorrent.stram.StreamingAppMaster
>> |- 8339 8337 8339 8339 (bash) 0 0 17043456 679 /bin/bash -c
>> /usr/lib/jvm/java-8-oracle/bin/java -Djava.io.tmpdir=/home/hadoopu
>> ser/tmp/nm-local-dir/usercache/hadoopuser/appcache/applicati
>> on_1473011986197_0001/container_1473011986197_0001_02_01/tmp
>> -Xmx768m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dt-heap-1.bin
>> -Dhadoop.root.logger=INFO,RFA -Dhadoop.log.dir=/home/hadoopu
>> ser/hadoop/logs/userlogs/application_1473011986197_
>> 0001/container_1473011986197_0001_02_01
>> -Ddt.attr.APPLICATION_PATH=hdfs://master:54310/user/hadoopus
>> er/datatorrent/apps/application_1473011986197_0001
>> com.datatorrent.stram.StreamingAppMaster 1>/home/hadoopuser/hadoop/logs
>> /userlogs/application_1473011986197_0001/container_147301198
>> 6197_0001_02_01/AppMaster.stdout 2>/home/hadoopuser/hadoop/logs
>> /userlogs/application_1473011986197_0001/container_147301198
>> 6197_0001_02_01/AppMaster.stderr
>> Container killed on request. Exit code is 143
>> Container exited with a non-zero exit code 143
>> Failing this attempt. Failing the application.
>>
>>
>> --
>>
>>
>> Any help would be appreciated.
>>
>> Thank you in advance.
>>
>>
>>
>>
>


Re: Support for dynamic topology

2016-08-30 Thread Thomas Weise
Hi,

This depends on the operator Z. If it has multiple input ports and those
are optional, you can add P, then connect P to Z (and X), then remove Y.

If Z has a single port, then Z and everything downstream would need to be
removed or else the change won't result in a valid DAG and won't be
accepted.

Thomas


On Tue, Aug 30, 2016 at 1:48 PM, Hyunseok Chang 
wrote:

> So if I replace Y in pipeline X -> Y -> Z -> U -> W -> V with P, then what
> I would have is X -> P -> Z' -> U' -> W' -> V' ?   Where Z', U', W' and V'
> are new operator instances that need to be deployed along with P.
>
> Is my understanding correct?   If so is there any reason why we cannot
> re-use existing operators downstream?
>
> -hs
>
>
> On Tue, Aug 30, 2016 at 2:46 PM, Amol Kekre  wrote:
>
>>
>> Hyunseok,
>> The new route in the pipeline will have a new Z operator. If you want to
>> use the old Z operator (state?) then things get tricky. Do confirm that you
>> do not plan to use old Z operator.
>>
>> Thks,
>> Amol
>>
>>
>> On Tue, Aug 30, 2016 at 11:02 AM, Sandesh Hegde 
>> wrote:
>>
>>> Hello hs,
>>>
>>> Yes, you can change the topology from the Apex CLI.
>>>
>>> One possible sequence of commands for your scenario is described below,
>>>
>>> connect appid
>>> begin-logical-plan-change
>>> create-operator 
>>> add-stream-sink ...  ( for the input of P )
>>> add-stream-sink ... ( for the output of P )
>>> remove-operator ...
>>> submit
>>>
>>> Note: All the required operators needs to be in the package.
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Aug 30, 2016 at 7:22 AM Hyunseok Chang 
>>> wrote:
>>>
 Hi,

 I'd like to know more about Apex support for dynamic topology.

 From my readings on Apex, I understand we can add additional parallel
 tasks for each operator and change data partitioning among them dynamically
 at run time (so-called data partitioning and unification features).

 My question is can we change the "logical" DAG at run time?

 Let's say my logical DAG is a chain of three operators X, Y & Z (i.e.,
 X -> Y -> Z).  Now at run time I want to replace operator Y with operator
 P, such that the new logical DAG would look like X -> P -> Z.

 Is it something I can do with Apex?

 Thanks!
 -hs


>>
>


Re: kryo Serealization Exception

2016-08-22 Thread Thomas Weise
.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43) at 
> java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
> at com.datatorrent.stram.plan.logical.LogicalPlan$
> OperatorMeta.writeObject(LogicalPlan.java:804) at sun.reflect.
> NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.
> NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43) at 
> java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at
> java.util.HashMap.writeObject(HashMap.java:1129) at sun.reflect.
> NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.
> NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43) at 
> java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at
> com.datatorrent.stram.plan.logical.LogicalPlan.write(LogicalPlan.java:2068)
> at com.datatorrent.stram.StramClient.startApplication(StramClient.java:518)
> at 
> com.datatorrent.stram.client.StramAppLauncher.launchApp(StramAppLauncher.java:529)
> at com.datatorrent.stram.cli.DTCli$LaunchCommand.execute(DTCli.java:2050)
> at com.datatorrent.stram.cli.DTCli.launchAppPackage(DTCli.java:3456) at
> com.datatorrent.stram.cli.DTCli.access$7100(DTCli.java:106) at
> com.datatorrent.stram.cli.DTCli$LaunchCommand.execute(DTCli.java:1895) at
> com.datatorrent.stram.cli.DTCli$3.run(DTCli.java:1449) Caused by: 
> java.io.NotSerializableException:
> com.rbc.aml.silver.operator.AvroFileOutputOperator at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at
> com.esotericsoftware.kryo.serializers.JavaSerializer.write(JavaSerializer.java:30)
> ... 149 more
>
>
>
>
>
>
>
> Regards,
>
> Surya Vamshi
>
>
>
> *From:* Thomas Weise [mailto:thomas.we...@gmail.com]
> *Sent:* 2016, August, 22 3:09 PM
>
> *To:* users@apex.apache.org
> *Subject:* Re: kryo Serealization Exception
>
>
>
> There is some information available here:
>
>
>
> http://docs.datatorrent.com/troubleshooting/#application-
> throwing-following-kryo-exception
>
>
>
> If the object is Java serializable, you can set the stream codec or wrap
> into KryoJdkContainer:
>
>
>
> https://github.com/apache/apex-malhar/tree/master/
> library/src/main/java/com/datatorrent/lib/codec
>
>
>
>
>
>
>
>
>
>
>
> On Mon, Aug 22, 2016 at 11:42 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <
> suryavamshivardhan.mukkam...@rbc.com> wrote:
>
> Hi Tushar,
>
> In Our case, Generic Record fields are generated at run time from
> database.  I cannot convert into a predefined POJO to pass through output
> port.
> Is it mandatory that Generic Record class must have no-arg constructor for
> kryo serialization ?
>
> Regards,
> Surya Vamshi
> -Original Message-
> From: Tushar Gosavi [mailto:tus...@datatorrent.com]
> Sent: 2016, August, 20 2:33 AM
> To: users@apex.apache.org
> Subject: Re: kryo Serealization Exception
>
> 

Re: kryo Serealization Exception

2016-08-22 Thread Thomas Weise
There is some information available here:

http://docs.datatorrent.com/troubleshooting/#application-throwing-following-kryo-exception

If the object is Java serializable, you can set the stream codec or wrap
into KryoJdkContainer:

https://github.com/apache/apex-malhar/tree/master/library/src/main/java/com/datatorrent/lib/codec
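
As an illustration (not code from this thread), one way the stream codec route
can look inside the downstream operator class, assuming the tuple type (a
hypothetical MyTuple) implements java.io.Serializable:

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.StreamCodec;
import com.datatorrent.lib.codec.JavaSerializationStreamCodec;

// Field inside the downstream operator class; MyTuple is a hypothetical
// Serializable tuple that Kryo cannot construct (no no-arg constructor).
public final transient DefaultInputPort<MyTuple> input = new DefaultInputPort<MyTuple>()
{
  @Override
  public void process(MyTuple tuple)
  {
    // regular tuple processing goes here
  }

  @Override
  public StreamCodec<MyTuple> getStreamCodec()
  {
    // Use Java serialization instead of the default Kryo codec on this stream.
    return new JavaSerializationStreamCodec<MyTuple>();
  }
};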





On Mon, Aug 22, 2016 at 11:42 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <
suryavamshivardhan.mukkam...@rbc.com> wrote:

> Hi Tushar,
>
> In Our case, Generic Record fields are generated at run time from
> database.  I cannot convert into a predefined POJO to pass through output
> port.
> Is it mandatory that Generic Record class must have no-arg constructor for
> kryo serialization ?
>
> Regards,
> Surya Vamshi
> -Original Message-
> From: Tushar Gosavi [mailto:tus...@datatorrent.com]
> Sent: 2016, August, 20 2:33 AM
> To: users@apex.apache.org
> Subject: Re: kryo Serealization Exception
>
> Hi
>
> Another option is to create your own Java object and populate the fields
> you need for further processing from GenericRecord, and send it on the
> output port. You can use this approach if you can not put the operators in
> single container, because 1) you need to shuffle based on key or 2)
> resource constraints.
>
> -Tushar.
>
>
> On Sat, Aug 20, 2016 at 3:23 AM, Devendra Tagare <
> devend...@datatorrent.com> wrote:
> > Hi,
> >
> > You can set the Locality of the parser and the writer to Container local.
> >
> > This will ensure that Generic Record from the parser does not get
> > serialized between containers.
> >
> > Thanks,
> > Dev
> >
> > On Fri, Aug 19, 2016 at 2:21 PM, Mukkamula, Suryavamshivardhan
> > (CWM-NR)  wrote:
> >>
> >> Hi,
> >>
> >> Can you please help resolve the below issue?
> >>
> >> In our project we are using ‘org.apache.avro.generic.GenericRecord’
> >> as Tuple writing to a parquet file and we are using avro schema for
> >> each record. We are getting the below exception, I suppose
> >> GenericRecord does not have no-arg constructor, and looking for some
> ideas to solve this problem.
> >>
> >> # Exception ##
> >>
> >> 2016-08-19 16:29:12,845 [5/silverFileOut:AvroFileOutputOperator]
> >> ERROR codec.Def aultStatefulStreamCodec fromDataStatePair -
> >> Catastrophic Error: Execution halted due to Kryo exception!
> >> com.esotericsoftware.kryo.KryoException: Class cannot be created
> >> (missing no-arg
> >> constructor): org.apache.avro.generic.GenericData$Record
> >> at
> >> com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstant
> >> iatorOf(Kryo.java:1228)
> >> at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.
> java:1049)
> >> at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1058)
> >> at
> >> com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSer
> >> ializer.java:547)
> >> at
> >> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSeria
> >> lizer.java:523)
> >> at
> >> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
> >> at
> >> com.datatorrent.stram.codec.DefaultStatefulStreamCodec.fromDataStateP
> >> air(DefaultStatefulStreamCodec.java:99)
> >> at
> >> com.datatorrent.stram.stream.BufferServerSubscriber$BufferReservoir.p
> >> rocessPayload(BufferServerSubscriber.java:364)
> >> at
> >> com.datatorrent.stram.stream.BufferServerSubscriber$BufferReservoir.s
> >> weep(BufferServerSubscriber.java:316)
> >> at
> >> com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:252)
> >> at
> >> com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContai
> >> ner.java:1382)
> >> 2016-08-19 16:30:09,336 [main] INFO  stram.StreamingContainerManager
> >> updateCheck
> >>
> >> Regards,
> >> Surya Vamshi
> >>
> >>
> >
> >
>

Re: Local CLI Logging

2016-08-19 Thread Thomas Weise
Hi Alex,

The CLI is automatically changing the threshold for the log4j console
appender. You can supply - to the CLI to enable output at debug level.
You can also supply your own log4j.properties with customized loggers
through the environment (the example below also includes how you enable
debugging):

DT_CLIENT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n
-Dlog4j.configuration=file:///somewhere/mylog4j.properties

HTH,
Thomas


On Fri, Aug 19, 2016 at 7:52 AM, McCullough, Alex <
alex.mccullo...@capitalone.com> wrote:

> Hey,
>
>
>
> I have an app I am launching an app through the local cli
> (apex-cor/engine/src/main/scripts/apex, then launch –local) and I was
> wondering if there was a way to get the logs to print to the apex console?
> I could log to some file and then tail the file separately, but it would be
> easier if I could launch the app and then watch my logs.
>
>
>
> Thanks,
>
> Alex
>
>


Re: Round Robin Partitioning with Dynamic Partitioning

2016-08-12 Thread Thomas Weise
If idempotency is needed (replay on recovery) and the order of tuples in
the stream can change within a window (multiple upstream partitions), then
it may be better to use a stateless hash function that ensures even
distribution.
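
As an illustration of the stateless approach (not code from this thread), a
codec whose partition value depends only on the tuple, so a replay after
recovery routes each tuple to the same partition; LineTuple and its getKey()
are hypothetical:

import com.datatorrent.lib.codec.KryoSerializableStreamCodec;

public class KeyHashCodec extends KryoSerializableStreamCodec<LineTuple>
{
  @Override
  public int getPartition(LineTuple tuple)
  {
    // No state is kept between tuples; the mask keeps the value non-negative
    // and the platform maps it onto the physical partitions.
    return tuple.getKey().hashCode() & 0x7fffffff;
  }
}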


On Fri, Aug 12, 2016 at 10:10 AM, Munagala Ramanath 
wrote:

> I assume the situation is that you have operators A -> {B1, B2, ...} where
> Bi are partitions of B and
> you want to distribute incoming tuples from A to the Bi in round-robin
> fashion.
>
> For this, you'll need to create a StreamCodec; please see:
> https://github.com/DataTorrent/examples/blob/master/tutorials/partition/
> src/main/java/com/example/myapexapp/Codec3.java
>
> For round robin, you'll probably need to save the current partition id
> (i.e. an integer in the range 0...N
> where N is the number of partitions) and increment it mod N for each input
> tuple.
>
> Ram
>
> On Fri, Aug 12, 2016 at 9:53 AM, McCullough, Alex <
> alex.mccullo...@capitalone.com> wrote:
>
>> Thanks Ram.
>>
>>
>>
>> If I didn’t want dynamic partitioning and just round robin on a fixed #
>> of partitions, can it just be set through a property? If so, what is the
>> property?
>>
>>
>>
>> *From: *Munagala Ramanath 
>> *Reply-To: *"users@apex.apache.org" 
>> *Date: *Friday, August 12, 2016 at 12:50 PM
>> *To: *"users@apex.apache.org" 
>> *Subject: *Re: Round Robin Partitioning with Dynamic Partitioning
>>
>>
>>
>> Alex,
>>
>>
>>
>> Please take a look at: https://github.com/DataTor
>> rent/examples/blob/master/tutorials/dynamic-partition/
>> src/main/java/com/example/dynamic/Gen.java
>>
>>
>>
>> It shows an operator that implements both the *Partitioner* and the
>> *StatsListener* interface.
>>
>> The *processStats()* method of the latter interface checks the number of
>> emitted tuples from the argument
>>
>> and, if the count exceeds 500, it sets the number of partitions to
>> *MAX_PARTITIONS* (which is 4).
>>
>> The number of partitions initially starts out at 2.
>>
>>
>>
>> The *definePartitions()* method of the former interface checks the
>> desired number of partitions and
>>
>> if it differs from the current partition count, it performs a dynamic
>> repartition. If you turn on *DEBUG*,
>>
>> you should see log messages at some of these steps.
>>
>>
>>
>> Naturally, this is a contrived example to illustrate how to exercise the
>> functionality.
>>
>>
>>
>> Let me know if any of this is not clear or if you have further questions.
>>
>>
>>
>> Ram
>>
>>
>>
>>
>>
>> On Fri, Aug 12, 2016 at 9:36 AM, McCullough, Alex <
>> alex.mccullo...@capitalone.com> wrote:
>>
>> I can’t find any examples of how to set round robin partitioning or how
>> to set dynamic partitioning, are there any example applications I can look
>> at?
>>
>>
>>
>> The DataTorrent/Examples github page has a dynamic partitioning example
>> but it just looks to be a shell, I don’t see any actual logic implemented…
>>
>>
>>
>> Thanks,
>>
>> Alex
>>
>>
>>
>
>


Re: Kafka Input Operator questions

2016-06-29 Thread Thomas Weise
Hi Eric,

I would recommend using the latest version of Apex Core (3.4.0) as well as
the latest version of Malhar (also 3.4.0).

My responses inline based on above versions:

On Wed, Jun 29, 2016 at 5:54 AM, Martin, Eric 
wrote:

> Hi,
>
>
>
> I have a few quick questions around the Kafka Input Operators, to confirm
> that I am implementing them correctly.
>
>
>
> 1)We are using kafka version 0.8.x, and I want to confirm that the
> 0.8.x  input operators for kafka are located in
> https://github.com/apache/apex-malhar/tree/master/contrib/src/main/java/com/datatorrent/contrib/kafka
>

Yes, this is the location for the Kafka 0.8.x compatible consumer.


>
> 2)For setting zookeeper and topic in the input operator, I see that
> setZookeeper and setTopic methods are both deprecated. I just want to
> confirm that I should not be creating a consumer instance and setting these
> properties in the consumer vs the operator level I am currently setting
> them on the operator via the properties file as shown below and I want to
> confirm that this is correct:
>
>
>
> <*property*>
> <*name*>dt.operator.kafkaInputOperator.prop.topic
> <*value*>
> 
>
>
>


Please use the setters of the consumer instead. Path would be dt.operator.
kafkaInputOperator.prop.consumer.topic


> 3)For exactly once processing, we in examples we see that we need to
> setIdempotentStorageManager on the on the input operator. It appears though
> that IdempotentStorageManager is deprecated. Is there an alternative that
> we can use if we are running 3.3.0 of apex engine and 3.3.1 of Malhar?
>

The replacement is WindowDataManager and the operator was updated in v3.4.0
to reflect this.
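
A rough sketch of what that wiring can look like in populateDAG(), assuming
Malhar 3.4.x; the setter name and the location of the FS implementation (shown
here as the nested WindowDataManager.FSWindowDataManager, a top-level class in
later releases) may differ slightly between versions:

import com.datatorrent.contrib.kafka.KafkaSinglePortStringInputOperator;
import org.apache.apex.malhar.lib.wal.WindowDataManager;

// inside populateDAG(DAG dag, Configuration conf)
KafkaSinglePortStringInputOperator kafkaInput =
    dag.addOperator("kafkaInputOperator", new KafkaSinglePortStringInputOperator());
// Persists per-window offsets to HDFS so windows are replayed identically after failure.
kafkaInput.setWindowDataManager(new WindowDataManager.FSWindowDataManager());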



>
>
>


Re: Kafka input operator

2016-06-19 Thread Thomas Weise
kafka.message.Message is the problem; MutablePair has a no-arg constructor
and should be serializable for Kryo.
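
A sketch of the kind of Kryo-friendly holder this points toward: instead of
emitting kafka.message.Message, copy the fields you need into a plain POJO
with a default constructor (the class and field names here are hypothetical):

public class KafkaRecord
{
  private int partitionId;
  private long offset;
  private byte[] payload;

  // Kryo requires a no-arg constructor to deserialize the tuple.
  public KafkaRecord()
  {
  }

  public KafkaRecord(int partitionId, long offset, byte[] payload)
  {
    this.partitionId = partitionId;
    this.offset = offset;
    this.payload = payload;
  }

  public int getPartitionId() { return partitionId; }
  public long getOffset() { return offset; }
  public byte[] getPayload() { return payload; }
}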


On Sun, Jun 19, 2016 at 3:10 PM,  wrote:

> The Pairs in Apache common are not Kryo serializable. You can use other
> pair data structure. For example KeyValuePair in Malhar library
>
> Siyuan
>
> Sent from my iPhone
>
> On Jun 19, 2016, at 14:58, Raja.Aravapalli 
> wrote:
>
>
> Hi Priyanka,
>
> I am writing to read the messages in the next operator with input port
> defined like the below,
>
> public transient DefaultInputPort Integer>>> input = new DefaultInputPort MutablePair>>()
>
>
> Application is failing with below exception:
>
> 2016-06-19 16:54:45,498 ERROR codec.DefaultStatefulStreamCodec 
> (DefaultStatefulStreamCodec.java:fromDataStatePair(98)) - Catastrophic Error: 
> Execution halted due to Kryo exception!
> com.esotericsoftware.kryo.KryoException: Class cannot be created (missing 
> no-arg constructor): kafka.message.Message
> Serialization trace:
> left (org.apache.commons.lang3.tuple.MutablePair)
>   at 
> com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1228)
>   at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1049)
>   at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1058)
>
>
>
> Any help please.
>
> Regards,
> Raja.
>
> From: "Raja.Aravapalli" 
> Reply-To: "users@apex.apache.org" 
> Date: Sunday, June 19, 2016 at 12:22 AM
> To: "users@apex.apache.org" 
> Subject: Re: Kafka input operator
>
>
> Thanks for the response Priyanka…
>
> But, when I try to put in my own package, some of the protected variables
> are not accessible
>
>
> Regards,
> Raja.
>
> From: Priyanka Gugale 
> Reply-To: "users@apex.apache.org" 
> Date: Saturday, June 18, 2016 at 10:29 AM
> To: "users@apex.apache.org" 
> Subject: Re: Kafka input operator
>
> Hi,
>
> Yes sure, you can use any package name you want. In fact better you put
> this class outside Malhar jar. Just keep the Malhar jar in your class path.
>
> -Priyanka
> On Jun 17, 2016 8:03 PM, "Raja.Aravapalli" 
> wrote:
>
>>
>> Hi Priyanka,
>>
>> Can this be done from a class outside the package “
>> com.datatorrent.contrib.kafka;” ?
>>
>> I don’t want to disturb the source :(
>>
>>
>>
>> Regards,
>> Raja.
>>
>> From: "Raja.Aravapalli" 
>> Date: Friday, June 17, 2016 at 5:38 AM
>> To: "users@apex.apache.org" 
>> Subject: Re: Kafka input operator
>>
>>
>> Hi Priyanka,
>>
>> I am using kafka version 0.8.x.
>>
>> Awesome. Yes. This is what is want. I shall test this and share my
>> updates. Having one kafka operator like this in Malhar, will be a very good
>> one. I don’t see such availability in Storm as well!!
>>
>>
>>
>> Regards,
>> Raja.
>>
>> From: Priyanka Gugale 
>> Reply-To: "users@apex.apache.org" 
>> Date: Friday, June 17, 2016 at 2:05 AM
>> To: "users@apex.apache.org" 
>> Subject: Re: Kafka input operator
>>
>> Hi Raja,
>>
>> I have quickly wrote an operator to fulfill your requirement. The code is
>> available here
>> .
>> Let me know if this addresses your usecase.
>>
>> -Priyanka
>>
>> On Fri, Jun 17, 2016 at 11:32 AM, Priyanka Gugale <
>> priya...@datatorrent.com> wrote:
>>
>>> Hi Raja,
>>>
>>> You will need to update other places as well (I guess it's replay other
>>> than emitTuples) . But I think it is not feasible to replicate emitTuples
>>> code in subclass as many of the parent class variables are private. I would
>>> try to figure out if there is any other way.
>>>
>>> Can you please confirm which Kafka version you are using?
>>>
>>> -Priyanka
>>>
>>> On Thu, Jun 16, 2016 at 8:39 PM, Raja.Aravapalli <
>>> raja.aravapa...@target.com> wrote:
>>>

 Hi Chaitanya,

 Would the below changes you proposed enough to retrieve partition &
 offset ?

 I see *emitTuple(Message msg) i*s being called at various places in
 the code… please advise. Thank you.


 Regards,
 Raja.

 From: "Raja.Aravapalli" 
 Date: Tuesday, June 14, 2016 at 9:50 PM
 To: "users@apex.apache.org" 
 Subject: Re: Kafka input operator


 Thanks for the response Chaitanya. I will follow the suggestions to
 retrieve Kafka partitionId & offset!!


 Regards,
 Raja.

 From: Chaitanya Chebolu 
 Reply-To: "users@apex.apache.org" 
 Date: Monday, June 13, 2016 at 3:06 AM
 To: 

Re: A few questions about the Kafka 0.9 operator

2016-06-14 Thread Thomas Weise
You can also have a look at this blog and linked example that specifically
covers exactly-once with input from Kafka:

https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/


On Tue, Jun 14, 2016 at 2:47 PM, Thomas Weise <thomas.we...@gmail.com>
wrote:

> See response below:
>
> On Tue, Jun 14, 2016 at 12:41 PM, Ananth Gundabattula <
> agundabatt...@gmail.com> wrote:
>
>> Hello Siyuan/All,
>>
>> I have a couple of questions regarding the Kafka 0.9 operator. Could you
>> please help me in understanding this operator a bit better?
>>
>>
>>- As stated in
>>http://www.slideshare.net/ApacheApex/apache-apex-kafka-input-operator
>>, kafka 0.9 operator stores it "check-pointed offsets" in Kafka itself
>>using the App name ? It sounds like -originalAppID is not used by this
>>operator at all - In other words, I cant force an app to process starting
>>from the beginning until I change the App name if the App is based on the
>>Kafka 0.9 operator as the input operator
>>
>> The start offset configuration option should determine where the operator
> starts consuming on cold start (earliest, latest, last consumed). If that's
> not the case then it would be a bug. Siyuan, please comment.
>
>>
>>-
>>- How does the kafka 0.9 operator handle downstream operators failure
>>? By this I mean, an Apex downstream operator fails, and is brought back 
>> up
>>by STRAM. However this operator was significantly lagging behind the
>>current window of the kafka 0.9 operator window. Does the buffer server
>>within the Kafka 0.9 operator buffer many windows to handle this situation
>>? ( and hence replays accordingly ? ) . I ask this to fine tune the buffer
>>memory property.
>>
>> The upstream buffer server will hold the data until processed by the
> downstream operator. The buffer server, by default, will start to spool to
> disk when the allocated memory is used up. Back pressure will cause the
> consumer to slow down accordingly.
>
>>
>>- Is EXACTLY_ONCE processing supported in this operator ? if yes, is
>>it fair to assume that HDFS would be used to manage this type of
>>configuration ?
>>
>> Yes, when you enable idempotency on the operator, exactly once processing
> semantics in downstream operators are supported (affects those that write
> to external systems). To enable this you can configure to use the window
> data manager that writes to HDFS, essentially it will keep track of the
> consumer offsets for each window.
>
>>
>>-
>>- Is EXACTLY_ONCE based off the streaming window or the Application
>>Window in Apex ?
>>
>> The operator only sees the "application window". Make sure to align the
> checkpoint window interval.
>
> For more information about the Kafka input operator, please see:
> http://www.slideshare.net/ApacheApex/apache-apex-kafka-input-operator
>
>
>
>


Re: A few questions about the Kafka 0.9 operator

2016-06-14 Thread Thomas Weise
See response below:

On Tue, Jun 14, 2016 at 12:41 PM, Ananth Gundabattula <
agundabatt...@gmail.com> wrote:

> Hello Siyuan/All,
>
> I have a couple of questions regarding the Kafka 0.9 operator. Could you
> please help me in understanding this operator a bit better?
>
>
>- As stated in
>http://www.slideshare.net/ApacheApex/apache-apex-kafka-input-operator
>, kafka 0.9 operator stores it "check-pointed offsets" in Kafka itself
>using the App name ? It sounds like -originalAppID is not used by this
>operator at all - In other words, I cant force an app to process starting
>from the beginning until I change the App name if the App is based on the
>Kafka 0.9 operator as the input operator
>
> The start offset configuration option should determine where the operator
starts consuming on cold start (earliest, latest, last consumed). If that's
not the case then it would be a bug. Siyuan, please comment.

>
>-
>- How does the kafka 0.9 operator handle downstream operators failure
>? By this I mean, an Apex downstream operator fails, and is brought back up
>by STRAM. However this operator was significantly lagging behind the
>current window of the kafka 0.9 operator window. Does the buffer server
>within the Kafka 0.9 operator buffer many windows to handle this situation
>? ( and hence replays accordingly ? ) . I ask this to fine tune the buffer
>memory property.
>
> The upstream buffer server will hold the data until processed by the
downstream operator. The buffer server, by default, will start to spool to
disk when the allocated memory is used up. Back pressure will cause the
consumer to slow down accordingly.

>
>- Is EXACTLY_ONCE processing supported in this operator ? if yes, is
>it fair to assume that HDFS would be used to manage this type of
>configuration ?
>
> Yes, when you enable idempotency on the operator, exactly once processing
semantics in downstream operators are supported (affects those that write
to external systems). To enable this you can configure to use the window
data manager that writes to HDFS, essentially it will keep track of the
consumer offsets for each window.

>
>-
>- Is EXACTLY_ONCE based off the streaming window or the Application
>Window in Apex ?
>
> The operator only sees the "application window". Make sure to align the
checkpoint window interval.
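
To make the above concrete, a hedged sketch against the Malhar 0.9 operator
(org.apache.apex.malhar.kafka); the setter and attribute names follow the
usual Apex/Malhar conventions rather than being copied from this thread, and
the window counts are illustrative:

import com.datatorrent.api.Context.OperatorContext;
import org.apache.apex.malhar.kafka.KafkaSinglePortInputOperator;

// inside populateDAG(DAG dag, Configuration conf)
KafkaSinglePortInputOperator kafkaIn =
    dag.addOperator("kafkaIn", new KafkaSinglePortInputOperator());
// Cold-start position: earliest, latest, or resume from the application's last consumed offset.
kafkaIn.setInitialOffset("APPLICATION_OR_EARLIEST");

// Align the checkpoint boundary with the application window the operator sees.
dag.setAttribute(kafkaIn, OperatorContext.APPLICATION_WINDOW_COUNT, 10);
dag.setAttribute(kafkaIn, OperatorContext.CHECKPOINT_WINDOW_COUNT, 10);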

For more information about the Kafka input operator, please see:
http://www.slideshare.net/ApacheApex/apache-apex-kafka-input-operator


Re: kafka input is processing records in a jumbled order

2016-06-07 Thread Thomas Weise
Raja,

Are you expecting ordering across multiple Kafka partitions?

All messages from a given Kafka partition are received by the same consumer
and thus will be ordered. However, when messages come from multiple
partitions there is no such guarantee.

Thomas


On Tue, Jun 7, 2016 at 3:34 PM, Raja.Aravapalli 
wrote:

>
> Hi
>
> I have built a DAG, that reads from kafka and in the next operators, does
> lookup to a hbase table and update hbase table based on some business
> logic.
>
> Some times my operator which does hbase lookup and update in the same
> operator(Custom written), is processing the records it receives from kafka
> in a jumbled order, which is causing, many records being ignored from
> processing!!
>
> I am not using any parallel partitions/instance, and with
> KafkaInputOperator I am using only partition strategy ONE_TO_MANY.
>
> I am very new to Apex. I expected, Apex will guarantee the ordering.
>
> Can someone pls share your knowledge on the issue…?
>
>
> Thanks a lot in advance…
>
>
> Regards,
> Raja.
>


Re: kafka offset commit

2016-06-06 Thread Thomas Weise
Hi Raja,

Which Kafka version are you using?

With the new 0.9 connector there is no need for the offset manager:

https://github.com/apache/apex-malhar/tree/master/kafka/src/main/java/org/apache/apex/malhar/kafka

Thanks,
Thomas


On Mon, Jun 6, 2016 at 3:06 PM, Raja.Aravapalli 
wrote:

> Hi
>
> Can someone please help me understand, where will the offsets be stored
> when consuming with “*KafkaSinglePortStringInputOperator*”  ?
>
> And, how to handle restarts ?
>
>
> I worked with Storm earlier, Storm maintains the offsets in zookeeper and
> client id is maintained for every consumer, using which
>
> - we can see what is the current offset status for a given partition &
> modify them as well using zookeeper-cli !!
> - restarts can be handled
>
>
> As per the Apex documentation, I can see, that using OffsetManager we can
> handle the restarts effectively, but couldn’t find any examples to refer…
>
> How clientId can be used to retrieve offsets status
> And ability to edit the offsets etc
>
> can someone pls help me find this ?
>
>
> Thanks a lot!!
>
>
> -Regards,
> Raja.
>
>
>
>


Re: avro deserialization fails when using kafka

2016-06-06 Thread Thomas Weise
Since you are creating the decoder in setup(), please mark the property
transient. No need to checkpoint it.

Thanks,
Thomas
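
Applied to the operator code quoted below, the change is limited to the field
declaration; setup() already rebuilds the decoder whenever the operator is
(re)deployed:

private String schemaRegURL;                 // still part of the checkpointed state
private transient KafkaAvroDecoder decoder;  // transient: recreated in setup(), not checkpointed

@Override
public void setup(OperatorContext context)
{
    Properties props = new Properties();
    props.setProperty("schema.registry.url", this.schemaRegURL);
    this.decoder = new KafkaAvroDecoder(new VerifiableProperties(props));
}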


On Mon, Jun 6, 2016 at 10:06 AM, Munagala Ramanath 
wrote:

>
> http://docs.datatorrent.com/troubleshooting/#application-throwing-following-kryo-exception
>
> Please try the suggestions at the above link.
>
> It appears from
>
> https://github.com/uber/confluent-schema-registry/blob/master/avro-serializer/src/main/java/io/confluent/kafka/serializers/KafkaAvroDecoder.java
> that the class does not have a default constructor.
>
> Ram
>
> On Mon, Jun 6, 2016 at 9:40 AM, Raja.Aravapalli <
> raja.aravapa...@target.com> wrote:
>
>>
>> Hi,
>>
>> I am trying to read data from kafka, and my input in kafka is avro
>> messages.
>>
>> So I am using class “KafkaSinglePortByteArrayInputOperator” to emit
>> records from kafka.. And in the next operator I am reading input as
>> "byte[]” and deserializing the message!!
>>
>> But the tuple deserialization is failing with below error in the log…
>>
>> Can someone pls share your thoughts and help me fix this?
>>
>>
>> Caused by: com.esotericsoftware.kryo.KryoException: Class cannot be created 
>> (missing no-arg constructor): io.confluent.kafka.serializers.KafkaAvroDecoder
>> Serialization trace:
>> decoder (com.tgt.mighty.apexapps.AvroBytesConversionOperator)
>>  at 
>> com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1228)
>>  at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1049)
>>  at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1058)
>>  at 
>> com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:547)
>>  at 
>> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:523)
>>  at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
>>
>>
>>
>> *Code FYR:*
>>
>>
>> *Application.java* file:
>>
>> public void populateDAG(DAG dag, Configuration conf)
>> {
>>   //KafkaSinglePortStringInputOperator kafkaInput =  
>> dag.addOperator("Kafka_Input", KafkaSinglePortStringInputOperator.class);
>>
>>   KafkaSinglePortByteArrayInputOperator kafkaInput =  
>> dag.addOperator("Kafka_Input", new KafkaSinglePortByteArrayInputOperator());
>>
>>   AvroBytesConversionOperator avroConversion = 
>> dag.addOperator("Avro_Convert", new 
>> AvroBytesConversionOperator(*“*schemaRegURL"));
>>
>>   HDFSWrite hdfs = dag.addOperator("To_HDFS", HDFSWrite.class);
>>
>>   //dag.addStream("Kafka_To_Hdfs_Ingestion", kafkaInput.outputPort, 
>> hdfs.input);
>>   dag.addStream("Kafka_Avro_Msg_Byte_Stream", kafkaInput.outputPort, 
>> avroConversion.input);
>>   dag.addStream("Avro_To_String_Stream", avroConversion.output, hdfs.input);
>>
>> }
>>
>>
>> *Operator Code:*
>>
>> public class AvroBytesConversionOperator extends BaseOperator {
>>
>> private String schemaRegURL;
>> private KafkaAvroDecoder decoder;
>>
>> public AvroBytesConversionOperator(){
>>
>> }
>>
>> public AvroBytesConversionOperator(String schemaRegURL){
>> this.schemaRegURL = schemaRegURL;
>> }
>>
>> /**
>>  * Defines Input Port - DefaultInputPort
>>  * Accepts data from the upstream operator
>>  * Type byte[]
>>  */
>> public transient DefaultInputPort input = new 
>> DefaultInputPort() {
>> @Override
>> public void process(byte[] tuple)
>> {
>> processTuple(tuple);
>> }
>> };
>>
>>
>> /**
>>  * Defines Output Port - DefaultOutputPort
>>  * Sends data to the down stream operator which can consume this data
>>  * Type String
>>  */
>> public transient DefaultOutputPort output = new 
>> DefaultOutputPort();
>>
>>
>> /**
>>  * Setup call
>>  */
>> @Override
>> public void setup(OperatorContext context)
>> {
>> Properties props = new Properties();
>> props.setProperty("schema.registry.url", this.schemaRegURL);
>> this.decoder = new KafkaAvroDecoder(new VerifiableProperties(props));
>> }
>>
>> /**
>>  * Begin window call for the operator.
>>  * @param windowId
>>  */
>> public void beginWindow(long windowId)
>> {
>>
>> }
>>
>> /**
>>  * Defines what should be done with each incoming tuple
>>  */
>> protected void processTuple(byte[] tuple)
>> {
>> GenericRecord record = (GenericRecord) decoder.fromBytes(tuple);
>> output.emit(record.toString());
>> }
>>
>> /**
>>  * End window call for the operator
>>  * If sending per window, emit the updated counts here.
>>  */
>> @Override
>> public void endWindow()
>> {
>>
>> }
>>
>> }
>>
>>
>