[ANNOUNCE] Apache Apex Core 3.4.0 released

2016-05-12 Thread Thomas Weise
The Apache Apex community is pleased to announce release 3.4.0 (Apex Core).

This is the first release after graduation. It adds support for
(anti-)affinity of operators, several documentation and other improvements,
and important bug fixes. The release has 63 resolved JIRAs.

A backward-incompatible change was necessary to address a security issue in
a third-party library. Applications that use WebSocketInputOperator and
WebSocketOutputOperator, or any derivative of these two classes, on 3.3.x or
below will need to be recompiled against the upcoming Apex Malhar 3.4.0
release to run on Apex Core 3.4.x.

Changes:
https://github.com/apache/incubator-apex-core/blob/v3.4.0/CHANGELOG.md

Apache Apex is an enterprise grade native YARN big data-in-motion platform
that unifies stream and batch processing. Apex was built for
scalability and low-latency processing, high availability and operability.

Apache Apex is Java based and strives to ease application development on a
platform that takes care of aspects such as stateful fault tolerance,
partitioning, processing guarantees, buffering and synchronization,
auto-scaling etc. Apex comes with Malhar, a rich library of pre-built
operators, including adapters that integrate with existing technologies as
sources and destinations, like message buses, databases, files or social
media feeds.

The source release can be found at:

http://www.apache.org/dyn/closer.lua/incubator/apex/v3.4.0/apex-3.4.0-source-release.tar.gz

or visit:

http://apex.apache.org/downloads.html

We welcome your help and feedback. For more information on the project and
how to get involved, visit our website at:

http://apex.apache.org/

Regards,
The Apache Apex community


Re: YARN memory settings and the Apex memory model

2016-05-12 Thread Thomas Weise
Ananth,

Please have a look at:

http://docs.datatorrent.com/troubleshooting/#configuring-memory

Thanks,
Thomas
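
For a rough feel of the arithmetic raised in the question quoted below, here
is a minimal Java sketch. It assumes, without confirmation from this thread,
that a container's memory request is approximately the sum of the MEMORY_MB
values of the operators placed in it plus the buffer server memory. The class
name, the method name and the 512 MB buffer figure are made up for
illustration; the page linked above remains the authoritative description.

```java
import java.util.Arrays;
import java.util.List;

// Back-of-the-envelope estimate only; the additive model is an assumption.
class ContainerMemoryEstimate {

    /** Adds up per-operator memory (MB) and the buffer server memory (MB). */
    static int estimateMb(List<Integer> operatorMemoryMb, int bufferServerMb) {
        int total = bufferServerMb;
        for (int mb : operatorMemoryMb) {
            total += mb;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three co-located operators at 2048 MB each, plus a hypothetical 512 MB buffer.
        int total = estimateMb(Arrays.asList(2048, 2048, 2048), 512);
        System.out.println(total + " MB"); // prints "6656 MB"
    }
}
```

Under the same assumption, operators spread over separate containers would
each produce a smaller per-container figure, while the total across the
application would not shrink.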


On Thu, May 12, 2016 at 4:00 AM, Ananth Gundabattula <
agundabatt...@gmail.com> wrote:

> Thanks Shubham. I shall bump up the memory a bit more.
>
> I was wondering how the operator memory relates to the YARN container
> memory settings? Or does it depend on the deployment model?
>
> For example, if the deployment model is thread local, does the YARN
> container need to be configured (considering the above example) for at
> least 2048 * number of operators + buffer server size?
>
> If the deployment model were not thread local, would that lower the memory
> requirement per YARN container?
>
> Regards,
> Ananth
>
> On Thu, May 12, 2016 at 7:19 PM, Shubham Pathak 
> wrote:
>
>> Hello Ananth,
>>
>> Looks like the operator requires more memory.
>> You may add this property to have more memory allocated to the container.
>>
>> In properties.xml, for operator O in the application, you may specify the
>> property:
>>
>> <property>
>>   <name>dt.operator.O.attr.MEMORY_MB</name>
>>   <value>2048</value>
>> </property>
>>
>> Thanks,
>> Shubham
>>
>> On Thu, May 12, 2016 at 1:35 PM, Ananth Gundabattula <
>> agundabatt...@gmail.com> wrote:
>>
>>> Hello All,
>>>
>>> I am seeing the following log from the web UI occasionally when my
>>> operators are getting killed. Is there any way I can control the memory
>>> settings that are used to communicate with YARN when negotiating a
>>> container?
>>>
>>> How do the typical YARN settings for container heap and max memory
>>> relate to the Apex memory allocation model?
>>>
>>> The info messages I see in the web UI are as follows:
>>>
>>> Container [pid=14699,containerID=container_1462863487071_0015_01_12] is 
>>> running beyond physical memory limits. Current usage: 1.5 GB of 1.5 GB 
>>> physical memory used; 6.1 GB of 3.1 GB virtual memory used. Killing 
>>> container.
>>> Dump of the process-tree for container_1462863487071_0015_01_12 :
>>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
>>> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>>> |- 14817 14699 14699 14699 (java) 1584 1654 6426968064 393896 
>>> /usr/java/default/bin/java -Xmx4429185024 
>>> -Ddt.attr.APPLICATION_PATH=hdfs://dwh109.qaperf2.sac.int.threatmetrix.com:8020/user/dtadmin/datatorrent/apps/application_1462863487071_0015
>>>  
>>> -Djava.io.tmpdir=/data3/yarn/nm/usercache/root/appcache/application_1462863487071_0015/container_1462863487071_0015_01_12/tmp
>>>  -Ddt.cid=container_1462863487071_0015_01_12 
>>> -Dhadoop.root.logger=INFO,RFA 
>>> -Dhadoop.log.dir=/data3/yarn/container-logs/application_1462863487071_0015/container_1462863487071_0015_01_12
>>>  -Ddt.loggers.level=com.datatorrent.*:INFO,org.apache.*:INFO 
>>> com.datatorrent.stram.engine.StreamingContainer
>>> |- 14699 14697 14699 14699 (bash) 1 2 108646400 303 /bin/bash -c 
>>> /usr/java/default/bin/java  -Xmx4429185024  
>>> -Ddt.attr.APPLICATION_PATH=hdfs://dwh109.qaperf2.sac.int.threatmetrix.com:8020/user/dtadmin/datatorrent/apps/application_1462863487071_0015
>>>  
>>> -Djava.io.tmpdir=/data3/yarn/nm/usercache/root/appcache/application_1462863487071_0015/container_1462863487071_0015_01_12/tmp
>>>  -Ddt.cid=container_1462863487071_0015_01_12 
>>> -Dhadoop.root.logger=INFO,RFA 
>>> -Dhadoop.log.dir=/data3/yarn/container-logs/application_1462863487071_0015/container_1462863487071_0015_01_12
>>>  -Ddt.loggers.level=com.datatorrent.*:INFO,org.apache.*:INFO 
>>> com.datatorrent.stram.engine.StreamingContainer 
>>> 1>/data3/yarn/container-logs/application_1462863487071_0015/container_1462863487071_0015_01_12/stdout
>>>  
>>> 2>/data3/yarn/container-logs/application_1462863487071_0015/container_1462863487071_0015_01_12/stderr
>>>
>>> Container killed on request. Exit code is 143
>>> Container exited with a non-zero exit code 143
>>>
>>>
>>> Regards,
>>>
>>> Ananth
>>>
>>>
>>>
>>
>


Re: Set schemaRequired differently on different operator implementations

2016-05-05 Thread Thomas Weise
Yes, for an output port the operator is making the call, so you would need
to override the method that emits.
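
The behaviour discussed in this thread matches plain Java field hiding, which
can be shown without Apex. In the sketch below, DemoPort and the operator
classes are hypothetical stand-ins, not Apex types. Redeclaring a field in a
subclass hides the parent's field rather than replacing it, so code inherited
from the parent keeps emitting on the parent's instance unless the emitting
method is overridden as well.

```java
import java.util.ArrayList;
import java.util.List;

class FieldHidingDemo {

    // Hypothetical stand-in for an output port: records what is emitted.
    static class DemoPort {
        final List<String> emitted = new ArrayList<>();

        void emit(String tuple) {
            emitted.add(tuple);
        }
    }

    static class ParentOperator {
        final DemoPort output = new DemoPort();

        // Emit hook that subclasses can override.
        protected void emitTuple(String tuple) {
            output.emit(tuple);
        }

        void process(String tuple) {
            emitTuple(tuple.toUpperCase());
        }
    }

    // Redeclares the field only: the inherited emitTuple() still uses ParentOperator.output.
    static class HidingOnlyChild extends ParentOperator {
        final DemoPort output = new DemoPort();
    }

    // Redeclares the field and overrides the emit hook to use the new field.
    static class OverridingChild extends ParentOperator {
        final DemoPort output = new DemoPort();

        @Override
        protected void emitTuple(String tuple) {
            output.emit(tuple);
        }
    }

    public static void main(String[] args) {
        HidingOnlyChild hiding = new HidingOnlyChild();
        hiding.process("a");
        System.out.println(hiding.output.emitted);                    // []
        System.out.println(((ParentOperator) hiding).output.emitted); // [A]

        OverridingChild overriding = new OverridingChild();
        overriding.process("b");
        System.out.println(overriding.output.emitted);                // [B]
    }
}
```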

On Thu, May 5, 2016 at 2:33 PM, Bhupesh Chawda <bhup...@datatorrent.com>
wrote:

> It seems the parent instance actually uses the hidden port for emitting
> tuples.
> So, as a solution, I am planning to create emit methods in the parent and
> override them in the child operators.
>
> ~Bhupesh
>
> On Thu, May 5, 2016 at 2:09 PM, Bhupesh Chawda <bhup...@datatorrent.com>
> wrote:
>
>> It does not seem to be working when I override the output port in the
>> child operator.
>> The tuples are processed, but are never emitted out.
>>
>> I'll create a bug ticket for this.
>>
>> ~Bhupesh
>>
>> On Mon, May 2, 2016 at 5:15 PM, Bhupesh Chawda <bhup...@datatorrent.com>
>> wrote:
>>
>>> Okay. I tried this using dummy classes. But if the behaviour is
>>> different in case of ports in Apex, then it might solve the problem.
>>> I'll try it out.
>>>
>>> Thanks.
>>> ~Bhupesh
>>>
>>> On Mon, May 2, 2016 at 5:09 PM, Thomas Weise <thomas.we...@gmail.com>
>>> wrote:
>>>
>>>> The port is called by the engine. The engine will only call the port in
>>>> the derived class. If not, then it is a bug.
>>>>
>>>> On Mon, May 2, 2016 at 5:02 PM, Bhupesh Chawda <bhup...@datatorrent.com
>>>> > wrote:
>>>>
>>>>> Thanks Thomas.
>>>>> I tried that. But in cases where the port is used even in the base
>>>>> class, different instances are used by both parent and child.
>>>>>
>>>>> ~Bhupesh
>>>>>
>>>>> On Mon, May 2, 2016 at 5:00 PM, Thomas Weise <thomas.we...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Yes, you can override the port with a new port field with the same
>>>>>> name.
>>>>>>
>>>>>>
>>>>>> On Mon, May 2, 2016 at 4:22 PM, Bhupesh Chawda <
>>>>>> bhup...@datatorrent.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have a base operator which is parametrized:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *class OperatorBase extends BaseOperator{*
>>>>>>>
>>>>>>> *public final transient DefaultOutputPort output = new
>>>>>>> DefaultOutputPort();}*
>>>>>>>
>>>>>>> I also have two implementations of this operator as follows:
>>>>>>>
>>>>>>>- Instantiating with  type -
>>>>>>>- *class Operator1 extends OperatorBase*
>>>>>>>- POJO Implementation using  -
>>>>>>>- *class Operator1 extends OperatorBase*
>>>>>>>
>>>>>>> The problem is the following:
>>>>>>>
>>>>>>> I need to enable the annotation "*schemaRequired*" on the output
>>>>>>> port "*output*" only if a POJO implementation is used and not
>>>>>>> otherwise. Please suggest some way to achieve this.
>>>>>>>
>>>>>>> Can this be done without moving the port to individual child classes?
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> ~Bhupesh
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Set schemaRequired differently on different operator implementations

2016-05-02 Thread Thomas Weise
Yes, you can override the port with a new port field with the same name.
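
A minimal, self-contained illustration of the idea, using a stand-in
annotation rather than the real Apex one: DemoPortAnnotation, Base, PojoImpl
and PlainImpl are hypothetical names, and the reflective lookup is only an
assumption about how a tool might read the setting. Redeclaring the field in
one subclass lets that subclass carry a different annotation value, while
subclasses that do not redeclare it keep the base setting.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class PortAnnotationDemo {

    // Stand-in annotation; only the one attribute from the question is modeled.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface DemoPortAnnotation {
        boolean schemaRequired() default false;
    }

    static class Base {
        @DemoPortAnnotation
        public final transient Object output = new Object();
    }

    // Redeclares the field under the same name with a different annotation value.
    static class PojoImpl extends Base {
        @DemoPortAnnotation(schemaRequired = true)
        public final transient Object output = new Object();
    }

    // Does not redeclare the field, so the base annotation applies.
    static class PlainImpl extends Base {
    }

    /** Walks up the hierarchy and reads the annotation on the first "output" field found. */
    static boolean schemaRequired(Class<?> type) {
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            try {
                return c.getDeclaredField("output")
                        .getAnnotation(DemoPortAnnotation.class)
                        .schemaRequired();
            } catch (NoSuchFieldException e) {
                // Not declared at this level; continue with the superclass.
            }
        }
        throw new IllegalArgumentException("no field named output in " + type);
    }

    public static void main(String[] args) {
        System.out.println(schemaRequired(PojoImpl.class));  // true
        System.out.println(schemaRequired(PlainImpl.class)); // false
    }
}
```

Note that the redeclared field is a separate object from the one in the base
class, so base-class code that emits through its own field is unaffected by
the redeclaration.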


On Mon, May 2, 2016 at 4:22 PM, Bhupesh Chawda 
wrote:

> Hi All,
>
> I have a base operator which is parametrized:
>
>
>
> *class OperatorBase extends BaseOperator{*
>
> *public final transient DefaultOutputPort output = new
> DefaultOutputPort();}*
>
> I also have two implementations of this operator as follows:
>
>- Instantiating with  type -
>- *class Operator1 extends OperatorBase*
>- POJO Implementation using  -
>- *class Operator1 extends OperatorBase*
>
> The problem is the following:
>
> I need to enable the annotation "*schemaRequired*" on the output port "
> *output*" only if a POJO implementation is used and not otherwise. Please
> suggest some way to achieve this.
>
> Can this be done without moving the port to individual child classes?
>
> Thanks.
>
> ~Bhupesh
>


Apache Apex announced as Top-Level Project

2016-04-25 Thread Thomas Weise
Dear Community,

Apex graduation was approved by the ASF board last week and the
announcement went out this morning:

https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces90

https://twitter.com/TheASF/status/724538689993474048

Congratulations, everyone! We look forward to taking it forward from here.
Thanks to the ASF, and especially our mentors, for the support.
Everyone, please help promote the news!

In recent weeks we had many meetup events, and there is a ton of information
about Apex available to get started. Have a look and share it with your
friends and colleagues:

http://apex.apache.org/docs.html

http://www.slideshare.net/ApacheApex

Do help and let everyone know about the awesome capabilities of Apex.

We have a lot of plans ahead and everyone is invited to help:


http://apex.apache.org/roadmap.html

Next up will be the 3.4.0 release which will add support for anti-affinity
of operators, to complement the already existing stream locality. It will
also add the foundation for large state management in operators, which will
make it into the join operators etc. in subsequent releases.

Also in the works are several new operators to simplify development of
ingest and transform pipelines (enrichment, more file formats etc.)

Almost complete is the first iteration of the high-level Java (stream) API.

We are also working on other higher level abstractions and integrations,
including SAMOA, Storm compatibility, optimizations for batch, broader
support for event time windowing etc.

Follow @ApacheApex  https://twitter.com/ApacheApex
Check out upcoming meetups:  http://www.meetup.com/topics/apache-apex

Stay tuned for more.

Thomas on behalf of the Apache Apex PMC.


Re: Error finding KafkaSinglePortStringInputOperator (NoClassDefFoundError)

2016-02-29 Thread Thomas Weise
Suhas,

Do not copy any operator libraries into the lib folder. These dependencies
need to be packaged with the application into the .apa application package.
The .apa will also need to contain the Kafka dependencies.

Can you please share which .jar files are in your app package (unzip -l
yourapp.apa)?

Thomas
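
Since an .apa can be listed with unzip, it can also be inspected
programmatically as a zip archive. The following Java sketch (class and
method names are made up here) prints every .jar entry in a package, which is
one way to check whether the contrib and Kafka jars were actually bundled. It
makes no assumption about the directory layout inside the package.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class ApaJarLister {

    /** Returns the sorted names of all .jar entries inside a zip-format package. */
    static List<String> listJars(String packagePath) throws IOException {
        List<String> jars = new ArrayList<>();
        try (ZipFile zip = new ZipFile(packagePath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.endsWith(".jar")) {
                    jars.add(name);
                }
            }
        }
        Collections.sort(jars);
        return jars;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ApaJarLister.java <package file>");
            return;
        }
        for (String jar : listJars(args[0])) {
            System.out.println(jar);
        }
    }
}
```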


On Mon, Feb 29, 2016 at 9:10 PM, Suhas Gogate  wrote:

> I have a single-node DataTorrent/Apex installation, with the software
> installed under "/opt/datatorrent/releases/3.2.0". When I uploaded and
> launched my application, I got an error finding the class
> "KafkaSinglePortStringInputOperator". I could compile the application
> after adding the dt-contrib dependency, but I am not sure how the
> dt-contrib jars are made available to the application when running on a
> single-node DataTorrent installation.
>
> Appreciate help!
>
> —Suhas
> PS: I explicitly copied the  dt-contrib-3.1.1.jar to
> /opt/datatorrent/current/lib/dt-contrib-3.1.1.jar and restarted
> the gateway..
>
>
> Error launching the application:
>
> An error occurred trying to launch the application. Server message:
> java.lang.NoClassDefFoundError: com/datatorrent/contrib/kafka/KafkaSinglePortStringInputOperator
>   at io.ampool.demo.adtech.Application.populateDAG(Application.java:44)
>   at com.datatorrent.stram.plan.logical.LogicalPlanConfiguration.prepareDAG(LogicalPlanConfiguration.java:2108)
>   at com.datatorrent.stram.client.StramAppLauncher$1.createApp(StramAppLauncher.java:407)
>   at com.datatorrent.stram.client.StramAppLauncher.launchApp(StramAppLauncher.java:482)
>   at com.datatorrent.stram.cli.DTCli$LaunchCommand.execute(DTCli.java:2047)
>   at com.datatorrent.stram.cli.DTCli.launchAppPackage(DTCli.java:3450)
>   at com.datatorrent.stram.cli.DTCli.access$7000(DTCli.java:106)
>   at com.datatorrent.stram.cli.DTCli$LaunchCommand.execute(DTCli.java:1892)
>   at com.datatorrent.stram.cli.DTCli$3.run(DTCli.java:1449)
> Caused by: java.lang.ClassNotFoundException: com.datatorrent.contrib.kafka.KafkaSinglePortStringInputOperator
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:810)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 9 more
> Fatal error encountered
>
>
>
>
> Application.java
>
> import org.apache.hadoop.conf.Configuration;
>
> import com.datatorrent.api.DAG;
> import com.datatorrent.api.DAG.Locality;
> import com.datatorrent.api.StreamingApplication;
> import com.datatorrent.api.annotation.ApplicationAnnotation;
> import com.datatorrent.contrib.kafka.KafkaSinglePortStringInputOperator;
> import com.datatorrent.lib.io.ConsoleOutputOperator;
>
> @ApplicationAnnotation(name = "AdStream")
> public class Application implements StreamingApplication
> {
>   @Override
>   public void populateDAG(DAG dag, Configuration entries)
>   {
>     // Kafka source operator
>     KafkaSinglePortStringInputOperator input = dag.addOperator(
>         "MessageReader", new KafkaSinglePortStringInputOperator());
>
>     // Console sink operator
>     ConsoleOutputOperator output = dag.addOperator("Output", new
>         ConsoleOutputOperator());
>
>     // Connect the source's output port to the sink's input port
>     dag.addStream("MessageData", input.outputPort, output.input);
>   }
> }
>
>


Re: Kinesis Operator Help

2016-02-16 Thread Thomas Weise
Ram,

The recovery path, when under the application directory, will be
automatically copied to the new app directory when the relaunch option is
used. This is how the previous instance's data is made available to the new
app.

Thomas
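
The four steps proposed in the original message quoted further down (a stable
identifier that maps to the most recent application id) can be sketched
outside of Apex as a small lookup table. The Java below is only an
illustration under stated assumptions: AppIdRegistry is a made-up class, and
a local properties file stands in for whatever shared store (HDFS, a
database) would actually be used. The looked-up id would then be the value
passed to the launch command's -originalAppId option.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper: maps a stable, user-chosen key to the last application id.
class AppIdRegistry {
    private final Path store;

    AppIdRegistry(Path store) {
        this.store = store;
    }

    /** Returns the application id recorded for the key, if any. */
    Optional<String> lookup(String stableKey) throws IOException {
        return Optional.ofNullable(load().getProperty(stableKey));
    }

    /** Records (or replaces) the application id for the key. */
    void record(String stableKey, String appId) throws IOException {
        Properties props = load();
        props.setProperty(stableKey, appId);
        try (OutputStream out = Files.newOutputStream(store)) {
            props.store(out, "stable key -> last application id");
        }
    }

    private Properties load() throws IOException {
        Properties props = new Properties();
        if (Files.exists(store)) {
            try (InputStream in = Files.newInputStream(store)) {
                props.load(in);
            }
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        AppIdRegistry registry = new AppIdRegistry(Paths.get("app-ids.properties"));
        Optional<String> previous = registry.lookup("edi-router");
        if (previous.isPresent()) {
            System.out.println("relaunch with -originalAppId " + previous.get());
        } else {
            System.out.println("first launch: record the new application id afterwards");
        }
    }
}
```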

On Tue, Feb 16, 2016 at 5:23 PM, Munagala Ramanath 
wrote:

> Ah, I understand now.
>
> The path is set in
> IdempotentStorageManager.FSIdempotentStorageManager.setup() near line 146:
> appPath = new Path(context.getValue(DAG.APPLICATION_PATH) + Path.SEPARATOR
> + recoveryPath);
>
> You can try creating a new class that extends FSIdempotentStorageManager
> and overrides setup() to use a local property
> for the appPath, simply duplicating the rest of the code.
>
> Ram
>
> On Tue, Feb 16, 2016 at 3:59 PM, Jim  wrote:
>
>> Ram,
>>
>>
>>
>> I am not 100% fluent in the details of the base kinesis operator and how
>> it interacts with Hadoop (hence my posting); if it would support that, then
>> yes, you could.
>>
>>
>>
>> My goal is to make it easy to pick up where one left off reading the
>> Kinesis stream, regardless of whether the application is killed and
>> re-launched, etc., without needing to go out to the CLI to run some
>> commands (because at some point some operator will forget, and then we
>> will reprocess a bunch of transactions; that would not be good!).
>>
>>
>>
>> Jim
>>
>>
>>
>> *From:* Munagala Ramanath [mailto:r...@datatorrent.com]
>> *Sent:* Tuesday, February 16, 2016 5:21 PM
>> *To:* users@apex.incubator.apache.org
>> *Subject:* Re: Kinesis Operator Help
>>
>>
>>
>> Why use the application id ? Could you generate and use a java.util.UUID
>> for example and save it in HDFS ?
>>
>>
>>
>> Ram
>>
>>
>>
>> On Tue, Feb 16, 2016 at 11:40 AM, Jim  wrote:
>>
>> Good morning,
>>
>>
>>
>> I am new to Apex, Hadoop and Yarn (nothing like tackling something new,
>> is there?).
>>
>>
>>
>> I have my first Apex apps working. They are EDI processors that read new
>> EDI transactions from an Amazon Kinesis stream, look at the data, and
>> route the EDI data to an appropriate handler for processing (note that
>> operatorEs pushes the data to ElasticSearch for logging).  Here is a
>> diagram:
>>
>>
>>
>>
>>
>> Everything launches, and is working fine with the above diagram from the
>> edi router through the transaction operators.
>>
>>
>>
>> The final challenge I am having, being new to all of this, is that the
>> Kinesis operator, by default, stores its app id in
>> IdempotentStorageManager (aka WindowDataManager) when it is launched, so if
>> the app is shut down and restarted, this same app id is used by default with
>> the checkpoint and you don’t reprocess the same records again when the
>> application is restarted.
>>
>>
>>
>> You can see this id immediately to the right of the Operations / apps in
>> gray lettering ‘application_1453741656046_0520’ in the image from the
>> datatorrent console below:
>>
>>
>>
>> [image: DataTorrent console showing the application id]
>>
>>
>>
>> However, if you kill the application, and re-launch, this id changes, and
>> it starts reading from the Kinesis stream back from the beginning; and the
>> only way to restart it so it starts where it left off is using the cli as
>> follows:
>>
>>
>>
>> 1.) Run ‘dtcli’ from the command line.
>>
>> 2.) Run ‘launch -originalAppId “application_1453741656046_0520” <path to .apa file>’
>>
>>
>>
>> This will launch the application using the same app id identified in the
>> console screen above.
>>
>>
>>
>> I want to make this easier, but need some expert help in tweaking this
>> so it works.
>>
>>
>>
>> I am thinking that there should be a way with Kinesis to:
>>
>>
>>
>> 1.) Define a Kinesis app id string value in the properties.
>>
>> 2.) If this value is defined, it will be used when launching the
>> application to check whether a Hadoop app id has already been assigned to
>> that identifier.
>>
>> 3.) If that value is not yet stored in the database, it will launch
>> the app, creating a new app id, and store the app id under the identifier
>> key value.
>>
>> 4.) Now if I kill the app, or install new software, it will always
>> pick up where it left off by using the identifier key value to retrieve and
>> assign the app id.
>>
>>
>>
>> Sounds simple, right? :)
>>
>>
>>
>> Can one of the experts out there help me figure this out as I don’t want
>> to reprocess already processed edi transactions?
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Jim
>>
>> jim@facility.supplies (414) 760-7711

Re: What is the backpressure story ?

2016-01-31 Thread Thomas Weise
That's incorrect. Backpressure works when spooling is enabled (which is the
default). It goes unhandled only when you turn spooling off explicitly.
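
As a toy model of the spooling idea only, and not of how the Apex buffer
server is implemented, the Java sketch below keeps a bounded in-memory queue
and, when spooling is enabled, diverts overflow to a second queue that stands
in for the file system. All names are hypothetical and the class is
single-threaded for simplicity. With spooling on, the publisher is never
turned away; with spooling off, publish() reports that the tuple was not
accepted, and what the real system does at that point is not modeled here.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;

class SpoolingBufferDemo {
    private final ArrayBlockingQueue<String> memory;        // bounded in-memory buffer
    private final Queue<String> spool = new ArrayDeque<>(); // stand-in for spooled data on disk
    private final boolean spoolingEnabled;

    SpoolingBufferDemo(int memoryCapacity, boolean spoolingEnabled) {
        this.memory = new ArrayBlockingQueue<>(memoryCapacity);
        this.spoolingEnabled = spoolingEnabled;
    }

    /** Returns true if the tuple was accepted, false if the buffer is full and spooling is off. */
    boolean publish(String tuple) {
        // Once anything has been spooled, keep spooling so that order is preserved.
        if (spool.isEmpty() && memory.offer(tuple)) {
            return true;
        }
        if (spoolingEnabled) {
            spool.add(tuple);
            return true;
        }
        return false;
    }

    /** Returns the oldest tuple (or null) and refills memory from the spool. */
    String consume() {
        String tuple = memory.poll();
        String spooled = spool.poll();
        if (spooled != null) {
            memory.offer(spooled);
        }
        return tuple;
    }

    int spooledCount() {
        return spool.size();
    }

    public static void main(String[] args) {
        SpoolingBufferDemo on = new SpoolingBufferDemo(2, true);
        for (String t : new String[] {"a", "b", "c", "d"}) {
            on.publish(t);
        }
        System.out.println(on.spooledCount()); // 2

        SpoolingBufferDemo off = new SpoolingBufferDemo(2, false);
        off.publish("a");
        off.publish("b");
        System.out.println(off.publish("c"));  // false
    }
}
```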

On Sun, Jan 31, 2016 at 3:50 PM, Sandesh Hegde 
wrote:

> According to Vlad, disabling the spooling will crash the buffer server
> after it runs out of memory.
>
> It means Apex doesn't have a mechanism to handle backpressure yet.
>
> On Fri, Jan 29, 2016 at 9:34 AM Pramod Immaneni 
> wrote:
>
>> By default, buffer spooling is enabled, so data gets spooled to the file
>> system once the buffer limits are reached. There will be some slowdown, but
>> upstream will continue to process. If buffer spooling is disabled, then when
>> the buffers are filled the sender is blocked, and this back pressure will
>> propagate upstream to the first operator.
>>
>> On Fri, Jan 29, 2016 at 7:47 AM, Sandesh Hegde 
>> wrote:
>>
>>> Hello Team,
>>>
>>> My understanding of backpressure in Apex is that the buffer server will
>>> slow down the upstream operator (because of TCP/IP congestion control) if
>>> the downstream is slow. Is there more to it?
>>> I don't see this topic covered in the docs.
>>>
>>> Thanks
>>> Sandesh
>>>
>>>
>>>
>>>
>>


Re: read from multiple kafka topics

2016-01-27 Thread Thomas Weise
The new operator for Kafka 0.9 can read multiple topics.


On Wed, Jan 27, 2016 at 5:02 PM, Munagala Ramanath 
wrote:

> Since a single AbstractKafkaInputOperator has a single KafkaConsumer and
> the latter has only a single topic,
> it appears that the only option is to have one input operator per topic.
>
> There also seems to be some machinery to get a reasonable mapping of input
> operator partitions to
> topic partitions, so this also points to needing an input operator per
> topic.
>
> Ram
>
> On Wed, Jan 27, 2016 at 4:24 PM, Ashwin Chandra Putta <
> ashwinchand...@gmail.com> wrote:
>
>> Hi,
>>
>> Does the kafka input operator support reading from multiple kafka topics?
>> What is the suggested approach to deal with reading from multiple topics?
>>
>> --
>>
>> Regards,
>> Ashwin.
>>
>
>


Re: Stateless operator and Properties

2016-01-22 Thread Thomas Weise
Properties are part of the operator state. So if the intention is to change
properties after the application was launched, the operator cannot be marked
stateless.
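
The point can be illustrated with plain Java serialization standing in for
checkpointing. This is a sketch under assumptions: DemoOperator and its
threshold property are hypothetical, and treating "stateless" as "recover
from the launch-time copy instead of a checkpoint" is a simplification. A
property changed while running survives recovery only if the state that
contains it was saved after the change.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class PropertyStateDemo {

    // Hypothetical operator whose only state is one configurable property.
    static class DemoOperator implements Serializable {
        private static final long serialVersionUID = 1L;
        private int threshold;

        DemoOperator(int threshold) {
            this.threshold = threshold;
        }

        int getThreshold() {
            return threshold;
        }

        void setThreshold(int threshold) {
            this.threshold = threshold;
        }
    }

    /** Serializes the operator, standing in for a saved copy of its state. */
    static byte[] snapshot(DemoOperator op) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(op);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds an operator from a saved copy. */
    static DemoOperator restore(byte[] saved) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(saved))) {
            return (DemoOperator) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        DemoOperator op = new DemoOperator(10);
        byte[] launchState = snapshot(op); // state as of launch
        op.setThreshold(25);               // property changed while running
        byte[] checkpoint = snapshot(op);  // state saved after the change

        System.out.println(restore(checkpoint).getThreshold());  // 25
        System.out.println(restore(launchState).getThreshold()); // 10
    }
}
```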


On Fri, Jan 22, 2016 at 9:11 AM, Sandesh Hegde 
wrote:

> Hello Team,
>
> Is it possible to mark operators stateless but still have their properties
> applied from the config after recovery?
>
> Thanks
> Sandesh
>


Re: Making inherited operator port optional

2016-01-21 Thread Thomas Weise
You should be able to override the port.

On Thu, Jan 21, 2016 at 7:18 AM, Munagala Ramanath 
wrote:

> In OutputPortFieldAnnotation.java we have this:
>
>   public boolean optional() default true;
>
> So it looks like output ports are optional by default. If A annotates the
> port as "optional = false", you're out of luck.
>
> Ram
>
> On Thu, Jan 21, 2016 at 6:57 AM, Yogi Devendra 
> wrote:
>
>> I have Operator B extends Operator A.
>>
>> Operator A is from malhar library.
>> Operator A declares output port P.
>>
>> For Operator B, I wish to make port P (inherited from A) optional
>> (without changing the original Operator A source code in Malhar).
>>
>> Is there any way to achieve this?
>>
>> ~ Yogi
>>
>
>