Why DataStreamer.flush() is not flushing?

2017-08-28 Thread Pranas Baliuka
I'm trying to add 100M time-series measurements in chunks of BLOCK = 4_500
per value, using this structure:

Key:
public class Key {
  private int securityId;
  private long date;
}

Value:
public class OHLC {
  private long date;
  private int securityId;
  private int size;
  private long[] time;
  private double[] open;
  private double[] high;
  private double[] low;
  private double[] close;
  private double[] marketVWAP;
}

I need some kind of checkpoint to flush the queues to the cache, ideally
every 30 seconds.

I've tried configuring the streamer:
streamer.allowOverwrite(true);
streamer.perNodeBufferSize(20);
streamer.autoFlushFrequency(TimeUnit.SECONDS.toMillis(30));
streamer.skipStore(false);
streamer.keepBinary(true);

and even flushing explicitly:

if (blockId % 20 == 0)
  streamer.flush();
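For reference, the pieces above can be combined into one lifecycle sketch. This is a hedged outline, not a verified fix: the cache name "ohlcCache" and the simplified Integer/String entries (instead of the Key/OHLC types) are assumptions for illustration; the setters shown are the Ignite 2.x IgniteDataStreamer API, and closing the streamer via try-with-resources performs a final flush:

```java
import java.util.concurrent.TimeUnit;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerLifecycleSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("ohlcCache");

            // Closing the streamer (try-with-resources) also flushes
            // any remaining buffered data.
            try (IgniteDataStreamer<Integer, String> streamer =
                     ignite.dataStreamer("ohlcCache")) {
                streamer.allowOverwrite(true);
                streamer.autoFlushFrequency(TimeUnit.SECONDS.toMillis(30));

                for (int blockId = 0; blockId < 100; blockId++) {
                    streamer.addData(blockId, "block-" + blockId);

                    if (blockId % 20 == 0)
                        streamer.flush(); // blocks until buffered entries are streamed
                }
            }

            System.out.println(ignite.cache("ohlcCache").size());
        }
    }
}
```
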

After flush() is invoked (supposed to be a blocking operation), I check
the entry count of the cache:
final IgniteCache cache = ignite.getOrCreateCache(CACHE_NAME);
System.out.println(" >>> Simulator - Inserted " + cache.size() * BLOCK_SIZE + " " + new Date());
Thread.sleep(TimeUnit.SECONDS.toMillis(40));
System.out.println(" >>> Simulator - Inserted " + cache.size() * BLOCK_SIZE + " " + new Date());
Thread.sleep(TimeUnit.SECONDS.toMillis(40));
System.out.println(" >>> Simulator - Inserted " + cache.size() * BLOCK_SIZE + " " + new Date());

But I'm getting cache.size() == 1.
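One thing worth ruling out (an illustration, not a confirmed diagnosis): cache.size() counts cache entries, i.e. distinct keys, not individual measurements. If every streamed block maps to the same Key — for instance, the date field never changes within the run — each put overwrites the previous entry and size() stays at 1 no matter how much data was flushed. A plain-HashMap sketch of that effect, with equals/hashCode added to a Key mirroring the one above:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class KeyCollapseDemo {
    // Mirrors the Key class from the post (equals/hashCode added for map use).
    static final class Key {
        final int securityId;
        final long date;

        Key(int securityId, long date) {
            this.securityId = securityId;
            this.date = date;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return securityId == k.securityId && date == k.date;
        }

        @Override public int hashCode() {
            return Objects.hash(securityId, date);
        }
    }

    public static void main(String[] args) {
        Map<Key, long[]> cache = new HashMap<>();

        // 20 blocks for the same security on the same date: every put
        // overwrites the previous entry, so the map ends up with size 1.
        for (int block = 0; block < 20; block++)
            cache.put(new Key(1, 20170828L), new long[4500]);

        System.out.println(cache.size()); // prints 1
    }
}
```
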

According to the documentation:

flush(): "Streams any remaining data, ... this method blocks and doesn't
allow to add any data until all data is streamed."

size(): "Gets the number of all entries cached across all nodes. By default,
if {@code peekModes} value isn't defined, only size of primary copies across
all nodes will be returned."

From what I understand, this does not work on 2.1.0. Is there some known
workaround to flush the data from the streamer to the cache?

Thanks a lot
Pranas




--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Why-DataStreamer-flush-is-not-flushing-tp16466.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: DataStreamer operation failed

2017-08-28 Thread Pranas Baliuka
Thanks, Konstantin, for looking into it.

The issue was that the server node's and client node's serialization were
not compatible. Moved to ... and am using a single server-type node for
further testing.



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/DataStreamer-operation-failed-tp16439p16465.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


DataStreamer operation failed

2017-08-28 Thread Pranas Baliuka
I am prototyping Apache Ignite to store a simple data grid with simple
objects:

Key:
  private final int securityId;
  private final long time;

OHLC: 
  private final int securityId;
  private final long time;
  private final double open;
  private final double high;
  private final double low;
  private final double close;
  private final double marketVWAP;

After inserting 20 to 30 such key-value entries, I get:


[16:33:32]__   
[16:33:32]   /  _/ ___/ |/ /  _/_  __/ __/ 
[16:33:32]  _/ // (7 7// /  / / / _/   
[16:33:32] /___/\___/_/|_/___/ /_/ /___/  
[16:33:32] 
[16:33:32] ver. 2.1.0#20170721-sha1:a6ca5c8a
[16:33:32] 2017 Copyright(C) Apache Software Foundation
[16:33:32] 
[16:33:32] Ignite documentation: http://ignite.apache.org
[16:33:32] 
[16:33:32] Quiet mode.
[16:33:32]   ^-- Logging to file
'/Users/pranas/Apps/apache-ignite-fabric-2.1.0-bin/work/log/ignite-2ac0c6b2.log'
[16:33:32]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false
or "-v" to ignite.{sh|bat}
[16:33:32] 
[16:33:32] OS: Mac OS X 10.12.6 x86_64
[16:33:32] VM information: Java(TM) SE Runtime Environment 1.8.0_40-b27
Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 25.40-b25
[16:33:32] Configured plugins:
[16:33:32]   ^-- None
[16:33:32] 
[16:33:32] Message queue limit is set to 0 which may lead to potential OOMEs
when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to
message queues growth on sender and receiver sides.
[16:33:32] Security status [authentication=off, tls/ssl=off]
[16:33:32] REST protocols do not start on client node. To start the
protocols on client node set '-DIGNITE_REST_START_ON_CLIENT=true' system
property.
[16:33:33] Refer to this page for more performance suggestions:
https://apacheignite.readme.io/docs/jvm-and-system-tuning
[16:33:33] 
[16:33:33] To start Console Management & Monitoring run
ignitevisorcmd.{sh|bat}
[16:33:33] 
[16:33:33] Ignite node started OK (id=2ac0c6b2)
[16:33:33] Topology snapshot [ver=6, servers=1, clients=1, CPUs=8,
heap=11.0GB]
 >>> Apache Ignite node is up and running.
 >>> Simulator - Real code would process journal events ... TODO.
Processed 0 events so far
Processed 10 events so far
Processed 20 events so far
[2017-08-28 16:33:46,592][ERROR][data-streamer-#36%null%][DataStreamerImpl] DataStreamer operation failed.
class org.apache.ignite.IgniteCheckedException: Failed to finish operation (too many remaps): 32
	at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$5.apply(DataStreamerImpl.java:869)
	at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$5.apply(DataStreamerImpl.java:834)
	at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:382)
	at org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346)
	at org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
	at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:494)
	at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:473)
	at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$Buffer.onResponse(DataStreamerImpl.java:1803)
	at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$3.onMessage(DataStreamerImpl.java:333)
	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
	at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1184)
	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:126)
	at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1097)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: class org.apache.ignite.IgniteCheckedException: DataStreamer request failed [node=74d746ea-6b57-49be-85eb-3064262ab039]
	at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$Buffer.onResponse(DataStreamerImpl.java:1792)
	... 8 more
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to marshal response error, see node log for details.
	at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.start(DataStreamProcessor.java:102)
	at org.apache.ignite.internal.IgniteKernal.startProcessor(IgniteKernal.java:1788)
	at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:938)
	at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:1896)
	at

Re: Market data binary messages processed with Ignite and Spark

2016-10-15 Thread Pranas Baliuka
Thanks for your input.

> Depends how you do the lookup. Is it by ID? Then keep the ids as small as 
> possible. Lookup is fastest in a hash map type of datastructure. In case of a 
> distributed setting supported by a bloom filter. 
> Apache Ignite can be seen as suitable. 

My take from it: use an int instead of a long for the key.

> I do not think the caching of the filesystem benefits here, because the key 
> is the datastructure here (hash map). 
My concern is that the cache can be polluted easily… at the same time, with
modern OS prefetching techniques for sequential access it may be suitable.
Reading uncompressed data from an SSD with a single consumer (I'd have
multiple) is capable of feeding the processing pipeline.

> Maybe you can tell a little bit more about the data. Are the messages 
> dependent? What type of calculation do you do? 
The data would be a binary payload, e.g. 20 levels of bid/ask of the order
book for each change on the primary market for a specific security. I'd have
> 3K securities, i.e. 3K caches to read from.
The purpose is to back-test a trading algorithm (with feedback of the
algorithm on market liquidity) and find stable regions of parameters
delivering good performance of the fitness function (diff from a selected
benchmark).
The algorithm would be iterative, selecting focus areas of the parameter
grid to zoom in and expose possibly risky choices in the "good" areas of the
parameter space.

Each point in the grid (e.g. 2 parameters optimised) is the result of the
fitness function from a simulation of a single security over 1 day of market
data.
1st iteration: calculate evenly distributed points in the parameter grid;
for each point, perform a simulation (trading for 1 day).
nth iteration: select the best-performing points and estimate their density
with parametric statistics; use the density function results to calculate a
new grid.
Stop once the density estimates are stable and/or computational resources
are exhausted.

The resulting grid would be visualised and provided to an analyst to
re-evaluate production settings, e.g. to see which parameters are stable for
particular market conditions.
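The iterative zoom-in described above can be sketched in miniature. This is a toy 1-D illustration: the quadratic fitness function and the 11-point grid stand in for the real 1-day trading simulation and 2-D parameter grid, and the density-estimation step is simplified to zooming around the best point:

```java
public class GridRefinementSketch {
    // Toy fitness function standing in for a 1-day trading simulation
    // (hypothetical; the real one would replay market data and diff
    // against a benchmark).
    static double fitness(double p) {
        return -(p - 3.73) * (p - 3.73); // single peak at p = 3.73
    }

    public static void main(String[] args) {
        double lo = 0, hi = 10; // parameter range under study

        for (int iter = 0; iter < 5; iter++) {
            // Evenly distributed points across the current range.
            int n = 11;
            double[] grid = new double[n];
            for (int i = 0; i < n; i++)
                grid[i] = lo + (hi - lo) * i / (n - 1);

            // "Simulate" every point, then zoom into the neighbourhood
            // of the best-performing one (density estimation reduced to
            // its simplest possible form).
            double best = grid[0];
            for (double p : grid)
                if (fitness(p) > fitness(best))
                    best = p;

            double step = (hi - lo) / (n - 1);
            lo = best - step;
            hi = best + step;
        }

        // Centre of the final focus area, rounded to 2 decimals.
        System.out.println(Math.round((lo + hi) / 2 * 100) / 100.0); // prints 3.73
    }
}
```

Each iteration shrinks the search range by a factor of 5, so a handful of iterations locates the peak without evaluating a dense grid everywhere — the same zoom-in idea, minus the statistics.
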

> On 15 Oct. 2016, at 10:32 pm, Jörn Franke [via Apache Ignite Users] 
> <ml-node+s70518n8315...@n6.nabble.com> wrote:
> 
> Depends how you do the lookup. Is it by ID? Then keep the ids as small as 
> possible. Lookup is fastest in a hash map type of datastructure. In case of a 
> distributed setting supported by a bloom filter. 
> Apache Ignite can be seen as suitable. 
> 
> Depending on what you need to do (maybe your approach requires hyperlolog 
> structured etc) you may look also at redis, but from what you describe Ignite 
> is suitable. 
> 
> I do not think the caching of the filesystem benefits here, because the key 
> is the datastructure here (hash map). 
> The concrete physical infrastructure to meet your SLAs can only be determined 
> when you experiment with real data. 
> 
> Maybe you can tell a little bit more about the data. Are the messages 
> dependent? What type of calculation do you do? 
> 

Market data binary messages processed with Ignite and Spark

2016-10-14 Thread Pranas Baliuka
Dear Ignite enthusiasts,

I am a beginner in Apache Ignite, but I want to prototype a solution using
Ignite caches with market data distributed across multiple nodes running
Spark RDDs.

I'd like to be able to send sequenced (from 1) binary messages (from 40
bytes to max 1 KB) to a custom Spark job processing a multidimensional cube
of parameters.
Each market-data event must be processed once, from #1 to #records, for each
parameter.
Number of messages: ~40-50M in one batch.

It would be great if you could share your experience with similar
implementations.

My high-level thinking:
* Prepare the system by loading an Ignite cache (unzipping the market-data
drop-copy file, converting it to the preferred binary format, and publishing
an IgniteCache<Long, BinaryObject>);
* Spawn Spark jobs to process the input cube of parameters (Spark RDD), each
using the same cached IgniteCache (accessed sequentially by sequence number,
from 1 to #messages, as the key);
* Store results in RDBMS/NoSQL storage;
* Produce reports from Apache Zeppelin using the Spark R interpreter.

I need the cache to outlive the Spark jobs, i.e. I may run a different cube
of parameters after one is finished.

I am not sure whether Ignite would be able to look up messages efficiently
(I'd need ~400K messages/s sustained retrieval).
Or should I consider something more file-oriented, e.g. a memory-mounted
file system on each node ...
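The sustained-retrieval figure is easy to sanity-check on a single node before involving Ignite at all. This is a rough sketch, not a benchmark: a plain in-heap HashMap with Long keys and 40-byte payloads (the smallest message size above); network hops, deserialization, and multiple consumers are all ignored, so it only gives an upper bound:

```java
import java.util.HashMap;
import java.util.Map;

public class LookupRateSketch {
    public static void main(String[] args) {
        // Populate with sequential Long keys and 40-byte payloads
        // (the minimum message size mentioned above).
        int n = 500_000;
        Map<Long, byte[]> messages = new HashMap<>(2 * n);
        for (long i = 1; i <= n; i++)
            messages.put(i, new byte[40]);

        // Sequential retrieval by sequence number, as the Spark job would do.
        long start = System.nanoTime();
        long bytes = 0;
        for (long i = 1; i <= n; i++)
            bytes += messages.get(i).length;
        double secs = (System.nanoTime() - start) / 1e9;

        System.out.printf("%d lookups, %.0f lookups/s%n", n, n / secs);
    }
}
```
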

Thanks in advance for sharing your ideas/proposals/know-how!




--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Market-data-binary-messages-processed-with-Ignite-and-Spark-tp8313.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.