My key is phone_number and they are all unique... I'll check with the command...
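A minimal sketch of that check, assuming control.sh is run from the Ignite bin directory and the cache is the MY_TABLE one defined further down the thread ("null" asks for the distribution across all nodes):

    ./control.sh --cache distribution null MY_TABLE

The output lists partitions per node with their sizes, so a heavy skew toward one node should show up directly.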
On Wed., Mar. 1, 2023, 11:20 a.m. Stephen Darlington <stephen.darling...@gridgain.com> wrote:

> The streamer doesn’t determine where the data goes. It just efficiently
> sends it to the correct place.
>
> If your data is skewed in some way so that there is more data in some
> partitions than others, then you could find one machine with more work to
> do than others. All else being equal, you’ll also get better distribution
> with more than three nodes.
>
> On 1 Mar 2023, at 15:45, John Smith <java.dev....@gmail.com> wrote:
>
> Ok thanks. I just thought the streamer would be more uniform.
>
> On Wed, Mar 1, 2023 at 4:41 AM Stephen Darlington <stephen.darling...@gridgain.com> wrote:
>
>> You might want to check the data distribution. You can use control.sh
>> --cache distribution to do that.
>>
>> On 28 Feb 2023, at 20:32, John Smith <java.dev....@gmail.com> wrote:
>>
>> The last thing I can add to clarify is that the 3-node cluster is a
>> centralized cluster and the CSV loader is a thick client running on its
>> own machine.
>>
>> On Tue, Feb 28, 2023 at 2:52 PM John Smith <java.dev....@gmail.com> wrote:
>>
>>> By the way, when I run a query like SELECT COLUMN_2, COUNT(COLUMN_1)
>>> FROM MY_TABLE GROUP BY COLUMN_2; the query runs full tilt at 100% on
>>> all 3 nodes and returns in a respectable manner.
>>>
>>> So I'm not sure what's going on, but with the data streamer I guess
>>> most of the writes are pushed to THE ONE node and the others are busy
>>> making the backups, or the network can't keep up with the push/backup
>>> traffic. The same behaviour happens with a replicated table when using
>>> the data streamer: one node seems to be running at almost 100% while
>>> the others hover at 40-50%.
>>> The fastest I could get the streamer to work was to turn off backups,
>>> but same thing: one node runs full tilt while the others are "slowish".
>>>
>>> Queries are OK; all nodes are fully utilized.
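A quick way to test whether the unique phone-number keys really spread evenly is to map a sample of them through the cache's affinity function and count primaries per node. A minimal sketch, assuming plain String keys (which is what key_type=String normally implies), a hypothetical client-config.xml, and placeholder sample keys:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.affinity.Affinity;
    import org.apache.ignite.cluster.ClusterNode;

    public class KeySkewCheck {
        public static void main(String[] args) {
            // Placeholder sample; in practice read a large sample of phone numbers from the CSV.
            List<String> sampleKeys = Arrays.asList("15550000001", "15550000002", "15550000003");

            try (Ignite ignite = Ignition.start("client-config.xml")) { // hypothetical client config
                Affinity<String> aff = ignite.affinity("MY_TABLE");
                Map<Object, Integer> primaries = new HashMap<>();

                for (String key : sampleKeys) {
                    ClusterNode node = aff.mapKeyToNode(key); // primary owner of this key
                    primaries.merge(node.consistentId(), 1, Integer::sum);
                }

                // Roughly equal counts mean the keys themselves are not the source of the skew.
                primaries.forEach((id, n) -> System.out.println(id + " -> " + n + " keys"));
            }
        }
    }

If the counts come out even here but one node still pegs during the load, the skew is more likely in the loading path than in the data.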
>>> On Tue, Feb 28, 2023 at 12:54 PM John Smith <java.dev....@gmail.com> wrote:
>>>
>>>> Hi, so I'm using it in a pretty straightforward kind of way, at least
>>>> I think...
>>>>
>>>> I'm loading 35 million lines from CSV into an SQL table. I decided to
>>>> use the streamer as I figured it would still be a lot faster than
>>>> batching SQL INSERTs.
>>>> I tried with backups=0 and backups=1 (I'd prefer to have backups on):
>>>> 1- With 0 backups: 6 minutes to load.
>>>> 2- With 1 backup: 15 minutes to load.
>>>>
>>>> In both cases I still see the same behaviour: the one machine seems to
>>>> be taking the brunt of the work...
>>>>
>>>> I'm reading the CSV file line by line and calling streamer.addData().
>>>>
>>>> The table definition is as follows:
>>>>
>>>> CREATE TABLE PUBLIC.MY_TABLE (
>>>>     COLUMN_1 VARCHAR(32) NOT NULL,
>>>>     COLUMN_2 VARCHAR(64) NOT NULL,
>>>>     CONSTRAINT PHONE_CARRIER_IDS_PK PRIMARY KEY (COLUMN_1)
>>>> ) WITH "template=parallelTpl, backups=0, key_type=String, value_type=MyObject";
>>>>
>>>> CREATE INDEX MY_TABLE_COLUMN_2_IDX ON PUBLIC.MY_TABLE (COLUMN_2);
>>>>
>>>> String fileName = "my_file";
>>>>
>>>> final String cacheNameDest = "MY_TABLE";
>>>>
>>>> try (
>>>>     Ignite igniteDest = configIgnite(
>>>>         Arrays.asList("...:47500..47509", "...:47500..47509", "...:47500..47509"),
>>>>         "ignite-dest");
>>>>     IgniteCache<BinaryObject, BinaryObject> cacheDest =
>>>>         igniteDest.getOrCreateCache(cacheNameDest).withKeepBinary();
>>>>     IgniteDataStreamer<BinaryObject, BinaryObject> streamer =
>>>>         igniteDest.dataStreamer(cacheNameDest)
>>>> ) {
>>>>     System.out.println("Ignite started.");
>>>>     long start = System.currentTimeMillis();
>>>>
>>>>     System.out.println("Cache size: " + cacheDest.size(CachePeekMode.PRIMARY));
>>>>     System.out.println("Default");
>>>>     System.out.println("1d");
>>>>
>>>>     IgniteBinary binaryDest = igniteDest.binary();
>>>>
>>>>     try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
>>>>         int count = 0;
>>>>
>>>>         String line;
>>>>         while ((line = br.readLine()) != null) {
>>>>             String[] parts = line.split("\\|");
>>>>
>>>>             // Key: binary object carrying the phone number (parts[1]).
>>>>             BinaryObjectBuilder keyBuilder = binaryDest.builder("String");
>>>>             keyBuilder.setField("COLUMN_1", parts[1], String.class);
>>>>
>>>>             // Value: binary object carrying the carrier (parts[3]).
>>>>             BinaryObjectBuilder valueBuilder = binaryDest.builder("PhoneCarrier");
>>>>             valueBuilder.setField("COLUMN_2", parts[3], String.class);
>>>>
>>>>             streamer.addData(keyBuilder.build(), valueBuilder.build());
>>>>
>>>>             count++;
>>>>
>>>>             if ((count % 10000) == 0)
>>>>                 System.out.println(count);
>>>>         }
>>>>
>>>>         streamer.flush();
>>>>
>>>>         long end = System.currentTimeMillis();
>>>>         System.out.println("Ms: " + (end - start));
>>>>     } catch (IOException e) {
>>>>         e.printStackTrace();
>>>>     }
>>>> }
>>>>
>>>> On Tue, Feb 28, 2023 at 11:00 AM Jeremy McMillan <jeremy.mcmil...@gridgain.com> wrote:
>>>>
>>>>> Have you tried tracing the workload on the 100% and 40% nodes for
>>>>> comparison? There just isn't enough detail in your question to help
>>>>> predict what should be happening with the cluster workload. For a
>>>>> starting point, please identify your design goals. It's easy to get
>>>>> confused by advice that seeks to help you do something you don't want
>>>>> to do.
>>>>>
>>>>> Some things to think about include how the stream workload is
>>>>> composed. How should/would this work if there were only one node? How
>>>>> should behavior change as nodes are added to the topology and the
>>>>> test is repeated?
>>>>>
>>>>> Gedanken: what if the data streamer is doing some really expensive
>>>>> operations as it feeds the data into the stream, but the nodes can
>>>>> very cheaply put the processed data into their cache partitions? In
>>>>> this case, for example, the expensive operations should be refactored
>>>>> into a stream transformer that will move the workload from the stream
>>>>> sender to the stream receivers.
>>>>> https://ignite.apache.org/docs/latest/data-streaming#stream-transformer
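For illustration, a minimal sketch of the refactoring described just above, using StreamTransformer from the linked page. The cache name, the rawLines input, and expensiveParse are placeholders, and note the transformer works through cache.invoke(), so allowOverwrite(true) is required:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;
    import org.apache.ignite.stream.StreamTransformer;

    public class TransformerSketch {
        // Stand-in for whatever per-line work is expensive on the sender.
        static String expensiveParse(String raw) {
            return raw.split("\\|")[3];
        }

        static void load(Ignite ignite, Iterable<String> rawLines) {
            try (IgniteDataStreamer<String, String> stmr = ignite.dataStreamer("MY_TABLE")) {
                stmr.allowOverwrite(true); // required when a receiver/transformer is set

                stmr.receiver(StreamTransformer.from((entry, args) -> {
                    // Runs on the node that owns the key; args[0] is the raw value from addData().
                    entry.setValue(expensiveParse((String) args[0]));
                    return null;
                }));

                for (String line : rawLines) {
                    String key = line.split("\\|")[1]; // cheap key extraction stays on the sender
                    stmr.addData(key, line);           // heavy parsing happens on the receiver
                }
            }
        }
    }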
>>>>> Also gedanken: what if the data distribution is skewed such that one
>>>>> node gets more than 2x the data sent to other partitions because of
>>>>> affinity? In this case, for example, changes to the affinity/colocation
>>>>> design or changes to the cluster topology (more nodes with a greater
>>>>> CPU-to-RAM ratio?) can help distribute the load so that no single node
>>>>> becomes a bottleneck.
>>>>>
>>>>> On Tue, Feb 28, 2023 at 9:27 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>
>>>>>> Hi, I'm using the data streamer to insert into a 3-node cluster. I
>>>>>> have noticed that 1 node is pegged at 100% CPU while the others are
>>>>>> at 40-ish %.
>>>>>>
>>>>>> Is that normal?
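One more sender-side thing worth ruling out when a single node runs hot during streaming: the streamer's batching settings. A sketch with illustrative values only; the setters are standard IgniteDataStreamer knobs, the ignite instance is assumed, and the defaults may already be fine:

    // Sketch: tune how the sender batches and parallelizes work per node.
    IgniteDataStreamer<String, String> streamer = ignite.dataStreamer("MY_TABLE");

    streamer.perNodeBufferSize(1024);       // entries buffered per node before a batch is sent
    streamer.perNodeParallelOperations(8);  // batches allowed in flight to each node at once
    streamer.autoFlushFrequency(1000);      // ms; flush periodically instead of only at close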