3000 filters don't look reasonable. A query like that is very difficult to 
test and verify, and nearly impossible to maintain.
Could it be that your filters are really another table that you should join with?
The example is a little too artificial to understand the underlying business 
case. Can you provide a more realistic one?

Maybe a Bloom filter or something similar could make sense for you? Basically 
you want to know whether the key pair is in a given set of pairs?
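
A minimal sketch of that set-membership idea, assuming Spark/Scala with 
hypothetical column names and paths; with only a few thousand pairs an exact 
broadcast set is tiny, so a probabilistic Bloom filter is not strictly needed:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pair-membership").getOrCreate()

// The ~3000 (key1, key2) pairs, however they are obtained (DB, file, ...).
val wantedPairs: Set[(String, String)] = Set(
  ("something", "somethingElse"),
  ("someOtherThing", "someAnotherThing"))

// Ship the small set to the executors once.
val wanted = spark.sparkContext.broadcast(wantedPairs)

// Hypothetical input files with columns key1, key2, value.
val records = spark.read.option("header", "true").csv("/data/incoming/*.csv")

// Keep only records whose (key1, key2) pair is in the wanted set.
val matches = records.filter(r =>
  wanted.value.contains((r.getAs[String]("key1"), r.getAs[String]("key2"))))

println(s"matching records: ${matches.count()}")

An exact set also avoids the false positives a Bloom filter would introduce; 
the Bloom filter only pays off once the pair set no longer fits in memory.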

> On 15. May 2018, at 11:48, Alessandro Solimando 
> <alessandro.solima...@gmail.com> wrote:
> 
> From the information you provided I would tackle this as a batch problem, 
> because this way you have access to more sophisticated techniques and more 
> flexibility (maybe HDFS and a Spark job, but also think about a datastore 
> offering good indexes for the kinds of data types and values you have for 
> your keys, so you can benefit from filter push-down).
> 
> I personally use streaming only when real-time ingestion is needed.
> 
> Hth,
> Alessandro
> 
>> On 15 May 2018 at 09:11, onmstester onmstester <onmstes...@zoho.com> wrote:
>> 
>> How many distinct key1 (resp. key2) values do you have? Are these values 
>> reasonably stable over time?
>> 
>> Fewer than 10 thousand, and these filters would change every 2-3 days. They 
>> would be written to and loaded from a database.
>> 
>> Are these records ingested in real-time or are they loaded from a datastore?
>> 
>> Records would be loaded from text files that are copied into a directory 
>> over and over.
>> 
>> Are you suggesting that I don't need to use Spark Streaming?
>> 
>> 
>> 
>> ---- On Tue, 15 May 2018 11:26:42 +0430 Alessandro Solimando 
>> <alessandro.solima...@gmail.com> wrote ----
>> 
>> Hi,
>> I am not familiar with ATNConfigSet, but here are some thoughts that might help.
>> 
>> How many distinct key1 (resp. key2) values do you have? Are these values 
>> reasonably stable over time?
>> 
>> Are these records ingested in real-time or are they loaded from a datastore?
>> 
>> In the latter case the DB might be able to perform the filtering efficiently, 
>> especially if equipped with a proper index over key1/key2 (or a composite one).
>> 
>> In such a case the filter push-down could be very effective (I didn't get 
>> whether you just need to count or do something more with the matching records).
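>> 
>> A rough sketch of the push-down idea, assuming a JDBC source with hypothetical 
>> connection and table names; Spark pushes simple comparison predicates down to 
>> the database, which can then answer them from a composite index on (key1, key2):
>> 
>> import org.apache.spark.sql.SparkSession
>> import org.apache.spark.sql.functions.col
>> 
>> val spark = SparkSession.builder().appName("pushdown-example").getOrCreate()
>> 
>> // Hypothetical JDBC source; URL and table name are assumptions.
>> val records = spark.read.format("jdbc")
>>   .option("url", "jdbc:postgresql://dbhost:5432/mydb")
>>   .option("dbtable", "records")
>>   .load()
>> 
>> // Equality predicates like these are pushed down to the JDBC source, where a
>> // composite index on (key1, key2) lets the database answer them cheaply.
>> val matches = records
>>   .filter(col("key1") === "something" && col("key2") === "somethingElse")
>> println(matches.count())
>> 
>> With 3000 pairs you would still not want to inline them all into one predicate; 
>> loading them into a small table in the same database and joining there is the 
>> cleaner variant of the same idea.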
>> 
>> Alternatively, you could try to group by (key1,key2), and then filter (it 
>> again depends on the kind of output you have in mind).
>> 
>> If the datastore/stream is distributed and supports partitioning, you could 
>> partition your records by either key1 or key2 (or key1+key2), so they are 
>> already "separated" and can be consumed more efficiently (e.g., the group-by 
>> could then be local to a single partition).
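>> 
>> A small sketch of the partition-then-aggregate idea, again assuming Spark/Scala 
>> with hypothetical column names and paths; repartitioning by the keys co-locates 
>> equal pairs, so the subsequent group-by should need no further shuffle:
>> 
>> import org.apache.spark.sql.SparkSession
>> import org.apache.spark.sql.functions.col
>> 
>> val spark = SparkSession.builder().appName("partition-groupby").getOrCreate()
>> 
>> // Hypothetical input files with columns key1, key2, value.
>> val records = spark.read.option("header", "true").csv("/data/incoming/*.csv")
>> 
>> // Repartition by the composite key so all records of a pair land in the same
>> // partition, then count occurrences per (key1, key2).
>> val counts = records
>>   .repartition(col("key1"), col("key2"))
>>   .groupBy(col("key1"), col("key2"))
>>   .count()
>> 
>> counts.show(20, truncate = false)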
>> 
>> Best regards,
>> Alessandro
>> 
>> On 15 May 2018 at 08:32, onmstester onmstester <onmstes...@zoho.com> wrote:
>> 
>> 
>> Hi, 
>> 
>> I need to run some queries on a huge amount of input records. The input rate 
>> is 100K records/second.
>> A record looks like (key1, key2, value), and the application should report 
>> occurrences of key1 == something && key2 == somethingElse.
>> The problem is that there are too many filters in my query: more than 3 thousand 
>> pairs of key1 and key2 should be filtered.
>> I was simply putting 1 million records in a temp table each time and running a 
>> SQL query with Spark SQL on the temp table:
>> select * from mytemptable where (key1 = something && key2 == somethingElse) 
>> or (key1 = someOtherThing && key2 == someAnotherThing) or ... (3 thousand 
>> ORs!!!)
>> And I encounter a StackOverflowError at ATNConfigSet.java line 178.
>> 
>> So I have two options IMHO:
>> 1. Either put all key1/key2 filter pairs in another temp table and do a join 
>> between the two temp tables (see the sketch after this list), or
>> 2. Use Spark Streaming, which I'm not familiar with, and I don't know whether 
>> it could handle 3K filters.
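>> 
>> A minimal sketch of option 1, assuming Spark/Scala with hypothetical file paths 
>> and column names; since the ~3000 pairs form a tiny table, a broadcast join 
>> avoids shuffling the large side:
>> 
>> import org.apache.spark.sql.SparkSession
>> import org.apache.spark.sql.functions.broadcast
>> 
>> val spark = SparkSession.builder().appName("filter-pair-join").getOrCreate()
>> 
>> // Large input: records with columns key1, key2, value (hypothetical path).
>> val records = spark.read.option("header", "true").csv("/data/incoming/*.csv")
>> 
>> // Small table holding the ~3000 (key1, key2) pairs to keep (hypothetical path).
>> val filterPairs = spark.read.option("header", "true").csv("/data/filters/pairs.csv")
>> 
>> // Inner join on both keys; broadcasting the small side keeps it a map-side join.
>> val matches = records.join(broadcast(filterPairs), Seq("key1", "key2"))
>> 
>> // Report occurrences per pair.
>> matches.groupBy("key1", "key2").count().show(20, truncate = false)
>> 
>> If the pairs live in a database instead of a file, the same join works after 
>> reading them with spark.read.jdbc and registering both sides as temp views.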
>> 
>> Which way do you suggest? What is the best solution for my problem, 
>> performance-wise?
>> 
>> Thanks in advance
>> 
>> 
> 
