realno commented on issue #30:
URL: https://github.com/apache/arrow-ballista/issues/30#issuecomment-1133527214

   @thinkharderdev Thanks for the nice write-up. Among the three options, I am 
leaning towards option 3 as the most realistic to achieve. That way, the 
community can focus on making the project work really well for a specific set 
of use cases first, which will hopefully grow the community further. 
   
   On streaming vs. batch, I don't have a strong opinion at this point; I 
believe whoever can apply it to a real use case should try to drive the project 
forward. 
   
   We are not ready yet to do anything serious, though we have two major use 
cases in mind:
   1. Replace Spark for batch processing: 100s of billions of rows regularly, 
with mega plans (100s of millions of expressions and projections)
   2. High-concurrency, medium-latency (sub-second) queries against data storage 
of 100s of billions of rows; result sets are not large
   
   Use case 2 seems similar to yours. I am curious: when you said "fully 
streaming execution", did you mean something like Flink? I think there is value 
in supporting operators/algorithms that need to see the entire 
dataset/partition multiple times (for example, ML), so a hybrid model would be 
good. For example, the compiler could analyze the query and turn part of it 
"fully streaming" where possible.
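   To sketch what that hybrid analysis could look like (this is purely 
illustrative, not the Ballista API — the `Operator`, `ExecMode`, and `plan_mode` 
names are hypothetical): the planner classifies each operator as 
streaming-friendly or pipeline-breaking, and a plan can run "fully streaming" 
only if every operator in it streams.

   ```rust
   // Hypothetical sketch: classify operators so a planner could run a
   // streaming-only prefix of the query fully pipelined and fall back
   // to batch execution for pipeline-breaking operators.

   #[derive(Debug, PartialEq)]
   enum ExecMode {
       Streaming, // emits output incrementally, bounded memory
       Batch,     // must materialize its whole input before emitting
   }

   // Simplified operator kinds, for illustration only.
   enum Operator {
       Scan,
       Filter,
       Project,
       HashAggregate, // needs all input rows before producing results
       Sort,          // likewise blocking
   }

   fn exec_mode(op: &Operator) -> ExecMode {
       match op {
           Operator::Scan | Operator::Filter | Operator::Project => ExecMode::Streaming,
           Operator::HashAggregate | Operator::Sort => ExecMode::Batch,
       }
   }

   // A plan can execute fully streaming only if every operator streams.
   fn plan_mode(plan: &[Operator]) -> ExecMode {
       if plan.iter().all(|op| exec_mode(op) == ExecMode::Streaming) {
           ExecMode::Streaming
       } else {
           ExecMode::Batch
       }
   }

   fn main() {
       let pipelined = [Operator::Scan, Operator::Filter, Operator::Project];
       let blocking = [Operator::Scan, Operator::Filter, Operator::HashAggregate];
       assert_eq!(plan_mode(&pipelined), ExecMode::Streaming);
       assert_eq!(plan_mode(&blocking), ExecMode::Batch);
   }
   ```

   A real analysis would of course walk the plan tree rather than a flat list, 
and could split the plan at the first pipeline breaker instead of falling back 
wholesale.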
   
   Some other requirements for us are:
   1. easier and cheaper to operate than Spark 
   2. native k8s support: delegate resource and cluster management entirely 
to k8s


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
