Re: Sharing data in columnar storage between two applications
Not so much about between applications, rather between multiple frameworks within an application, but still related: https://cs.stanford.edu/~matei/papers/2017/cidr_weld.pdf

On Sun, Dec 25, 2016 at 8:12 PM, Kazuaki Ishizaki wrote:
> Here is an interesting discussion about sharing data in columnar storage
> between two applications:
> https://github.com/apache/spark/pull/15219#issuecomment-265835049
>
> One idea is to prepare separate interfaces (or traits) for reading and
> writing, so that each application implements only the class it needs (e.g.
> read or write). For example, FiloDB wants to provide a columnar storage
> that can be read from Spark; in that case, it is easy to implement only
> the read APIs for Spark.
> However, this may lead to an incompatibility in ColumnarBatch. ColumnarBatch
> keeps a set of ColumnVectors that can be read or written, so the ColumnVector
> class must have both read and write APIs. How can we plug in a new
> ColumnVector that has only read APIs? Here is an example that causes such an
> incompatibility:
> https://gist.github.com/kiszk/00ab7d0c69f0e598e383cdc8e72bcc4d
>
> Another possible idea is for both applications to support the Apache Arrow
> APIs. Other approaches could also work.
>
> Which approach would be good for all applications?
>
> Regards,
> Kazuaki Ishizaki
Sharing data in columnar storage between two applications
Here is an interesting discussion about sharing data in columnar storage between two applications: https://github.com/apache/spark/pull/15219#issuecomment-265835049

One idea is to prepare separate interfaces (or traits) for reading and writing, so that each application implements only the class it needs (e.g. read or write). For example, FiloDB wants to provide a columnar storage that can be read from Spark; in that case, it is easy to implement only the read APIs for Spark.

However, this may lead to an incompatibility in ColumnarBatch. ColumnarBatch keeps a set of ColumnVectors that can be read or written, so the ColumnVector class must have both read and write APIs. How can we plug in a new ColumnVector that has only read APIs? Here is an example that causes such an incompatibility: https://gist.github.com/kiszk/00ab7d0c69f0e598e383cdc8e72bcc4d

Another possible idea is for both applications to support the Apache Arrow APIs. Other approaches could also work.

Which approach would be good for all applications?

Regards,
Kazuaki Ishizaki
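The read/write split described above can be sketched roughly as follows. This is a minimal illustration, not Spark's actual API: the interface and class names (ReadableColumnVector, WritableColumnVector, ColumnarBatchSketch, ArrayBackedVector) are all hypothetical. The point is that a batch typed against the read-only interface can hold either Spark's own writable vectors or a read-only vector exposed by an external store such as FiloDB.

```java
// Hypothetical sketch: split column access into a read-only interface so an
// external columnar store only implements reads, while the engine's own
// vectors also implement writes. All names here are illustrative.
interface ReadableColumnVector {
    int getInt(int rowId);
    boolean isNullAt(int rowId);
}

interface WritableColumnVector extends ReadableColumnVector {
    void putInt(int rowId, int value);
    void putNull(int rowId);
}

// A batch that only requires ReadableColumnVector can hold either kind,
// avoiding the incompatibility of forcing write APIs onto read-only vectors.
final class ColumnarBatchSketch {
    private final ReadableColumnVector[] columns;

    ColumnarBatchSketch(ReadableColumnVector[] columns) {
        this.columns = columns;
    }

    ReadableColumnVector column(int i) {
        return columns[i];
    }
}

// Minimal read-only vector backed by an int array, standing in for what an
// external store might expose to Spark.
final class ArrayBackedVector implements ReadableColumnVector {
    private final int[] data;

    ArrayBackedVector(int[] data) {
        this.data = data;
    }

    public int getInt(int rowId) {
        return data[rowId];
    }

    public boolean isNullAt(int rowId) {
        return false; // this toy vector has no nulls
    }
}
```

With this split, only code paths that actually mutate vectors need the writable type, which is the read-only-provider scenario the message describes.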
Re: Spark structured streaming from Kafka - last message processed again after resume from checkpoint
Hi Niek,

That's expected. I just answered on Stack Overflow.

On Sun, Dec 25, 2016 at 8:07 AM, Niek wrote:
> Hi,
>
> I described my issue in full detail on
> http://stackoverflow.com/questions/41300223/spark-structured-steaming-from-kafka-last-message-processed-again-after-resume
>
> Any idea what's going wrong?
>
> Looking at the code base at
> https://github.com/apache/spark/blob/3f62e1b5d9e75dc07bac3aa4db3e8d0615cc3cc3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L290,
> I don't understand why you are resuming with an already committed offset
> (the one from currentBatchId - 1).
>
> Thanks,
>
> Niek.
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-structured-steaming-from-kafka-last-message-processed-again-after-resume-from-checkpoint-tp20353.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Spark structured streaming from Kafka - last message processed again after resume from checkpoint
Hi,

I described my issue in full detail on http://stackoverflow.com/questions/41300223/spark-structured-steaming-from-kafka-last-message-processed-again-after-resume

Any idea what's going wrong?

Looking at the code base at https://github.com/apache/spark/blob/3f62e1b5d9e75dc07bac3aa4db3e8d0615cc3cc3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L290, I don't understand why you are resuming with an already committed offset (the one from currentBatchId - 1).

Thanks,

Niek.
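For context on why replaying the last batch is "expected": after recovering from a checkpoint, a streaming engine with at-least-once delivery may re-run the most recent batch, and the sink is responsible for handling the replay idempotently, typically by tracking the last committed batch id. Below is a minimal sketch of that idea under those assumptions; the class and method names (IdempotentSink, addBatch) are hypothetical and do not represent Spark's actual Sink API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an idempotent sink: if the engine replays a batch
// it already delivered (e.g. after resuming from a checkpoint), the sink
// recognizes the batch id and skips it, so output is not duplicated.
final class IdempotentSink {
    private long lastCommittedBatchId = -1L;
    private final List<String> output = new ArrayList<>();

    // Returns true if the batch was applied, false if it was a replay.
    boolean addBatch(long batchId, List<String> rows) {
        if (batchId <= lastCommittedBatchId) {
            return false; // already committed: ignore the replayed batch
        }
        output.addAll(rows);
        lastCommittedBatchId = batchId;
        return true;
    }

    List<String> output() {
        return output;
    }
}
```

Under this scheme, re-processing the batch from currentBatchId - 1 on resume is harmless: the replayed write is detected by its batch id and dropped, giving effectively-once output from at-least-once processing.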