Re: Sharing data in columnar storage between two applications

2016-12-25 Thread Mark Hamstra
Not so much about sharing between applications as between multiple frameworks
within an application, but still related:
https://cs.stanford.edu/~matei/papers/2017/cidr_weld.pdf

On Sun, Dec 25, 2016 at 8:12 PM, Kazuaki Ishizaki 
wrote:

> Here is an interesting discussion about sharing data in columnar storage
> between two applications.
> https://github.com/apache/spark/pull/15219#issuecomment-265835049
>
> One of the ideas is to prepare interfaces (or traits) only for read or
> write. Each application can then implement only the class it needs (e.g.
> read or write). For example, FiloDB wants to provide a columnar storage
> that can be read from Spark. In that case, it only needs to implement the
> read APIs for Spark. These two classes can be prepared separately.
> However, it may lead to an incompatibility in ColumnarBatch. ColumnarBatch
> keeps a set of ColumnVectors that can be read or written, so the ColumnVector
> class should have both read and write APIs. How can we introduce a new
> ColumnVector with only read APIs? Here is an example showing such an
> incompatibility:
> https://gist.github.com/kiszk/00ab7d0c69f0e598e383cdc8e72bcc4d
>
> Another possible idea is that both applications support the Apache Arrow
> APIs. Other approaches are also possible.
>
> What approach would be good for all applications?
>
> Regards,
> Kazuaki Ishizaki
>


Sharing data in columnar storage between two applications

2016-12-25 Thread Kazuaki Ishizaki
Here is an interesting discussion about sharing data in columnar storage
between two applications.
https://github.com/apache/spark/pull/15219#issuecomment-265835049

One of the ideas is to prepare interfaces (or traits) only for read or
write. Each application can then implement only the class it needs (e.g.
read or write). For example, FiloDB wants to provide a columnar storage
that can be read from Spark. In that case, it only needs to implement the
read APIs for Spark. These two classes can be prepared separately.
However, it may lead to an incompatibility in ColumnarBatch. ColumnarBatch
keeps a set of ColumnVectors that can be read or written, so the ColumnVector
class should have both read and write APIs. How can we introduce a new
ColumnVector with only read APIs? Here is an example showing such an
incompatibility:
https://gist.github.com/kiszk/00ab7d0c69f0e598e383cdc8e72bcc4d
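
To make the split concrete, below is a minimal Scala sketch of this idea.
The trait and class names are only illustrative, not the actual Spark
ColumnVector API: a read-only trait plus a writable extension, so a
read-only source such as FiloDB only needs to implement the read side.

// Illustrative sketch only; these are not the real Spark ColumnVector classes.

// Read-only view: all that a scan-only source (e.g. FiloDB) has to implement.
trait ReadableColumnVector {
  def isNullAt(rowId: Int): Boolean
  def getInt(rowId: Int): Int
  def getDouble(rowId: Int): Double
}

// Writable extension: only implemented by components that build batches.
trait WritableColumnVector extends ReadableColumnVector {
  def putNull(rowId: Int): Unit
  def putInt(rowId: Int, value: Int): Unit
  def putDouble(rowId: Int, value: Double): Unit
}

// A batch that only demands the read-only trait can hold vectors from
// either side, so a read-only source never has to stub out write methods.
class ReadOnlyColumnarBatch(
    val columns: Array[ReadableColumnVector],
    val numRows: Int)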

Another possible idea is that both applications support the Apache Arrow
APIs. Other approaches are also possible.
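
As a rough illustration of the Arrow idea (a sketch only, written against a
recent Arrow Java API; the column name, values, and allocator setup are
placeholders): one side fills an Arrow vector with the write API, the other
side reads it back with the read API, and neither depends on the other's
internal column classes.

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector

object ArrowSharingSketch {
  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)

    // "Writer" application: materializes a column as an Arrow vector.
    val vector = new IntVector("my_column", allocator) // placeholder column name
    vector.allocateNew(3)
    vector.set(0, 10)
    vector.set(1, 20)
    vector.set(2, 30)
    vector.setValueCount(3)

    // "Reader" application: consumes the same memory through the Arrow read API.
    var i = 0
    while (i < vector.getValueCount) {
      println(vector.get(i))
      i += 1
    }

    vector.close()
    allocator.close()
  }
}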

What approach would be good for all applications?

Regards,
Kazuaki Ishizaki



Re: Spark structured streaming from Kafka - last message processed again after resume from checkpoint

2016-12-25 Thread Shixiong(Ryan) Zhu
Hi Niek,

That's expected. I just answered on Stack Overflow.
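
In short (a rough summary, not the full Stack Overflow answer): on restart,
Structured Streaming re-executes the last batch whose offsets were already
written to the checkpoint's offset log but whose completion was not yet
recorded, so end-to-end exactly-once relies on the sink handling a replayed
batch idempotently. A minimal Scala sketch of a checkpointed Kafka query;
the broker, topic, and paths below are placeholders:

import org.apache.spark.sql.SparkSession

object CheckpointedKafkaQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-checkpoint-demo").getOrCreate()

    // Source: read a Kafka topic as a stream.
    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder brokers
      .option("subscribe", "my_topic")                      // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // Sink: after a restart, the batch that was in flight when the query
    // stopped is re-run from the offsets recorded under checkpointLocation,
    // so the sink may see the same batch of data again and must tolerate it.
    val query = messages.writeStream
      .format("parquet")
      .option("path", "/tmp/output")                   // placeholder output path
      .option("checkpointLocation", "/tmp/checkpoint") // placeholder checkpoint dir
      .start()

    query.awaitTermination()
  }
}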

On Sun, Dec 25, 2016 at 8:07 AM, Niek  wrote:

> Hi,
>
> I described my issue in full detail on
> http://stackoverflow.com/questions/41300223/spark-structured-steaming-from-kafka-last-message-processed-again-after-resume
>
> Any idea what's going wrong?
>
> Looking at the code base at
> https://github.com/apache/spark/blob/3f62e1b5d9e75dc07bac3aa4db3e8d0615cc3cc3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L290,
> I don't understand why you are resuming with an already committed offset
> (the one from currentBatchId - 1).
>
> Thanks,
>
> Niek.
>


Spark structured streaming from Kafka - last message processed again after resume from checkpoint

2016-12-25 Thread Niek
Hi,

I described my issue in full detail on
http://stackoverflow.com/questions/41300223/spark-structured-steaming-from-kafka-last-message-processed-again-after-resume

Any idea what's going wrong?

Looking at the code base at
https://github.com/apache/spark/blob/3f62e1b5d9e75dc07bac3aa4db3e8d0615cc3cc3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L290,
I don't understand why you are resuming with an already committed offset
(the one from currentBatchId - 1).

Thanks,

Niek.


