[jira] [Commented] (PIO-38) add Apache Parquet as a data source

Wojciech Indyk (JIRA) Mon, 26 Sep 2016 11:46:46 -0700

    [ 
https://issues.apache.org/jira/browse/PIO-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523861#comment-15523861
 ]


Wojciech Indyk commented on PIO-38:
-----------------------------------

Hello [~Ziemin]! Sorry for the late response.
I would like to be able to provide events to PredictionIO from the place where
I already store them. As far as I can see, PredictionIO works with the pair
Elasticsearch+HBase, so to use Elasticsearch as a backend I would also need to
use HBase as the event store. I don't know PredictionIO that well, so correct
me if I'm wrong.
I don't want to use HBase, because it enlarges my technology stack and brings
no benefit when training a model in batch. Parquet is better suited to this
case: I append to my archive of events once a day and can then use (a subset
of) this data to train a recommendation model without duplicating it in HBase.
Is that clear enough?

> add Apache Parquet as a data source
> -----------------------------------
>
>                 Key: PIO-38
>                 URL: https://issues.apache.org/jira/browse/PIO-38
>             Project: PredictionIO
>          Issue Type: New Feature
>            Reporter: Wojciech Indyk
>              Labels: features
>
> Apache Parquet (https://parquet.apache.org/) is a columnar storage format,
> natively supported by Apache Spark and very well suited to storing batch data
> (as an input) for a PredictionIO engine.
> Parquet is very popular for archiving clickstream data, so supporting it would
> make it possible to use PredictionIO without an additional import (and
> duplication) of data into HBase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
