Re: Best practices for storing data in Parquet files

2016-08-29 Thread Mich Talebzadeh
Hi Kevin. When you say Kafka is interacting with the Oracle database (if I understand you correctly), are you using GoldenGate with the Kafka interface to push data from Oracle to Kafka? HTH, Dr Mich Talebzadeh, LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Best practices for storing data in Parquet files

2016-08-28 Thread Chanh Le
> Does a Parquet file have a size limit (1 TB)?
I didn't see any problem, but 1 TB is too big to operate on; it needs to be divided into smaller pieces.
> Should we use SaveMode.Append for a long-running streaming app?
Yes, but you need to partition it by time so that it is easy to maintain, e.g. to update or delete a partition.
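
A minimal sketch of this advice, assuming a DataFrame with an event_date column and the HDFS path /data/events (both illustrative, not from the thread):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object PartitionedParquetWrite {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("partitioned-parquet-write").getOrCreate()
        import spark.implicits._

        // Stand-in for one micro-batch of data arriving from the stream.
        val events = Seq(
          ("2016-08-28", "k1", "v1"),
          ("2016-08-29", "k2", "v2")
        ).toDF("event_date", "key", "value")

        // Partitioning by a time column keeps each day in its own directory,
        // so one day can be rewritten or dropped without touching the rest;
        // append mode only adds new files, never rewrites existing ones.
        events.write
          .mode(SaveMode.Append)
          .partitionBy("event_date")
          .parquet("hdfs:///data/events")
      }
    }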

Re: Best practices for storing data in Parquet files

2016-08-28 Thread Kevin Tran
Hi Mich, my stack is as follows:
Data sources:
* IBM MQ
* Oracle database
Kafka stores all messages from the data sources. Spark Streaming fetches messages from Kafka, does a bit of transformation, and writes Parquet files to HDFS. Hive / Spark SQL / Impala will query the Parquet files. Do you have any suggestions?
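
A minimal sketch of the Spark Streaming leg of this pipeline, assuming the 0.8 direct-stream Kafka connector, a topic named "ingest", a broker at broker1:9092, and the HDFS path /data/messages (all illustrative, not from the thread):

    import kafka.serializer.StringDecoder
    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions.current_date
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaToParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-to-parquet").getOrCreate()
        val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("ingest"))

        stream.foreachRDD { rdd =>
          if (!rdd.isEmpty()) {
            import spark.implicits._
            // Keep the message value and stamp a date column to partition by;
            // the real transformation logic would go here.
            val df = rdd.map(_._2).toDF("value").withColumn("event_date", current_date())
            // Append mode adds new Parquet files under the matching
            // partition directories on every micro-batch.
            df.write.mode(SaveMode.Append).partitionBy("event_date").parquet("hdfs:///data/messages")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

A batch interval of a minute or more keeps the number of Parquet files per partition manageable; very short intervals produce many small files, which hurts downstream Hive/Impala scans.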

Re: Best practices for storing data in Parquet files

2016-08-28 Thread Mich Talebzadeh
Hi, Can you explain your particular stack? For example, what is the source of the streaming data, and what role does Spark play? Are you dealing with real-time and batch workloads, and why Parquet rather than something like HBase for real-time ingestion? HTH, Dr Mich Talebzadeh

Best practices for storing data in Parquet files

2016-08-28 Thread Kevin Tran
Hi, Does anyone know the best practices for storing data in Parquet files? Do Parquet files have a size limit (1 TB)? Should we use SaveMode.Append for a long-running streaming app? How should we store the data in HDFS (directory structure, ...)? Thanks, Kevin.
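
For the directory-structure question, one common layout (an illustration, not from the thread) is the Hive-style partitioning that Spark's partitionBy produces, e.g. /data/events/event_date=2016-08-28/part-...parquet. Engines that understand this layout can prune whole directories when a query filters on the partition column:

    import org.apache.spark.sql.SparkSession

    object ReadPartitioned {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("read-partitioned").getOrCreate()
        // With a layout like
        //   /data/events/event_date=2016-08-28/part-...parquet
        //   /data/events/event_date=2016-08-29/part-...parquet
        // a filter on event_date reads only the matching directory.
        val aug28 = spark.read
          .parquet("hdfs:///data/events")
          .where("event_date = '2016-08-28'")
        println(aug28.count())
      }
    }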