Hi Saif,

This depends on your use cases. For example, do you want to do a full table scan every time? Or do you want to fetch a specific row? Or do you want to run temporal queries? Do you have security concerns when choosing your target-side data store?
Offloading a huge table is also very expensive. It is time-consuming, and if the source side is a mainframe, it can also eat a lot of MIPS. Thus, the best approach is to save the data to persistent storage without any transformation, and then transform and store it according to your query types.

Thanks,

Xiao Li

2015-10-09 11:25 GMT-07:00 <saif.a.ell...@wellsfargo.com>:
> Hi all,
>
> I am in the process of learning big data.
> Right now, I am bringing huge databases through JDBC into Spark (a 250
> million row table can take around 3 hours), and then re-saving them into
> JSON, which is fast, simple, distributed, fail-safe and preserves data
> types, although without any compression.
>
> Reading the distributed JSON back takes, for this amount of data, around
> 2-3 minutes and works well enough for me. But do you suggest or prefer any
> other format for intermediate storage, for fast reading with proper types?
> Not only as an intermediate layer between a networked database and Spark,
> but also for intermediate DataFrame transformations, to have the data
> ready for processing.
>
> I have tried CSV, but automatic type inference does not usually fit my
> needs and takes a long time. I haven't tried Parquet since they fixed it
> for 1.5, but that is also another option.
> What do you also think of HBase, Hive or any other option?
>
> Looking for insights!
> Saif
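For what it's worth, here is a minimal Scala sketch of the "land it raw, transform later" approach Xiao describes, using Spark's partitioned JDBC reader and Parquet for the landing copy. The JDBC URL, credentials, table and column names (BIG_TABLE, ID_COL, EVENT_DATE), bounds, partition count, and HDFS paths are all hypothetical placeholders, not anything taken from the thread:

import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object JdbcOffloadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-offload-sketch"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical connection details -- replace with your own.
    val url = "jdbc:db2://dbhost:50000/MYDB"
    val props = new Properties()
    props.setProperty("user", "myuser")
    props.setProperty("password", "mypassword")

    // Partitioned JDBC read: Spark opens one connection per partition, each
    // fetching a slice of the partition column, instead of pulling all the
    // rows through a single connection.
    val raw = sqlContext.read.jdbc(
      url,
      "BIG_TABLE",   // hypothetical source table
      "ID_COL",      // partition column: numeric, roughly evenly distributed
      0L,            // lower bound of ID_COL
      250000000L,    // upper bound of ID_COL
      64,            // number of partitions (parallel slices)
      props)

    // Land the data once, untransformed, in a compressed columnar format.
    raw.write.mode(SaveMode.Overwrite).parquet("hdfs:///landing/big_table_raw")

    // Later: read the landed copy back (schema and types are preserved) and
    // build query-specific layouts without touching the source database again.
    val landed = sqlContext.read.parquet("hdfs:///landing/big_table_raw")
    landed.filter("EVENT_DATE >= '2015-01-01'")
      .write.mode(SaveMode.Overwrite)
      .parquet("hdfs:///marts/big_table_2015")

    sc.stop()
  }
}

Parquet keeps the schema (and hence column types) with the data and compresses well, so reading the landed copy back is typically much cheaper than re-parsing JSON or inferring types from CSV; how the landed data is then laid out or keyed (Parquet for scans, something like HBase for point lookups, etc.) follows from the query patterns Xiao asks about above.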