Re: Best storage format for intermediate process

2015-10-09 Thread Xiao Li
Hi, Saif,

This depends on your use cases. For example, do you want to do a full table
scan every time? Look up a specific row? Run temporal queries? Do you have
any security concerns when choosing your target-side data store?

Offloading a huge table is also very expensive and time consuming. If the
source side is a mainframe, it can also eat a lot of MIPS. Thus, the best
approach is to save the data to persistent storage without any transformation
first, and then transform and store it based on your query types.
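
For example, something along these lines (a rough sketch with the Spark 1.5
DataFrame API; the connection details, paths, and the load_date partition
column are placeholders):

import java.util.Properties
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext sc.
val sqlContext = new SQLContext(sc)
val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "dbpass")

// Step 1: offload the source table once, unchanged, onto persistent storage.
sqlContext.read
  .jdbc("jdbc:db2://dbhost:50000/SAMPLE", "SCHEMA.BIG_TABLE", props)
  .write.mode("overwrite").parquet("hdfs:///landing/big_table_raw")

// Step 2: transform and re-store according to how it will be queried, e.g.
// partition by a date column so temporal queries scan only a few directories.
sqlContext.read.parquet("hdfs:///landing/big_table_raw")
  .write.mode("overwrite").partitionBy("load_date").parquet("hdfs:///curated/big_table")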

Thanks,

Xiao Li



Best storage format for intermediate process

2015-10-09 Thread Saif.A.Ellafi
Hi all,

I am in the process of learning big data.
Right now, I am pulling huge database tables into Spark through JDBC (a 250
million row table can take around 3 hours), and then re-saving them as JSON,
which is fast, simple, distributed, and fail-safe, and it preserves data
types, although without any compression.
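
For reference, a rough sketch of this kind of JDBC-to-JSON pull with the Spark
1.5 DataFrame API; the connection details, bounds, and column names are just
placeholders, and the partitioned form of read.jdbc is assumed so the scan is
not funneled through a single connection:

import java.util.Properties
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext sc

val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "dbpass")

// Partitioned JDBC read: each partition scans its own slice of the ID range
// through its own connection, instead of one serial full-table scan.
val df = sqlContext.read.jdbc(
  "jdbc:oracle:thin:@//dbhost:1521/service",  // url
  "BIG_TABLE",                                // table
  "ID",                                       // numeric column to split on
  1L,                                         // lowerBound
  250000000L,                                 // upperBound
  64,                                         // numPartitions
  props)

// Re-save as distributed JSON: one JSON record per row, typed but uncompressed.
df.write.mode("overwrite").json("hdfs:///intermediate/big_table_json")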

Reading that amount of data back from distributed JSON takes around 2-3
minutes and works well enough for me. But would you suggest or prefer any
other format for intermediate storage, with fast reads and proper type
handling?
This is not only for the intermediate step between a network database and
Spark, but also for intermediate DataFrame transformations, to have the data
ready for processing.
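
Roughly what I mean, as a sketch with placeholder paths and column names: keep
the intermediate DataFrame cached while the application runs, and only write
it out when it has to survive longer than that.

import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext sc

// An intermediate transformation kept ready for repeated processing within the
// same application (the filter condition is only a placeholder).
val df = sqlContext.read.json("hdfs:///intermediate/big_table_json")
val cleaned = df.filter("AMOUNT IS NOT NULL").persist(StorageLevel.MEMORY_AND_DISK)
cleaned.count()  // force materialization so later stages reuse the cached copy

// Only write the intermediate result out if it has to outlive this application.
cleaned.write.mode("overwrite").parquet("hdfs:///intermediate/cleaned")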

I have tried CSV, but its type inference does not usually fit my needs and
takes a long time. I haven't tried Parquet since they fixed it for 1.5, but
that is also an option.
What do you think of HBase, Hive, or any other store?
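
As a sketch of the Parquet option (placeholder paths again): the schema is
stored with the files and the data is compressed, so reading it back needs no
inference pass and the files are much smaller than plain JSON.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext sc

// Write the intermediate data as Parquet instead of JSON: columnar, compressed,
// and the schema travels with the files.
val df = sqlContext.read.json("hdfs:///intermediate/big_table_json")
df.write.mode("overwrite").parquet("hdfs:///intermediate/big_table_parquet")

// Reading back picks up the stored schema and types as-is.
val back = sqlContext.read.parquet("hdfs:///intermediate/big_table_parquet")
back.printSchema()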

Looking for insights!
Saif