On Mon, 16 Sep 2013 14:04:13 -0700 (PDT)
Jason H <scorp...@yahoo.com> wrote:

> I'm transitioning my job from embedded space to Hadoop space. I was wondering 
> if it is possible to come up with a SQLite cluster adaptation.
> 
> I will give you a crash course in Hadoop. Basically, we get a very large CSV, 
> which is chopped up into 64MB chunks and distributed to a number of nodes. 
> The file is also replicated twice, for a total of 3 copies of every chunk 
> on the cluster (no two copies of a chunk are stored on the same node). Then 
> the MapReduce logic is run and the results are combined. Instrumental to this 
> is that the keys are returned in sorted order.
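> 
> To make that concrete, here is a toy version of the pipeline (chunking, map, 
> shuffle with sorted keys, reduce) in plain Python. The file name and the 
> count-per-key logic are just placeholders, and real Hadoop does the splitting, 
> replication, and shuffling for you:
> 
>     import csv, itertools
> 
>     CHUNK_BYTES = 64 * 1024 * 1024            # Hadoop-style 64MB splits
> 
>     def chunks(path, size=CHUNK_BYTES):
>         """Yield the CSV in roughly 64MB, line-aligned pieces."""
>         with open(path, "r", newline="") as f:
>             buf, used = [], 0
>             for line in f:
>                 buf.append(line)
>                 used += len(line)
>                 if used >= size:
>                     yield buf
>                     buf, used = [], 0
>             if buf:
>                 yield buf
> 
>     def map_chunk(lines):
>         """Map phase: emit (key, value) pairs from one chunk."""
>         for row in csv.reader(lines):
>             yield row[0], 1                   # e.g. count rows per key
> 
>     def run(path="input.csv"):
>         pairs = []
>         for chunk in chunks(path):            # in Hadoop: one mapper per chunk
>             pairs.extend(map_chunk(chunk))
>         pairs.sort(key=lambda kv: kv[0])      # shuffle: keys arrive sorted
>         for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
>             yield key, sum(v for _, v in group)   # reduce phase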
> 
> All of this is done in Java (70% slower than C, on average, and with some 
> non-trivial start-up cost). Everyone is clamoring for SQL to be run on the 
> nodes. Hive attempts to leverage SQL, and is successful to some degree, but 
> being able to use full SQL would be a huge improvement. Akin to Hadoop is 
> HBase.
> 
> HBase is similar to Hadoop, but it approaches things in a more conventional 
> columnar format; it is a copy of Google's "BigTable". Here, the notion of 
> "column families" is important because column families are files. A row is 
> made up of a key and at least one column family. There is an implied join 
> between the key and each column family; when the table is viewed, it is 
> viewed as a join between the key and all column families. What goes into a 
> column family (cf) is not specified, but the idea is to group columns into 
> cfs by usage: cf1 holds your most commonly needed data, and cfN the least 
> often needed.
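> 
> A minimal sketch of that row model in SQLite terms (the table and column 
> names are made up): each column family is its own table keyed by the row key, 
> and the full row is the implied join:
> 
>     import sqlite3
> 
>     db = sqlite3.connect(":memory:")
>     # cf1: the frequently read columns; cf2: the rarely read ones.
>     db.execute("CREATE TABLE cf1 (row_key TEXT PRIMARY KEY, name TEXT, status TEXT)")
>     db.execute("CREATE TABLE cf2 (row_key TEXT PRIMARY KEY, notes TEXT)")
>     db.execute("INSERT INTO cf1 VALUES ('r1', 'alice', 'active')")
>     db.execute("INSERT INTO cf2 VALUES ('r1', 'a long, rarely used note')")
> 
>     # The "row" the client sees is the implied join on the row key.
>     row = db.execute("""
>         SELECT cf1.row_key, cf1.name, cf1.status, cf2.notes
>         FROM cf1 LEFT JOIN cf2 USING (row_key)
>     """).fetchone()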
> 
> HBase is queried through a specialized API. This API is written to work over 
> very large datasets, working directly with the data. However, not all uses of 
> HBase need this. The majority of queries are distributed simply because they 
> run over a huge dataset, with a modest number of rows returned. Distribution 
> allows for much more parallel disk reading. For this case, a SQLite cluster 
> makes perfect sense.
> 
> Mapping all of this to SQLite, I could see a bit of work going a long way. 
> Column families can be implemented as separate files, which are ATTACHed and 
> joined as needed. The most complicated operation is a join, where we have to 
> send the list of distinct values of the join key to all other nodes for 
> join matching. We then have to move all of that data to the same node for the 
> join.
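> 
> A minimal sketch of the file-per-column-family idea, assuming cf1.db and 
> cf2.db already exist and each holds a table of the same name keyed by 
> row_key (all file, table, and column names here are invented):
> 
>     import sqlite3
> 
>     db = sqlite3.connect("cf1.db")                   # the most-used family
>     db.execute("ATTACH DATABASE 'cf2.db' AS cf2db")  # pull in another family
> 
>     # Join only the families this query actually needs.
>     rows = db.execute("""
>         SELECT cf1.row_key, cf1.status, extra.notes
>         FROM cf1
>         LEFT JOIN cf2db.cf2 AS extra ON extra.row_key = cf1.row_key
>         WHERE cf1.status = 'active'
>     """).fetchall()
> 
>     # For the distributed join, the distinct local join keys are what
>     # would be sent to the other nodes for matching.
>     keys = [r[0] for r in db.execute(
>         "SELECT DISTINCT row_key FROM cf1 WHERE status = 'active'")]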
> 
> The non-data input is a traditional SQL statement, but we will have to parse 
> and restructure the statement to join the needed column families. Also 
> needed is a way to ship a row to another server for processing.
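> 
> A naive illustration of that rewrite step (the column-to-family mapping and 
> the helper below are invented for the example):
> 
>     COLUMN_TO_CF = {"name": "cf1", "status": "cf1", "notes": "cf2"}
> 
>     def rewrite(select_cols, where_col=None):
>         """Turn a query on the logical table into a join over only the
>         column families it actually needs."""
>         cols = set(select_cols) | ({where_col} if where_col else set())
>         cfs = sorted({COLUMN_TO_CF[c] for c in cols})
>         sql = "SELECT " + ", ".join(select_cols) + " FROM " + cfs[0]
>         for cf in cfs[1:]:
>             sql += " LEFT JOIN " + cf + " USING (row_key)"
>         return sql
> 
>     # rewrite(["name", "notes"], "status")
>     # -> 'SELECT name, notes FROM cf1 LEFT JOIN cf2 USING (row_key)'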
> 
> I'm just putting this out there as me thinking out loud. I wonder how it 
> would turn out. Comments?

If you want NoSQL, look at Hypertable. It's C++, faster (in our environment) 
than Hadoop/HBase, and uses HQL, which is similar to SQL. It uses the same 
filesystem Hadoop uses.

About a SQLite cluster, it depends on what you need: distributing the data rows 
across the SQLite nodes, or having every SQLite db hold the same rows (like 
RAID0 and RAID1).
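
A rough sketch of the two layouts (the node list and routing functions are 
invented for the example):

    import zlib

    NODES = ["node0", "node1", "node2"]

    def route_raid0(row_key):
        # Each row lives on exactly one node, picked by hashing its key.
        return [NODES[zlib.crc32(row_key.encode()) % len(NODES)]]

    def route_raid1(row_key):
        # Every node holds every row, so writes go everywhere.
        return list(NODES)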

The problem you describe with distributing the data rows across the SQLite 
cluster (RAID0) could be minimized by applying the WHERE conditions to the 
tables before the join, except for those conditions that need more than one of 
the joined tables to evaluate. Then move the data, as you say, to the same 
node, for example the one holding the LEFT table with the most rows.
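
A minimal sketch of that pushdown, with invented table names: each node 
filters its local tables first, and only the surviving rows travel to the 
node that performs the join:

    import sqlite3

    node = sqlite3.connect(":memory:")       # stands in for one node's shard
    node.execute("CREATE TABLE orders (order_id INT, customer_id INT, total REAL)")
    node.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 10, 250.0), (2, 11, 30.0), (3, 10, 900.0)])

    # Apply the single-table WHERE condition before anything moves:
    shippable = node.execute(
        "SELECT order_id, customer_id, total FROM orders WHERE total > 100"
    ).fetchall()

    # Only 'shippable' (2 of the 3 rows here) is sent to the joining node.
    # Conditions that need columns from two joined tables, e.g.
    # orders.total > customers.credit_limit, can only be evaluated after
    # the join itself.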

The other approach (RAID1), where all SQLite servers hold the same data, is 
easier: it can be done with virtual tables and nanomsg/0mq to send the write 
locks (only one server can write to all) and the data, as sketched below. But 
if you need that, PostgreSQL doesn't need those tricks.
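
A very stripped-down sketch of the write-forwarding part, using sqlite3 plus 
pyzmq for the transport (the virtual table layer, locking, and error handling 
are left out, and the socket addresses are made up). On the single writer:

    import sqlite3, zmq

    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    pub.bind("tcp://*:5555")                 # replicas subscribe here
    db = sqlite3.connect("master.db")

    def write(sql, params=()):
        db.execute(sql, params)              # apply locally...
        db.commit()
        pub.send_json({"sql": sql, "params": list(params)})   # ...then broadcast

And on each read replica, a separate process applies whatever the writer 
broadcasts:

    import sqlite3, zmq

    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.connect("tcp://writer-host:5555")
    sub.setsockopt_string(zmq.SUBSCRIBE, "")
    replica = sqlite3.connect("replica.db")

    while True:
        msg = sub.recv_json()
        replica.execute(msg["sql"], msg["params"])
        replica.commit()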

---   ---
Eduardo Morras <emorr...@yahoo.es>
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
