[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-13 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856360#action_12856360
 ] 

Enis Soztutar commented on NUTCH-808:
-

bq. What do you mean by current implementation? NutchBase?
Indeed. The o.a.n.storage package deals with ORM (though not all of its classes do).

bq. I know that Cascading has various Tap/Sink implementations including 
JDBC, HBase but also SimpleDB. Maybe it would be worth having a look at how 
they do it?
The way Cascading does this is to convert Tuples (the Cascading data structure) 
to HBase/JDBC records. The schema for HBase/JDBC is given as metadata. Since 
they only deal with the tuple -> table row mapping, it is not that difficult. 
But again, Cascading does not allow for mapping lists to columns, etc. 
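
As a rough illustration of the tuple -> table row mapping described above, here is a minimal sketch. The class and method names are hypothetical and are not Cascading's API; the only assumption is that the schema metadata is an ordered list of column names, one per tuple position.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: these names are not Cascading's API.
final class TupleRowMapper {
    // Schema metadata: one column name per tuple position.
    private final List<String> columnNames;

    TupleRowMapper(List<String> columnNames) {
        this.columnNames = new ArrayList<>(columnNames);
    }

    // Pairs each tuple position with the column name at the same index.
    Map<String, Object> toRow(List<Object> tuple) {
        if (tuple.size() != columnNames.size()) {
            throw new IllegalArgumentException("tuple arity " + tuple.size()
                + " does not match schema arity " + columnNames.size());
        }
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < columnNames.size(); i++) {
            row.put(columnNames.get(i), tuple.get(i));
        }
        return row;
    }

    public static void main(String[] args) {
        TupleRowMapper mapper = new TupleRowMapper(List.of("url", "status", "fetchTime"));
        List<Object> tuple = List.of("http://example.org/", 200, 1271116800000L);
        System.out.println(mapper.toRow(tuple));
    }
}
```

Because each tuple position maps to exactly one column, a list-valued field has no natural place in this scheme other than a single serialized cell.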

bq. My gut feeling would be to write a custom framework instead of relying on 
DataNucleus and use Avro if possible. I really think that HBase support is 
urgently needed but am less convinced that we need MySQL in the very short 
term. 
Yeah, the more I think about it, the more I come to terms with a custom 
implementation. However, I think we might benefit a lot from the ideas in JDO 
in the long term. Also, a JDBC implementation may not be relevant for large-scale 
deployments, but it will be a very nice side effect of the ORM layer: it 
will allow easy deployment, which in turn will hopefully bring more users. 

> Evaluate ORM Frameworks which support non-relational column-oriented 
> datastores and RDBMs 
> --
>
> Key: NUTCH-808
> URL: https://issues.apache.org/jira/browse/NUTCH-808
> Project: Nutch
> Issue Type: Task
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Fix For: 2.0
>
>
> We have an ORM layer in the NutchBase branch, which uses the Avro Specific 
> Compiler to compile class definitions given in JSON. Before moving on with 
> this, we might benefit from evaluating other frameworks to see whether they 
> suit our needs. 
> We want at least the following capabilities:
> - Using POJOs 
> - Able to persist objects to at least HBase, Cassandra, and RDBMSs 
> - Able to efficiently serialize objects as task outputs from Hadoop jobs
> - Allow native queries, along with standard queries 
> Any comments or suggestions for other frameworks are welcome.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-13 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856349#action_12856349
 ] 

Julien Nioche commented on NUTCH-808:
-

Hi Enis,

{quote}
On the other hand, current implementation is ...
{quote}

What do you mean by current implementation? NutchBase?

My gut feeling would be to write a custom framework instead of relying on 
DataNucleus and use Avro if possible. I really think that HBase support is 
urgently needed but am less convinced that we need MySQL in the very short 
term. 

I know that Cascading has various Tap/Sink implementations including JDBC, 
HBase but also SimpleDB. Maybe it would be worth having a look at how they do 
it?





[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-12 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856124#action_12856124
 ] 

Enis Soztutar commented on NUTCH-808:
-

So, these are the results so far: 

DataNucleus was previously known as JPOX, and it was the reference 
implementation for Java Data Objects (JDO). JDO is a Java standard for 
persistence. A similar specification, JPA, is also a persistence standard, 
which was forked from EJB 3. However, JPA is designed for RDBMSs only, so it 
will not be useful for us 
(http://www.datanucleus.org/products/accessplatform/persistence_api.html). 

In JDO, the first step is to define the domain objects as POJOs. Then, the 
persistence metadata is specified using annotations, XML, or both. Then a 
byte code enhancer uses instrumentation to add the required methods to the 
classes marked as @PersistenceCapable. The database tables can be created by 
hand, automatically by DataNucleus, or by using a tool (SchemaTool). 
The persistence layer uses standard JDO syntax, which is similar to JDBC. The 
objects can be queried using JPQL. 

I have run a small test to persist objects of the WebTableRow class (from the 
NutchBase branch) to both MySQL and HBase. Although it took me a fair bit of 
time to set up, I was able to persist objects to both. 

However, although it is possible to map complex fields (like lists, maps, 
arrays, etc.) to RDBMSs using different strategies (such as serializing 
directly, using joins, or using foreign keys), I was not able to find a way to 
leverage the HBase data model. For example, we want to be able to map lists and 
maps to columns in column families. Without such functionality, using 
column-oriented stores does not bring any advantage. 
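
To make the desired mapping concrete, here is a minimal sketch of spreading a map-valued field over the columns of one column family, with one cell per entry. All names are hypothetical, plain Java maps stand in for an HBase row, and no HBase client API is used; the "family:qualifier" column naming is the only HBase convention assumed.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: plain maps stand in for an HBase row; no HBase client API is used.
final class ColumnFamilyMapper {

    // Writes one cell per map entry: column name = family + ":" + entry key.
    static Map<String, byte[]> mapToCells(String family, Map<String, String> field) {
        Map<String, byte[]> cells = new TreeMap<>();
        for (Map.Entry<String, String> e : field.entrySet()) {
            cells.put(family + ":" + e.getKey(),
                      e.getValue().getBytes(StandardCharsets.UTF_8));
        }
        return cells;
    }

    // Rebuilds the map field from the cells that belong to the given family.
    static Map<String, String> cellsToMap(String family, Map<String, byte[]> cells) {
        String prefix = family + ":";
        Map<String, String> field = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : cells.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                field.put(e.getKey().substring(prefix.length()),
                          new String(e.getValue(), StandardCharsets.UTF_8));
            }
        }
        return field;
    }

    public static void main(String[] args) {
        Map<String, String> outlinks = new LinkedHashMap<>();
        outlinks.put("http://a.example/", "anchor A");
        outlinks.put("http://b.example/", "anchor B");
        Map<String, byte[]> cells = mapToCells("outlinks", outlinks);
        System.out.println(cells.keySet());
        System.out.println(cellsToMap("outlinks", cells));
    }
}
```

With this layout a single entry can be read or updated without deserializing the whole field, which is the advantage that is lost when the map is stored as one serialized blob.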

For the byte[] serialization for MapReduce, we can either implement a new 
datastore for DataNucleus, which also implements Hadoop's Serialization, or use 
Avro to generate Java classes to be fed into the JPOX enhancer, or else 
manually implement Writable. 
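
The last option, a hand-written Writable, can be sketched as follows. Hadoop's Writable interface consists of write(DataOutput) and readFields(DataInput); the class below mirrors those two methods but deliberately does not implement the interface, so that the sketch compiles without Hadoop on the classpath. The class name and its fields are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical row class. It mirrors the two methods of Hadoop's Writable
// interface but does not implement it, so no Hadoop dependency is needed.
final class WritableStyleRow {
    String url = "";
    int status;
    long fetchTime;

    // Serializes the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeInt(status);
        out.writeLong(fetchTime);
    }

    // Reads the fields back in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        status = in.readInt();
        fetchTime = in.readLong();
    }

    // Convenience helper: serialize to an in-memory byte[].
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            write(new DataOutputStream(buffer));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Convenience helper: rebuild a row from a byte[].
    static WritableStyleRow fromBytes(byte[] bytes) {
        WritableStyleRow row = new WritableStyleRow();
        try {
            row.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return row;
    }

    public static void main(String[] args) {
        WritableStyleRow row = new WritableStyleRow();
        row.url = "http://example.org/";
        row.status = 200;
        row.fetchTime = 1271116800000L;
        WritableStyleRow copy = fromBytes(row.toBytes());
        System.out.println(copy.url + " " + copy.status + " " + copy.fetchTime);
    }
}
```

The cost of this approach is that every persistent class needs such a pair of methods kept in sync by hand, which is what generating the classes with Avro would avoid.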

To sum up, DataNucleus brings the following advantages:
- out of the box RDBMs support 
- XML or annotation metadata
- JDO is a Java standard 
- standard query interface
- JSON support

The disadvantages of using DataNucleus would be:
- JDO is rather complex, and implementing a datastore is not trivial
- We need to write patches to DataNucleus to flexibly map complex fields and 
leverage HBase's data model
- We have no control over the source code
- no native HBase support (for example, using filters, etc.)

On the other hand, the current implementation 
- is tested in production, 
- can leverage the HBase data model, 
- can be modified to work with Avro serialization directly, 
- could gain Cassandra support with little effort, 
- can support multiple languages (in the future).

I believe that having SQLite, MySQL and HBase support is critical for Nutch 
2.0, for out-of-the-box use, ease of deployment and real-scale computing 
respectively. But obviously we cannot use DataNucleus out of the box either. 


ORM is inherently a hard problem. I propose we go ahead and make the changes to 
DataNucleus to see if it is feasible, and continue with it if it suits our 
needs. Of course, having a custom framework will also be great, so any feedback 
would be more than welcome. 





[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-02 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852840#action_12852840
 ] 

Enis Soztutar commented on NUTCH-808:
-

A candidate framework is DataNucleus. It has the following benefits: 

- Apache 2 license 
- JDO support 
- HBase, RDBMS, and XML persistence 

I will further investigate whether we can integrate Hadoop Writables/Avro 
serialization so that objects can be passed between MapReduce tasks. 


