[jira] [Commented] (HIVE-4044) Add URL type

2013-06-30 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696306#comment-13696306
 ] 

Ashutosh Chauhan commented on HIVE-4044:


I didn't know this, but apparently the SQL standard defines a DATALINK type 
that looks and feels very much like a URL: 
http://wiki.postgresql.org/wiki/DATALINK. So, if standard compliance is a goal, 
we may need to add this eventually, though at that point it would be better to 
call it DATALINK instead of URL.

 Add URL type
 

 Key: HIVE-4044
 URL: https://issues.apache.org/jira/browse/HIVE-4044
 Project: Hive
  Issue Type: Improvement
Reporter: Samuel Yuan
Assignee: Samuel Yuan
 Attachments: HIVE-4044.HIVE-4044.HIVE-4044.D8799.1.patch


 Having a separate type for URLs would enable improvements in storage 
 efficiency based on breaking up a URL into its components. The new type will 
 be named URL and made a non-reserved keyword (see HIVE-701).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4044) Add URL type

2013-06-12 Thread Samuel Yuan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681573#comment-13681573
 ] 

Samuel Yuan commented on HIVE-4044:
---

I tried breaking the URL into parts and encoding them as individual columns; 
the dictionary shrank, but the overhead of the other ORC columns introduced 
(mostly the column of indices) made a bigger impact, so compression was 
actually worse overall. I also tried storing the query string as a map and 
putting common keys into separate columns; this improved compression somewhat, 
but still not enough to offset the overhead of new columns for the query string.
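
As a rough illustration of the splitting described above (this is not the code 
from the patch), the sketch below breaks a URL into the pieces that could be 
stored as separate columns, with the query string turned into a map. The class 
name UrlComponents, the "query." key prefix, and the example URL are 
assumptions made for this sketch; it relies only on java.net.URI from the JDK.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the HIVE-4044 patch. */
final class UrlComponents {

    /**
     * Splits a URL into scheme, host, path, and one entry per query parameter.
     * Each map entry corresponds to a value that could go into its own column.
     */
    static Map<String, String> split(String url) {
        // URI.create throws IllegalArgumentException on malformed input.
        URI uri = URI.create(url);
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("scheme", uri.getScheme());
        parts.put("host", uri.getHost());
        parts.put("path", uri.getRawPath());

        String query = uri.getRawQuery();
        if (query != null && !query.isEmpty()) {
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                String key = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                // Assumed naming scheme: prefix query keys to avoid clashes.
                parts.put("query." + key, value);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("http://example.com/a/b?id=7&ref=home"));
    }
}
```

Every extra entry in the map is an extra column in this scheme, which is where 
the per-column overhead mentioned above would come from.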



[jira] [Commented] (HIVE-4044) Add URL type

2013-06-10 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679659#comment-13679659
 ] 

Owen O'Malley commented on HIVE-4044:
-

We could actually implement this as a different encoding of string columns that 
recognizes URLs and breaks them down into individual parts. Another approach 
would be to use a trie encoding for the dictionary. That would provide much of 
the same benefit and would likely be a general win.
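
To make the trie idea concrete, here is a minimal sketch of a character trie 
used as a string dictionary. It is only an illustration under stated 
assumptions: the class name DictionaryTrie and its methods are hypothetical, 
and it does not reflect how ORC actually lays out its dictionary. The point it 
shows is that values sharing a prefix, as URLs usually do, store that prefix 
only once.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration; not ORC's actual dictionary implementation. */
final class DictionaryTrie {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal; // true if a dictionary entry ends at this node
    }

    private final Node root = new Node();
    private int nodes;   // one node per stored character
    private int entries; // number of distinct values

    /** Inserts a value; returns true if it was not already present. */
    boolean add(String value) {
        Node node = root;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
                nodes++;
            }
            node = next;
        }
        boolean added = !node.terminal;
        node.terminal = true;
        if (added) {
            entries++;
        }
        return added;
    }

    /** Returns true only if the exact value was added earlier. */
    boolean contains(String value) {
        Node node = root;
        for (int i = 0; i < value.length() && node != null; i++) {
            node = node.children.get(value.charAt(i));
        }
        return node != null && node.terminal;
    }

    int nodeCount() { return nodes; }

    int size() { return entries; }

    public static void main(String[] args) {
        DictionaryTrie trie = new DictionaryTrie();
        trie.add("http://a.com/x");
        trie.add("http://a.com/y");
        System.out.println(trie.size() + " entries, " + trie.nodeCount() + " nodes");
    }
}
```

In this example the two 14-character values need 15 nodes rather than 28 stored 
characters, because the shared prefix "http://a.com/" is kept once. A real 
encoding would also have to account for the cost of child pointers, which this 
sketch ignores.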



[jira] [Commented] (HIVE-4044) Add URL type

2013-02-27 Thread Samuel Yuan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1357#comment-1357
 ] 

Samuel Yuan commented on HIVE-4044:
---

You're right, the idea is that it will enable better encoding of URLs. Kevin 
found that breaking up the URL into its components and storing them as separate 
columns results in significant space savings. The original plan was to 
implement this idea with RCFile, but with the new ORC file format I decided to 
wait for that instead, and to submit this part separately.

However, it looks like the improvements in the ORC file format have erased any 
gains we would have gotten by breaking up URLs into their individual 
components, so this won't be needed any more.



[jira] [Commented] (HIVE-4044) Add URL type

2013-02-23 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585177#comment-13585177
 ] 

Ashutosh Chauhan commented on HIVE-4044:


URL is an unusual type to add to a query processing engine. Can you spell out 
the motivation for adding this type? (You can always use the string type for 
URLs.) I am assuming from your description above that it might improve storage 
efficiency through better encoding of URLs. But I see the following comment in 
LazyBinaryURL:
/**
 * The serialization of LazyBinaryURL is the same as the binary representation
 * of the underlying string
 */
and also URLWritable has
{code}
  @Override
  public void write(DataOutput out) throws IOException {
    if (url != null) {
      byte[] bytes = url.toString().getBytes();
      WritableUtils.writeVInt(out, bytes.length);
      out.write(bytes);
    } else {
      WritableUtils.writeVInt(out, 0);
    }
  }
{code}

So, it seems like you are storing URLs as strings anyway, both for the 
intermediate data of MR and for the output of the query. So I don't see how 
this results in better storage efficiency.
