[jira] [Commented] (HIVE-4044) Add URL type
[ https://issues.apache.org/jira/browse/HIVE-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13696306#comment-13696306 ]

Ashutosh Chauhan commented on HIVE-4044:

I didn't know this, but apparently the SQL standard defines a DATALINK type that looks and feels very much like a URL: http://wiki.postgresql.org/wiki/DATALINK

So, if standard compliance is a goal, we may need to add this eventually. At that point, though, it would be better to call it DATALINK instead of URL.

Add URL type

Key: HIVE-4044
URL: https://issues.apache.org/jira/browse/HIVE-4044
Project: Hive
Issue Type: Improvement
Reporter: Samuel Yuan
Assignee: Samuel Yuan
Attachments: HIVE-4044.HIVE-4044.HIVE-4044.D8799.1.patch

Having a separate type for URLs would enable improvements in storage efficiency based on breaking up a URL into its components. The new type will be named URL and made a non-reserved keyword (see HIVE-701).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4044) Add URL type
[ https://issues.apache.org/jira/browse/HIVE-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13681573#comment-13681573 ]

Samuel Yuan commented on HIVE-4044:

I tried breaking the URL into parts and encoding them as individual columns. The dictionary shrank, but the overhead of the additional ORC columns this introduced (mostly the column of indices) had a bigger impact, so compression was actually worse overall.

I also tried storing the query string as a map and putting common keys into separate columns. This improved compression somewhat, but still not enough to offset the overhead of the new columns for the query string.
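For readers who want a concrete picture of the experiment described in the comment above, here is a minimal sketch of the decomposition step only, assuming java.net.URI is an acceptable parser. The class UrlParts and its method names are hypothetical and do not come from the HIVE-4044 patch. The sketch does not write ORC columns or measure compression.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hive: splits a URL into the pieces
// that could each be stored in their own column.
final class UrlParts {

    /** Splits a URL into named components; absent components become "". */
    public static Map<String, String> split(String url) {
        URI uri = URI.create(url);
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("scheme", nullToEmpty(uri.getScheme()));
        parts.put("host", nullToEmpty(uri.getHost()));
        parts.put("path", nullToEmpty(uri.getRawPath()));
        parts.put("query", nullToEmpty(uri.getRawQuery()));
        return parts;
    }

    /** Parses a raw query string such as "k1=v1&k2=v2" into an ordered map. */
    public static Map<String, String> queryToMap(String rawQuery) {
        Map<String, String> map = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return map;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                map.put(pair, "");               // key without a value
            } else {
                map.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return map;
    }

    private static String nullToEmpty(String s) {
        return s == null ? "" : s;
    }
}
```

Calling UrlParts.split on an illustrative input such as http://example.com/a/b?k1=v1&k2=v2 yields one value per component, and UrlParts.queryToMap on its query component yields the key/value pairs that the second experiment would place in a map column or in per-key columns. Each extra column carries its own per-column overhead, which is the cost the comment reports as outweighing the smaller dictionary.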
[jira] [Commented] (HIVE-4044) Add URL type
[ https://issues.apache.org/jira/browse/HIVE-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13679659#comment-13679659 ]

Owen O'Malley commented on HIVE-4044:

We could actually implement this as a different encoding of string columns that recognizes string columns and breaks them down into individual parts.

Another approach would be to use a trie encoding for the dictionary. That would have a lot of the same value and would likely be a general win.
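To illustrate why a trie could capture much of the same benefit, here is a minimal sketch of a dictionary kept as a per-character trie, in which strings that share a prefix share the nodes for that prefix. The class TrieDictionary and its methods are hypothetical. This is not ORC's dictionary encoding, and a real encoding would also need a compact serialized form, which is not shown.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not ORC code: a dictionary that assigns ids
// to distinct strings and stores shared prefixes only once.
final class TrieDictionary {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int id = -1; // dictionary id if a string ends at this node, else -1
    }

    private final Node root = new Node();
    private int nextId = 0;
    private int nodeCount = 0;

    /** Adds the value if it is absent and returns its dictionary id. */
    public int add(String value) {
        Node node = root;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            Node child = node.children.get(c);
            if (child == null) {
                child = new Node();
                node.children.put(c, child);
                nodeCount++; // one new node per character not already shared
            }
            node = child;
        }
        if (node.id < 0) {
            node.id = nextId++;
        }
        return node.id;
    }

    /** Number of distinct strings stored. */
    public int size() {
        return nextId;
    }

    /** Number of character nodes allocated, excluding the root. */
    public int nodeCount() {
        return nodeCount;
    }
}
```

Adding http://a.com/x and http://a.com/y, for example, allocates 15 character nodes rather than the 28 characters a flat list of the two strings would hold, because the 13-character prefix is stored once. Unlike splitting URLs into columns, this applies to any string column with common prefixes, which is presumably the sense in which it would be a general win.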
[jira] [Commented] (HIVE-4044) Add URL type
[ https://issues.apache.org/jira/browse/HIVE-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1357#comment-1357 ]

Samuel Yuan commented on HIVE-4044:

You're right, the idea is that it will enable better encoding of URLs. Kevin found that breaking up the URL into its components and storing them as separate columns results in significant space savings.

The original plan was to implement this idea with RCFile, but with the new ORC file format I decided to wait for that instead, and to submit this part separately. However, it looks like the improvements of the ORC file have erased any gains we would have gotten by breaking up URLs into the individual components, so this won't be needed any more.
[jira] [Commented] (HIVE-4044) Add URL type
[ https://issues.apache.org/jira/browse/HIVE-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585177#comment-13585177 ]

Ashutosh Chauhan commented on HIVE-4044:

URL is an unusual type to add to a query processing engine. Can you spell out the motivation for adding this type? (For example, you can always use the string type for URLs.) I am assuming from your description above that it might improve storage efficiency through a better encoding of URLs. But I see the following comment in LazyBinaryURL:

/**
 * The serialization of LazyBinaryURL is the same as the binary representation
 * of the underlying string
 */

and URLWritable also has

{code}
@Override
public void write(DataOutput out) throws IOException {
  if (url != null) {
    byte[] bytes = url.toString().getBytes();
    WritableUtils.writeVInt(out, bytes.length);
    out.write(bytes);
  } else {
    WritableUtils.writeVInt(out, 0);
  }
}
{code}

So it seems you are storing URLs as strings anyway, both for the intermediate data of MR and for the output of the query. I don't see how that results in better storage efficiency.