I am using the importtsv tool to ingest data and have some doubts. I am using HBase 1.1.5.
First, does it ingest non-string/numeric values? I was referring to this link <http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/> detailing importtsv in the Cloudera distribution. It says: "it interprets everything as strings", and I was wondering what that means. I am using a simple word-count example where the first column is a word and the second column is the word count.

When my file looks like this (count surrounded by double quotes):

    "access","1"
    "about","1"

ingesting and then running scan in the hbase shell gives the following output:

    about    column=f:count, timestamp=1467716881104, value="1"
    access   column=f:count, timestamp=1467716881104, value="1"

When the double quotes surrounding the count are removed:

    "access",1
    "about",1

the scan output likewise has no double quotes around the count's value:

    about    column=f:count, timestamp=1467716881104, value=1
    access   column=f:count, timestamp=1467716881104, value=1

*Q1. Does that mean it is stored as an integer and not as a string?*

The Cloudera article suggests that a custom MR job needs to be written for ingesting non-string values. However, I am not able to understand what that means if the above is already ingesting integer values.

Another doubt I have is whether I can escape the column separator when it appears inside a column value. In importtsv we can specify the separator as follows:

    -Dimporttsv.separator=,

But what if I have employee data where the first column is the employee name and the second column is the address? My file will have rows resembling this:

    "mahesh","A6,Hyatt Appartment"

That second comma makes importtsv think there are three columns, and it throws BadTsvLineException("Excessive columns"). So I tried escaping the comma with a backslash ('\'), and, just out of curiosity, escaping the backslash with another backslash (that is, "\\").
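To make Q1 concrete, here is a sketch of the byte-level difference I am asking about. I am assuming (based on the Cloudera article) that importtsv stores the raw UTF-8 bytes of each text field, and the struct line below only mimics what Bytes.toBytes(int) in the HBase Java API would produce; neither is confirmed behavior of my setup:

```python
import struct

# My assumption: importtsv writes the cell value as the raw UTF-8 bytes of
# the field text, so the count "1" is the single byte 0x31 (ASCII '1').
as_string = "1".encode("utf-8")

# A "real" integer, as Bytes.toBytes(int) in the HBase Java API encodes it,
# would instead be a 4-byte big-endian value.
as_int = struct.pack(">i", 1)

print(as_string.hex())  # 31
print(as_int.hex())     # 00000001
```

So when the shell shows value=1 without quotes, I cannot tell whether the stored bytes are 0x31 or 0x00000001, which is what Q1 is really asking.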
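One workaround I considered (not something importtsv itself supports, as far as I know) is to pre-parse the quoted CSV properly and re-emit it with a separator that cannot occur in the data, such as a tab, so importtsv sees exactly two columns. A sketch using Python's csv module:

```python
import csv
import io

# Example row where the address contains the comma separator.
raw = '"mahesh","A6,Hyatt Appartment"\n'

# csv.reader understands the double quotes, so the embedded comma
# stays inside the second field instead of splitting it.
rows = list(csv.reader(io.StringIO(raw)))

# Re-emit tab-separated; importtsv's default separator is the tab.
safe_lines = ["\t".join(fields) for fields in rows]
print(safe_lines[0])  # mahesh<TAB>A6,Hyatt Appartment
```

That avoids the escaping question entirely, but I would still like to know whether importtsv itself can escape the separator.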
So my file had the following lines:

    "able","1\"
    "z","1\"
    "za","1\\1"

When I ran scan in the hbase shell, it gave the following output:

    able    column=f:count, timestamp=1467716881104, value="1\x5C"
    z       column=f:count, timestamp=1467716881104, value="1\x5C"
    za      column=f:count, timestamp=1467716881104, value="1\x5C\x5C1"

*Q2. So it seems that instead of escaping the character following the backslash, it encodes the backslash as "\x5C". Is that so? Is there no way to escape the column separator while bulk loading data using importtsv?*

--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Escaping-separator-in-data-while-bulk-loading-using-importtsv-tool-and-ingesting-numeric-values-tp4081081.html
Sent from the HBase User mailing list archive at Nabble.com.