Hi,

After streaming Twitter data to HDFS using Flume, I'm trying to analyze it
with Hive queries. The data is in JSON format but is not clean: some records
have double quotes (") in the wrong places, which causes the Hive queries to
fail. I am getting the following error:

Failed with exception
java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException:
org.codehaus.jackson.JsonParseException: Unexpected end-of-input: was
expecting closing '"' for name

The script used for creating the external table:

ADD JAR /usr/local/hive/apache-hive-1.2.1-bin/lib/hive-serdes-1.0-SNAPSHOT.jar;
SET hive.support.sql11.reserved.keywords = false;
CREATE EXTERNAL TABLE tweets (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweet_count INT,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/usr/local/hadoop/bin/tweets';

Since I don't know in advance which rows contain the extra double quotes, I
can't add an escape character by hand. How can I escape the junk characters
and process the data successfully?
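
For reference, one direction that might sidestep the SerDe failure is
sketched below. It is an untested sketch under several assumptions: the
Flume output is plain text with one tweet per line, tweets_raw is a
hypothetical staging table name, and the three selected fields are only
examples. It relies on Hive's built-in get_json_object returning NULL for
input it cannot parse, so malformed rows would be filtered out rather than
aborting the whole query.

```sql
-- Hypothetical staging table (name is an assumption): each line of the
-- Flume output is read as a single raw string, so no JSON parsing happens
-- at read time and a bad record cannot raise a SerDeException.
CREATE EXTERNAL TABLE tweets_raw (line STRING)
LOCATION '/usr/local/hadoop/bin/tweets';

-- get_json_object returns NULL when the input is not valid JSON, so the
-- WHERE clause keeps only rows that parse and that carry an id.
SELECT
  CAST(get_json_object(line, '$.id') AS BIGINT) AS id,
  get_json_object(line, '$.text')               AS tweet_text,
  get_json_object(line, '$.user.screen_name')   AS screen_name
FROM tweets_raw
WHERE get_json_object(line, '$.id') IS NOT NULL;
```

Counting the rows where get_json_object(line, '$.id') IS NULL would show
how many records are affected, which may help decide whether dropping them
is acceptable or whether the data needs to be repaired upstream.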

Appreciate any help.

Thanks,

Joel
