Hi,

I was exploring the possibility of using CTAS (CREATE TABLE AS SELECT) with spark-sql (Spark 1.3.1) to save big query results into CSV-formatted files for offline viewing. These are the two things I tried:
1. CREATE TABLE IF NOT EXISTS csv_dump27 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/data/offline/' AS SELECT X, Y FROM tableName WHERE timestamp=1427094000;

2. CREATE TABLE IF NOT EXISTS csv_dump44 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS PARQUET LOCATION '/data/offline/' AS SELECT X, Y FROM tableName WHERE timestamp=1427094000;

Here are my observations:

Case 1 - The table is created in Hive and I can see the CSV data at the provided path. But when I run queries over this table through spark-sql, I get exceptions (stack trace below).

Case 2 - The table is created in Hive and I can see the Parquet files. Strangely, this time I am able to query the table through spark-sql without any exception. This does not help me, though, since saving the data as Parquet defeats my purpose of offline viewing in CSV.

So my question is: is CTAS supported in spark-sql with TEXTFILE storage?

On some debugging, I saw that in the text case the "dir" (directory) variable passed to TextInputFormat contains the comma-joined paths of every part file under that directory, and StringUtils.split is not able to split it back into individual paths:

hdfs://NN-199:9000/data/offline/part-00000\,hdfs://NN-199:9000/data/offline/part-00001\,hdfs://NN-199:9000/data/offline/part-00002\, ...

In the Parquet case, however, the "dir" variable is correctly passed as just the top-level directory:

hdfs://NN-199:9000/data/offline/csv_dump44

Thanks,
Kashish Jain

P.S.: Stack trace

java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 10: part-00000,hdfs:
    at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    at org.apache.hadoop.fs.Path.<init>(Path.java:172)
    at org.apache.hadoop.fs.Path.<init>(Path.java:94)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:211)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1556)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:825)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:88)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:423)
    at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 10: part-00000,hdfs:
    at java.net.URI$Parser.fail(Unknown Source)
    at java.net.URI$Parser.checkChars(Unknown Source)
    at java.net.URI$Parser.parse(Unknown Source)
    at java.net.URI.<init>(Unknown Source)
    at org.apache.hadoop.fs.Path.initialize(Path.java:203)
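
P.P.S.: For what it's worth, the URI error by itself is easy to reproduce outside Spark. Below is a minimal standalone sketch (the path is made up along the lines of the ones above): when the comma-joined string reaches java.net.URI without being split, everything before the first ':' is taken as the scheme, and the ',' at index 10 is rejected with the same message as in the trace.

import java.net.URI

object UriSplitRepro {
  def main(args: Array[String]): Unit = {
    // "part-00000,hdfs" is parsed as the URI scheme because the first ':'
    // comes before any '/'. The ',' at index 10 is an illegal scheme
    // character, so this throws:
    //   java.net.URISyntaxException: Illegal character in scheme name
    //   at index 10: part-00000,hdfs://...
    new URI("part-00000,hdfs://NN-199:9000/data/offline/part-00001")
  }
}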
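
In the meantime, the workaround I am considering is to skip CTAS for the text case and write the CSV myself from the DataFrame API. This is only a rough, untested sketch using the same query as above; it does not quote embedded commas or handle nulls specially:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CsvDump {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-dump"))
    val sqlContext = new HiveContext(sc)

    // Run the same SELECT as in the CTAS statement...
    val df = sqlContext.sql(
      "SELECT X, Y FROM tableName WHERE timestamp=1427094000")

    // ...and write it out as comma-separated text directly, avoiding the
    // text-backed Hive table altogether.
    df.rdd
      .map(row => row.toSeq.mkString(","))
      .saveAsTextFile("hdfs://NN-199:9000/data/offline/csv_dump")
  }
}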