[ https://issues.apache.org/jira/browse/SPARK-21661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454070#comment-16454070 ]
Li Yuanjian commented on SPARK-21661:
-------------------------------------

Got it.

> SparkSQL can't merge load table from Hadoop
> -------------------------------------------
>
>                 Key: SPARK-21661
>                 URL: https://issues.apache.org/jira/browse/SPARK-21661
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Dapeng Sun
>            Assignee: Li Yuanjian
>            Priority: Major
>             Fix For: 2.3.0
>
> Here is the original listing of the external table on HDFS:
> {noformat}
> Permission  Owner  Group       Size   Last Modified          Replication  Block Size  Name
> -rw-r--r--  root   supergroup  0 B    8/6/2017, 11:43:03 PM  3            256 MB      income_band_001.dat
> -rw-r--r--  root   supergroup  0 B    8/6/2017, 11:39:31 PM  3            256 MB      income_band_002.dat
> ...
> -rw-r--r--  root   supergroup  327 B  8/6/2017, 11:44:47 PM  3            256 MB      income_band_530.dat
> {noformat}
> After a SparkSQL load, every input file produces its own output file, even when the input file is 0 B. When Hive performs the load, the data files are merged according to the size of the original files.
> To reproduce:
> {noformat}
> CREATE EXTERNAL TABLE t1 (a int, b string) STORED AS TEXTFILE LOCATION
> "hdfs://xxx:9000/data/t1";
> CREATE TABLE t2 STORED AS PARQUET AS SELECT * FROM t1;
> {noformat}
> The resulting table t2 contains many small files with no data.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
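The Hive behavior the report contrasts with — merging input files by size so that empty or tiny files do not each become an output file — can be sketched as a greedy size-based grouping. This is an illustrative assumption in plain Python, not Hive's actual CombineFileInputFormat implementation; the file names and the 256 MB limit are taken from the HDFS listing above.

```python
# Illustrative sketch (assumption): greedy packing of (name, size) pairs
# into groups capped at the 256 MB block size from the listing above.
MAX_SPLIT_BYTES = 256 * 1024 * 1024

def merge_by_size(files, max_bytes=MAX_SPLIT_BYTES):
    """Pack (name, size_in_bytes) pairs into groups of at most max_bytes.

    Empty files add nothing to a group's running size, so they are
    absorbed into an existing group instead of producing their own
    output file -- the behavior the issue asks SparkSQL to match.
    """
    groups, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_bytes:
            groups.append(current)          # current group is full; start a new one
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)              # flush the last partial group
    return groups

files = [("income_band_001.dat", 0),
         ("income_band_002.dat", 0),
         ("income_band_530.dat", 327)]
# All three tiny files fit well under 256 MB, so they form a single group,
# i.e. one merged output file rather than three.
print(merge_by_size(files))
```

Under this grouping, the 530 mostly-empty `income_band_*.dat` files would collapse into a single output file, instead of the one-output-per-input behavior observed with the SparkSQL load.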