[jira] [Updated] (SPARK-3720) support ORC in spark sql

2014-10-15 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-3720:
--
Attachment: orc.diff

This is the diff for ORC file support. I have been working on this item in 
parallel under SPARK-2883. Since a PR is already open for this JIRA, I attach 
my diff here and will work with wangfei to consolidate our work on this 
feature. A detailed description follows.

1. Basic operators: saveAsOrcFile and orcFile. The former saves a table as an 
ORC file, and the latter imports an ORC file into a Spark SQL table.
2. Column pruning.
3. Self-contained schema support: the ORC support is fully functional and 
independent of the Hive metastore. The table schema is maintained by the ORC 
file itself.
4. To use ORC files, the user needs to import org.apache.spark.sql.hive.orc._ 
to bring ORC support into the context.
5. ORC files are handled through HiveContext purely because of a packaging 
issue: we do not want to bring the Hive dependency into core Spark SQL. Note 
that ORC operations do not rely on the Hive metastore.
6. Full support for the complex data types in Spark SQL, for example list, seq, 
and nested types (see the sketch after this list).
7. Because the feature lives in HiveContext, the SQL parser used is actually 
the Hive parser.
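
For illustration, here is a rough sketch of a nested-type round trip using the 
same saveAsOrcFile/orcFile API shown in the full example further below; the 
schema, data, and file name here are only illustrative, not part of the diff:

import org.apache.spark.sql._
import org.apache.spark.sql.hive.orc._

val ctx = new org.apache.spark.sql.hive.HiveContext(sc)
// Illustrative nested schema: a name plus a list of phone numbers
val nestedSchema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("phones", ArrayType(StringType, true), true)))
val rows = sc.parallelize(Seq(
  Row("Michael", Seq("555-0100", "555-0101")),
  Row("Andy", Seq("555-0102"))))
val nestedSchemaRDD = ctx.applySchema(rows, nestedSchema)
// The ArrayType column should round-trip through ORC along with the flat column
nestedSchemaRDD.saveAsOrcFile("people_nested.orc")
ctx.orcFile("people_nested.orc").registerTempTable("people_nested")
ctx.sql("SELECT name, phones FROM people_nested").collect().foreach(println)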

Hive 0.13.1 support.
With a minor change, once Spark's Hive dependency is upgraded to 0.13.1:
1. ORC can use different compression codecs, e.g., SNAPPY, LZO, ZLIB, and NONE 
(a hedged configuration sketch follows).
2. Predicate pushdown.
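
A rough sketch of how the compression codec might be selected: whether the 
patch honors Hive's hive.exec.orc.default.compress setting is an assumption 
here, not something the diff guarantees.

import org.apache.spark.sql.hive.HiveContext

val ctx = new HiveContext(sc)
// Assumption: the ORC writer picks up Hive's default ORC compression setting
ctx.setConf("hive.exec.orc.default.compress", "SNAPPY")
// Subsequent saveAsOrcFile calls would then write SNAPPY-compressed files
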
The following example shows how to use an ORC file; from the user's 
perspective it is almost identical to the existing Parquet support (a short 
Parquet comparison follows the example).
import org.apache.spark.sql.hive.orc._

val ctx = new org.apache.spark.sql.hive.HiveContext(sc)
val people = sc.textFile("examples/src/main/resources/people.txt")
val schemaString = "name age"
import org.apache.spark.sql._
// Build the schema programmatically from the space-separated field names
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Parse each line of the text file into a Row
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleSchemaRDD = ctx.applySchema(rowRDD, schema)
peopleSchemaRDD.registerTempTable("people")
val results = ctx.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
// Save the table as an ORC file, then read it back and query it
peopleSchemaRDD.saveAsOrcFile("people.orc")
val orcFile = ctx.orcFile("people.orc")
orcFile.registerTempTable("orcFile")
val teenagers = ctx.sql("SELECT name FROM orcFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
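
For comparison, the existing Parquet path in Spark 1.1 uses the analogous 
calls; only the method names differ:

// Parquet analogue of the ORC calls above
peopleSchemaRDD.saveAsParquetFile("people.parquet")
val parquetFile = ctx.parquetFile("people.parquet")
parquetFile.registerTempTable("parquetFile")
ctx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")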

 support ORC in spark sql
 

 Key: SPARK-3720
 URL: https://issues.apache.org/jira/browse/SPARK-3720
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Attachments: orc.diff


 The Optimized Row Columnar (ORC) file format provides a highly efficient way 
 to store data on HDFS. The ORC file format has many advantages, such as:
 1. A single file as the output of each task, which reduces the NameNode's load.
 2. Hive type support, including datetime, decimal, and the complex types 
 (struct, list, map, and union).
 3. Lightweight indexes stored within the file:
    - skip row groups that don't pass predicate filtering
    - seek to a given row
 4. Block-mode compression based on data type:
    - run-length encoding for integer columns
    - dictionary encoding for string columns
 5. Concurrent reads of the same file using separate RecordReaders.
 6. The ability to split files without scanning for markers.
 7. A bound on the amount of memory needed for reading or writing.
 8. Metadata stored using Protocol Buffers, which allows addition and removal 
 of fields.
 Spark SQL already supports Parquet; supporting ORC gives people more options.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3720) support ORC in spark sql

2014-09-29 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-3720:
---
Description: 
The Optimized Row Columnar (ORC) file format provides a highly efficient way to 
store data on HDFS. The ORC file format has many advantages, such as:

1. A single file as the output of each task, which reduces the NameNode's load.
2. Hive type support, including datetime, decimal, and the complex types 
(struct, list, map, and union).
3. Lightweight indexes stored within the file:
   - skip row groups that don't pass predicate filtering
   - seek to a given row
4. Block-mode compression based on data type:
   - run-length encoding for integer columns
   - dictionary encoding for string columns
5. Concurrent reads of the same file using separate RecordReaders.
6. The ability to split files without scanning for markers.
7. A bound on the amount of memory needed for reading or writing.
8. Metadata stored using Protocol Buffers, which allows addition and removal of 
fields.

Spark SQL already supports Parquet; supporting ORC gives people more options.

