[jira] [Created] (SPARK-43079) Add bloom filter details in spark history server plans/SVGs

2023-04-10 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created SPARK-43079:


 Summary: Add bloom filter details in spark history server 
plans/SVGs
 Key: SPARK-43079
 URL: https://issues.apache.org/jira/browse/SPARK-43079
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.2
Reporter: Rajesh Balamohan


Spark's runtime bloom filter can be enabled via 
"spark.sql.optimizer.runtimeFilter.semiJoinReduction.enabled=true" and 
"spark.sql.optimizer.runtime.bloomFilter.enabled=true".

The Spark history server's SVG doesn't render the bloom filter details; it would 
be good to include this detail in the plan (as of now, it only shows up in the 
explain plan's text output).
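
A minimal sketch (spark-shell on Spark 3.3.x, with a SparkSession named "spark"; 
the tables t1/t2 and join key k are purely illustrative) of enabling the settings 
quoted above and checking where the bloom filter currently appears:

{noformat}
// Enable the runtime-filter settings referenced in this report.
spark.conf.set("spark.sql.optimizer.runtimeFilter.semiJoinReduction.enabled", "true")
spark.conf.set("spark.sql.optimizer.runtime.bloomFilter.enabled", "true")

// The injected runtime bloom filter is visible in the text explain output,
// but it is not rendered in the history server's plan SVG.
spark.sql("EXPLAIN FORMATTED SELECT * FROM t1 JOIN t2 ON t1.k = t2.k").show(false)
{noformat}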



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32225) Parquet footer information is read twice

2020-07-08 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created SPARK-32225:


 Summary: Parquet footer information is read twice
 Key: SPARK-32225
 URL: https://issues.apache.org/jira/browse/SPARK-32225
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rajesh Balamohan
 Attachments: spark_parquet_footer_reads.png

When running queries, Spark reads the Parquet footer information twice. In cloud 
environments this can be expensive, depending on the job and the number of splits. 
It would be nice to reuse the footer information already read via 
"ParquetInputFormat::buildReaderWithPartitionValues" (see the sketch after the 
links below).

 

!image-2020-07-08-14-24-23-470.png|width=726,height=730!

Lines of interest:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L271]


[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L326]

 

[https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L105]


[https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L111]
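
For illustration, a minimal sketch of reading a footer once with the parquet-mr 
API and reusing the resulting metadata (the file path is hypothetical; this is 
not the Spark-internal code path referenced above):

{noformat}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

val conf = new Configuration()
val file = HadoopInputFile.fromPath(new Path("/tmp/part-00000.parquet"), conf) // hypothetical path
val reader = ParquetFileReader.open(file)
try {
  // The footer (schema + row-group metadata) is read once here; the same
  // ParquetMetadata object could then be handed to whatever builds the record
  // reader instead of re-reading the file tail a second time.
  val footer   = reader.getFooter
  val schema   = footer.getFileMetaData.getSchema
  val rowCount = footer.getBlocks.asScala.map(_.getRowCount).sum
  println(s"schema=$schema, rows=$rowCount")
} finally {
  reader.close()
}
{noformat}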

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32225) Parquet footer information is read twice

2020-07-08 Thread Rajesh Balamohan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-32225:
-
Attachment: spark_parquet_footer_reads.png

> Parquet footer information is read twice
> 
>
> Key: SPARK-32225
> URL: https://issues.apache.org/jira/browse/SPARK-32225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rajesh Balamohan
>Priority: Minor
> Attachments: spark_parquet_footer_reads.png
>
>
> When running queries, spark reads parquet footer information twice. In cloud 
> env, this would turn out to be expensive (depending on the jobs, # of 
> splits). It would be nice to reuse the footer information already read via 
> "ParquetInputFormat::buildReaderWithPartitionValues"
>  
> !image-2020-07-08-14-24-23-470.png|width=726,height=730!
> Lines of interest:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L271]
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L326]
>  
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L105]
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L111]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32225) Parquet footer information is read twice

2020-07-08 Thread Rajesh Balamohan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-32225:
-
Description: 
When running queries, Spark reads the Parquet footer information twice. In cloud 
environments this can be expensive, depending on the job and the number of splits. 
It would be nice to reuse the footer information already read via 
"ParquetInputFormat::buildReaderWithPartitionValues".

 

!spark_parquet_footer_reads.png|width=640,height=644!

Lines of interest:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L271]

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L326]

 

[https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L105]

[https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L111]

 

  was:
When running queries, spark reads parquet footer information twice. In cloud 
env, this would turn out to be expensive (depending on the jobs, # of splits). 
It would be nice to reuse the footer information already read via 
"ParquetInputFormat::buildReaderWithPartitionValues"

 

!image-2020-07-08-14-24-23-470.png|width=726,height=730!

Lines of interest:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L271]


[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L326]

 

[https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L105]


[https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L111]

 


> Parquet footer information is read twice
> 
>
> Key: SPARK-32225
> URL: https://issues.apache.org/jira/browse/SPARK-32225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rajesh Balamohan
>Priority: Minor
> Attachments: spark_parquet_footer_reads.png
>
>
> When running queries, spark reads parquet footer information twice. In cloud 
> env, this would turn out to be expensive (depending on the jobs, # of 
> splits). It would be nice to reuse the footer information already read via 
> "ParquetInputFormat::buildReaderWithPartitionValues"
>  
> !spark_parquet_footer_reads.png|width=640,height=644!
> Lines of interest:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L271]
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L326]
>  
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L105]
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L111]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22599) Avoid extra reading for cached table

2017-12-04 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277754#comment-16277754
 ] 

Rajesh Balamohan commented on SPARK-22599:
--

[~CodingCat] - Thanks for sharing the results. The results mention "SPARK-22599, 
master branch, parquet". Does it mean that "SPARK-22599, master branch" was run 
with text data?



> Avoid extra reading for cached table
> 
>
> Key: SPARK-22599
> URL: https://issues.apache.org/jira/browse/SPARK-22599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Nan Zhu
>
> In the current implementation of Spark, InMemoryTableExec read all data in a 
> cached table, filter CachedBatch according to stats and pass data to the 
> downstream operators. This implementation makes it inefficient to reside the 
> whole table in memory to serve various queries against different partitions 
> of the table, which occupies a certain portion of our users' scenarios.
> The following is an example of such a use case:
> store_sales is a 1TB-sized table in cloud storage, which is partitioned by 
> 'location'. The first query, Q1, wants to output several metrics A, B, C for 
> all stores in all locations. After that, a small team of 3 data scientists 
> wants to do some causal analysis for the sales in different locations. To 
> avoid unnecessary I/O and parquet/orc parsing overhead, they want to cache 
> the whole table in memory in Q1.
> With the current implementation, even any one of the data scientists is only 
> interested in one out of three locations, the queries they submit to Spark 
> cluster is still reading 1TB data completely.
> The reason behind the extra reading operation is that we implement 
> CachedBatch as
> {code}
> case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: 
> InternalRow)
> {code}
> where "stats" is a part of every CachedBatch, so we can only filter batches 
> for output of InMemoryTableExec operator by reading all data in in-memory 
> table as input. The extra reading would be even more unacceptable when some 
> of the table's data is evicted to disks.
> We propose to introduce a new type of block, metadata block, for the 
> partitions of RDD representing data in the cached table. Every metadata block 
> contains stats info for all columns in a partition and is saved to 
> BlockManager when executing compute() method for the partition. To minimize 
> the number of bytes to read,
> More details can be found in design 
> doc:https://docs.google.com/document/d/1DSiP3ej7Wd2cWUPVrgqAtvxbSlu5_1ZZB6m_2t8_95Q/edit?usp=sharing
> performance test results:
> Environment: 6 Executors, each of which has 16 cores 90G memory
> dataset: 1T TPCDS data
> queries: tested 4 queries (Q19, Q46, Q34, Q27) in 
> https://github.com/databricks/spark-sql-perf/blob/c2224f37e50628c5c8691be69414ec7f5a3d919a/src/main/scala/com/databricks/spark/sql/perf/tpcds/ImpalaKitQueries.scala
> results: 
> https://docs.google.com/spreadsheets/d/1A20LxqZzAxMjW7ptAJZF4hMBaHxKGk3TBEQoAJXfzCI/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21971) Too many open files in Spark due to concurrent files being opened

2017-09-10 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-21971:


 Summary: Too many open files in Spark due to concurrent files 
being opened
 Key: SPARK-21971
 URL: https://issues.apache.org/jira/browse/SPARK-21971
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.1.0
Reporter: Rajesh Balamohan
Priority: Minor


When running Q67 of TPC-DS on a 1 TB dataset on a multi-node cluster, it 
consistently fails with a "too many open files" exception.

{noformat}
O scheduler.TaskSetManager: Finished task 25.0 in stage 844.0 (TID 243786) in 
394 ms on machine111.xyz (executor 2) (189/200)
17/08/20 10:33:45 INFO scheduler.TaskSetManager: Finished task 172.0 in stage 
844.0 (TID 243932) in 11996 ms on cn116-10.l42scl.hortonworks.com (executor 6) 
(190/200)
17/08/20 10:37:40 WARN scheduler.TaskSetManager: Lost task 144.0 in stage 844.0 
(TID 243904, machine1.xyz, executor 1): java.nio.file.FileSystemException: 
/grid/3/hadoop/yarn/local/usercache/rbalamohan/appcache/application_1490656001509_7207/blockmgr-5180e3f0-f7ed-44bb-affc-8f99f09ba7bc/28/temp_local_690afbf7-172d-4fdb-8492-3e2ebd8d5183:
 Too many open files
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at 
sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:335)
at 
org.apache.spark.io.NioBufferedFileInputStream.(NioBufferedFileInputStream.java:43)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.(UnsafeSorterSpillReader.java:75)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getIterator(UnsafeExternalSorter.java:607)
at 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:169)
at 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:173)
{noformat}

The cluster was configured with multiple cores per executor.

The window function uses "spark.sql.windowExec.buffer.spill.threshold=4096", which 
causes a large number of spills on larger datasets. With multiple cores per 
executor, this reproduces easily.

{{UnsafeExternalSorter::getIterator()}} invokes {{spillWriter.getReader}} for 
all the available spillWriters. {{UnsafeSorterSpillReader}} opens the file in 
its constructor and closes it later as part of its close() call. This causes the 
"too many open files" issue.
Note that this is not a file leak; it is rather a matter of how many files are 
open concurrently at any given time, depending on the dataset being processed.

One option could be to increase "spark.sql.windowExec.buffer.spill.threshold" 
so that fewer spill files are generated, but it is hard to determine the sweet 
spot for every workload. Another option is to raise the ulimit on open files to 
"unlimited", but that would not be a good production setting. It would be good to 
consider reducing the number of files opened concurrently by 
"UnsafeExternalSorter::getIterator".
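
A hedged sketch of the first option; the threshold value below is only an example 
and would need tuning per workload and executor memory:

{noformat}
import org.apache.spark.sql.SparkSession

// Raise the window-operator spill threshold so that each task produces fewer
// spill files (and therefore fewer concurrently open UnsafeSorterSpillReader
// handles). 262144 is an illustrative value, not a recommendation.
val spark = SparkSession.builder()
  .appName("q67-window-spill-tuning")
  .config("spark.sql.windowExec.buffer.spill.threshold", "262144")
  .getOrCreate()
{noformat}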










--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12998) Enable OrcRelation when connecting via spark thrift server

2016-11-10 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653491#comment-15653491
 ] 

Rajesh Balamohan commented on SPARK-12998:
--

Sure. Thanks [~dongjoon]

> Enable OrcRelation when connecting via spark thrift server
> --
>
> Key: SPARK-12998
> URL: https://issues.apache.org/jira/browse/SPARK-12998
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> When a user connects via spark-thrift server to execute SQL, it does not 
> enable PPD with ORC. It ends up creating MetastoreRelation which does not 
> have ORC PPD.  Purpose of this JIRA is to convert MetastoreRelation to 
> OrcRelation in HiveMetastoreCatalog, so that users can benefit from PPD even 
> when connecting to spark-thrift server.
> {noformat}
> For example, "explain select count(1) from  tpch_flat_orc_1000.lineitem where 
> l_shipdate = '1990-04-18'", current plan is 
> +--+--+
> |   plan  
>  |
> +--+--+
> | == Physical Plan == 
>  |
> | TungstenAggregate(key=[], 
> functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#17L]) 
>  |
> | +- Exchange SinglePartition, None   
>  |
> |+- WholeStageCodegen 
>  |
> |   :  +- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#20L])  |
> |   : +- Project  
>  |
> |   :+- Filter (l_shipdate#11 = 1990-04-18)   
>  |
> |   :   +- INPUT  
>  |
> |   +- HiveTableScan [l_shipdate#11], MetastoreRelation tpch_1000, 
> lineitem, None |
> +--+--+
> It would be good to change it to OrcRelation to do PPD with ORC, which 
> reduces the runtime by large margin.
>  
> +---+--+
> | 
> plan  
> |
> +---+--+
> | == Physical Plan == 
>   
> |
> | TungstenAggregate(key=[], 
> functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#70L]) 
>   
> |
> | +- Exchange SinglePartition, None   
>   
> |
> |+- WholeStageCodegen 
>   
> |
> |   :  +- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#106L])
>   |
> |   : +- Project  
>   
> |
> |   :+- Filter (_col10#64 = 1990-04-18)   
>   
> |
> |   :   +- INPUT  
>

[jira] [Updated] (SPARK-16948) Use metastore schema instead of inferring schema for ORC in HiveMetastoreCatalog

2016-09-02 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-16948:
-
Summary: Use metastore schema instead of inferring schema for ORC in 
HiveMetastoreCatalog  (was: Use metastore schema instead of inferring schema in 
ORC in HiveMetastoreCatalog)

> Use metastore schema instead of inferring schema for ORC in 
> HiveMetastoreCatalog
> 
>
> Key: SPARK-16948
> URL: https://issues.apache.org/jira/browse/SPARK-16948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Querying empty partitioned ORC tables from spark-sql throws exception with 
> "spark.sql.hive.convertMetastoreOrc=true".
> {noformat}
> java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:297)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:284)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToLogicalRelation(HiveMetastoreCatalog.scala:284)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$.org$apache$spark$sql$hive$HiveMetastoreCatalog$OrcConversions$$convertToOrcRelation(HiveMetastoreCatalo)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:423)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:414)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16948) Use metastore schema instead of inferring schema in ORC in HiveMetastoreCatalog

2016-09-02 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-16948:
-
Summary: Use metastore schema instead of inferring schema in ORC in 
HiveMetastoreCatalog  (was: Support empty orc table when converting hive serde 
table to data source table)

> Use metastore schema instead of inferring schema in ORC in 
> HiveMetastoreCatalog
> ---
>
> Key: SPARK-16948
> URL: https://issues.apache.org/jira/browse/SPARK-16948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Querying empty partitioned ORC tables from spark-sql throws exception with 
> "spark.sql.hive.convertMetastoreOrc=true".
> {noformat}
> java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:297)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:284)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToLogicalRelation(HiveMetastoreCatalog.scala:284)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$.org$apache$spark$sql$hive$HiveMetastoreCatalog$OrcConversions$$convertToOrcRelation(HiveMetastoreCatalo)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:423)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:414)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16948) Support empty orc table when converting hive serde table to data source table

2016-08-25 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-16948:
-
Summary: Support empty orc table when converting hive serde table to data 
source table  (was: Querying empty partitioned orc tables throws exception)

> Support empty orc table when converting hive serde table to data source table
> -
>
> Key: SPARK-16948
> URL: https://issues.apache.org/jira/browse/SPARK-16948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Querying empty partitioned ORC tables from spark-sql throws exception with 
> "spark.sql.hive.convertMetastoreOrc=true".
> {noformat}
> java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:297)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:284)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToLogicalRelation(HiveMetastoreCatalog.scala:284)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$.org$apache$spark$sql$hive$HiveMetastoreCatalog$OrcConversions$$convertToOrcRelation(HiveMetastoreCatalo)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:423)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:414)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17179) Consider improving partition pruning in HiveMetastoreCatalog

2016-08-21 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-17179:


 Summary: Consider improving partition pruning in 
HiveMetastoreCatalog
 Key: SPARK-17179
 URL: https://issues.apache.org/jira/browse/SPARK-17179
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rajesh Balamohan
Priority: Critical



Issue:
- Create an external table with 1000s of partitions.
- Running a simple query with partition predicates ends up listing all files for 
caching in ListingFileCatalog. This would turn out to be very slow with 
cloud-based FS access (e.g. S3). Even though ListingFileCatalog supports 
multi-threading, it ends up unnecessarily listing 1000+ files when the user is 
interested in just one partition.
- This adds additional overhead in HiveMetastoreCatalog, as it queries all 
partitions in convertToLogicalRelation 
(metastoreRelation.getHiveQlPartitions()). Partition-related details
are not passed in here, so it ends up overloading the Hive metastore.
- Also, if any partition changes, the cache is dirtied and has to be 
re-populated. It would be nice to prune the partitions in the metastore layer 
itself, so that only a few partitions are looked up via the FileSystem and only 
a few items are cached.

{noformat}
"CREATE EXTERNAL TABLE `ca_par_ext`(
  `customer_id` bigint,
  `account_id` bigint)
PARTITIONED BY (
  `effective_date` date)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://bucket_details/ca_par'"

explain select count(*) from ca_par_ext where effective_date between 
'2015-12-17' and '2015-12-18';

{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17036) Hadoop config caching could lead to memory pressure and high CPU usage in thrift server

2016-08-15 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420984#comment-15420984
 ] 

Rajesh Balamohan commented on SPARK-17036:
--

When a large number of jobs are run concurrently via the Spark thrift server, it 
starts consuming a large amount of CPU fairly soon. Since 
{{spark.hadoop.cloneConf=false}} by default, the job conf is cached for every RDD 
created in {{HadoopRDD.getJobConf}}. This creates large GC pressure and ends up 
causing the high CPU usage. It does not cause an OOM, as the cache is internally 
a soft-reference cache.
Creating this JIRA to explore whether this caching can be made optional, creating 
a new conf object instead.
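
For reference, a hedged sketch of the existing knob mentioned above; with it set 
to true, {{HadoopRDD}} clones a fresh conf instead of reusing the cached one, at 
the cost of the per-clone overhead:

{noformat}
import org.apache.spark.SparkConf

// Documented Spark setting (default: false). When true, HadoopRDD clones a new
// JobConf rather than going through the soft-reference cache discussed above.
// Shown only as a reference point for the proposed "make caching optional" change.
val conf = new SparkConf()
  .setAppName("thrift-server-conf-clone")
  .set("spark.hadoop.cloneConf", "true")
{noformat}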

> Hadoop config caching could lead to memory pressure and high CPU usage in 
> thrift server
> ---
>
> Key: SPARK-17036
> URL: https://issues.apache.org/jira/browse/SPARK-17036
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Creating this as a follow up jira to SPARK-12920.  Profiler output on the 
> caching is attached in SPARK-12920.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12920) Honor "spark.ui.retainedStages" to reduce mem-pressure

2016-08-12 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418827#comment-15418827
 ] 

Rajesh Balamohan commented on SPARK-12920:
--

Thanks [~vanzin]. I have created SPARK-17036 for the caching issue.

> Honor "spark.ui.retainedStages" to reduce mem-pressure
> --
>
> Key: SPARK-12920
> URL: https://issues.apache.org/jira/browse/SPARK-12920
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Fix For: 2.1.0
>
> Attachments: SPARK-12920.profiler.png, 
> SPARK-12920.profiler_job_progress_listner.png
>
>
> - Configured with fair-share-scheduler.
> - 4-5 users submitting/running jobs concurrently via spark-thrift-server
> - Spark thrift server spikes to1600+% CPU and stays there for long time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17036) Hadoop config caching could lead to memory pressure and high CPU usage in thrift server

2016-08-12 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-17036:


 Summary: Hadoop config caching could lead to memory pressure and 
high CPU usage in thrift server
 Key: SPARK-17036
 URL: https://issues.apache.org/jira/browse/SPARK-17036
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rajesh Balamohan
Priority: Minor


Creating this as a follow-up JIRA to SPARK-12920. Profiler output on the caching 
is attached to SPARK-12920.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12920) Honor "spark.ui.retainedStages" to reduce mem-pressure

2016-08-08 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12920:
-
Summary: Honor "spark.ui.retainedStages" to reduce mem-pressure  (was: Fix 
high CPU usage in spark thrift server with concurrent users)

> Honor "spark.ui.retainedStages" to reduce mem-pressure
> --
>
> Key: SPARK-12920
> URL: https://issues.apache.org/jira/browse/SPARK-12920
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: SPARK-12920.profiler.png, 
> SPARK-12920.profiler_job_progress_listner.png
>
>
> - Configured with fair-share-scheduler.
> - 4-5 users submitting/running jobs concurrently via spark-thrift-server
> - Spark thrift server spikes to1600+% CPU and stays there for long time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16948) Querying empty partitioned orc tables throws exception

2016-08-08 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-16948:
-
Description: 
Querying empty partitioned ORC tables from spark-sql throws an exception when 
"spark.sql.hive.convertMetastoreOrc=true" is set (a repro sketch follows the 
stack trace below).

{noformat}
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:297)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:284)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToLogicalRelation(HiveMetastoreCatalog.scala:284)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$.org$apache$spark$sql$hive$HiveMetastoreCatalog$OrcConversions$$convertToOrcRelation(HiveMetastoreCatalo)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:423)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:414)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
{noformat}
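
A repro sketch for the description above (spark-shell with Hive support; the 
table and column names are illustrative):

{noformat}
// Create a partitioned ORC table with no data, then query it while
// metastore ORC conversion is enabled; this hits the None.get shown above.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("CREATE TABLE empty_orc_part (id BIGINT) PARTITIONED BY (dt STRING) STORED AS ORC")
spark.sql("SELECT COUNT(*) FROM empty_orc_part").show()
{noformat}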



  was:
Querying empty partitioned ORC tables from spark-sql throws exception with 
"spark.sql.hive.convertMetastoreOrc=true".




> Querying empty partitioned orc tables throws exception
> --
>
> Key: SPARK-16948
> URL: https://issues.apache.org/jira/browse/SPARK-16948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Querying empty partitioned ORC tables from spark-sql throws exception with 
> "spark.sql.hive.convertMetastoreOrc=true".
> {noformat}
> java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:297)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:284)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToLogicalRelation(HiveMetastoreCatalog.scala:284)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$.org$apache$spark$sql$hive$HiveMetastoreCatalog$OrcConversions$$convertToOrcRelation(HiveMetastoreCatalo)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:423)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:414)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16948) Querying empty partitioned orc tables throw exceptions

2016-08-08 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-16948:


 Summary: Querying empty partitioned orc tables throw exceptions
 Key: SPARK-16948
 URL: https://issues.apache.org/jira/browse/SPARK-16948
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Rajesh Balamohan
Priority: Minor


Querying empty partitioned ORC tables from spark-sql throws an exception when 
"spark.sql.hive.convertMetastoreOrc=true" is set.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16948) Querying empty partitioned orc tables throws exception

2016-08-08 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-16948:
-
Summary: Querying empty partitioned orc tables throws exception  (was: 
Querying empty partitioned orc tables throw exceptions)

> Querying empty partitioned orc tables throws exception
> --
>
> Key: SPARK-16948
> URL: https://issues.apache.org/jira/browse/SPARK-16948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Querying empty partitioned ORC tables from spark-sql throws exception with 
> "spark.sql.hive.convertMetastoreOrc=true".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12920) Fix high CPU usage in spark thrift server with concurrent users

2016-08-07 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12920:
-
Summary: Fix high CPU usage in spark thrift server with concurrent users  
(was: Spark thrift server can run at very high CPU with concurrent users)

> Fix high CPU usage in spark thrift server with concurrent users
> ---
>
> Key: SPARK-12920
> URL: https://issues.apache.org/jira/browse/SPARK-12920
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: SPARK-12920.profiler.png, 
> SPARK-12920.profiler_job_progress_listner.png
>
>
> - Configured with fair-share-scheduler.
> - 4-5 users submitting/running jobs concurrently via spark-thrift-server
> - Spark thrift server spikes to1600+% CPU and stays there for long time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14387) Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc

2016-08-03 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14387:
-
Summary: Enable Hive-1.x ORC compatibility with 
spark.sql.hive.convertMetastoreOrc  (was: Exceptions thrown when querying ORC 
tables)

> Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
> -
>
> Key: SPARK-14387
> URL: https://issues.apache.org/jira/browse/SPARK-14387
> Project: Spark
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>
> In master branch, I tried to run TPC-DS queries (e.g Query27) at 200 GB 
> scale. Initially I got the following exception (as FileScanRDD has been made 
> the default in master branch)
> {noformat}
> 16/04/04 06:49:55 WARN TaskSetManager: Lost task 0.0 in stage 15.0. 
> java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:157)
> at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:146)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:69)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:60)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$6$$anon$1.hasNext(WholeStageCodegen.scala:361)
> {noformat}
> When running with "spark.sql.sources.fileScan=false", following exception is 
> thrown
> {noformat}
> 16/04/04 09:02:00 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING,
> java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist.
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at 
> org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.OrcTableScan.execute(OrcRelation.scala:317)
> at 
> org.apache.spark.sql.hive.orc.DefaultSource.buildInternalScan(OrcRelation.scala:124)
> at 
> 

[jira] [Updated] (SPARK-14752) LazilyGenerateOrdering throws NullPointerException

2016-04-25 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14752:
-
Summary: LazilyGenerateOrdering throws NullPointerException  (was: 
LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject)

> LazilyGenerateOrdering throws NullPointerException
> --
>
> Key: SPARK-14752
> URL: https://issues.apache.org/jira/browse/SPARK-14752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> codebase: spark master
> DataSet: TPC-DS
> Client: $SPARK_HOME/bin/beeline
> Example query to reproduce the issue:  
> select i_item_id from item order by i_item_id limit 10;
> Explain plan output
> {noformat}
> explain select i_item_id from item order by i_item_id limit 10;
> +--+--+
> | 
> plan  
>   
>  |
> +--+--+
> | == Physical Plan ==
> TakeOrderedAndProject(limit=10, orderBy=[i_item_id#1229 ASC], 
> output=[i_item_id#1229])
> +- WholeStageCodegen
>:  +- Project [i_item_id#1229]
>: +- Scan HadoopFiles[i_item_id#1229] Format: ORC, PushedFilters: [], 
> ReadSchema: struct  |
> +--+--+
> {noformat}
> Exception:
> {noformat}
> TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1791)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
>  

[jira] [Comment Edited] (SPARK-14752) LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject

2016-04-20 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249576#comment-15249576
 ] 

Rajesh Balamohan edited comment on SPARK-14752 at 4/20/16 11:42 AM:


Changing generatedOrdering in LazilyGeneratedOrdering to 
{noformat}
private[this] lazy val generatedOrdering = GenerateOrdering.generate(ordering)
{noformat} 
solves the issue and the query runs fine. Thought of checking with the 
committers' opinion before posting the PR for this.


was (Author: rajesh.balamohan):
Changing generatedOrdering in LazilyGeneratedOrdering to 
{noformat}
private[this] lazy val generatedOrdering = GenerateOrdering.generate(ordering)
{noformat} 
solves the issue and the query runs fine. 

> LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject
> -
>
> Key: SPARK-14752
> URL: https://issues.apache.org/jira/browse/SPARK-14752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> codebase: spark master
> DataSet: TPC-DS
> Client: $SPARK_HOME/bin/beeline
> Example query to reproduce the issue:  
> select i_item_id from item order by i_item_id limit 10;
> Explain plan output
> {noformat}
> explain select i_item_id from item order by i_item_id limit 10;
> +--+--+
> | 
> plan  
>   
>  |
> +--+--+
> | == Physical Plan ==
> TakeOrderedAndProject(limit=10, orderBy=[i_item_id#1229 ASC], 
> output=[i_item_id#1229])
> +- WholeStageCodegen
>:  +- Project [i_item_id#1229]
>: +- Scan HadoopFiles[i_item_id#1229] Format: ORC, PushedFilters: [], 
> ReadSchema: struct  |
> +--+--+
> {noformat}
> Exception:
> {noformat}
> TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1791)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at 

[jira] [Commented] (SPARK-14752) LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject

2016-04-20 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249576#comment-15249576
 ] 

Rajesh Balamohan commented on SPARK-14752:
--

Changing generatedOrdering in LazilyGeneratedOrdering to 
{noformat}
private[this] lazy val generatedOrdering = GenerateOrdering.generate(ordering)
{noformat} 
solves the issue and the query runs fine. 

> LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject
> -
>
> Key: SPARK-14752
> URL: https://issues.apache.org/jira/browse/SPARK-14752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> codebase: spark master
> DataSet: TPC-DS
> Client: $SPARK_HOME/bin/beeline
> Example query to reproduce the issue:  
> select i_item_id from item order by i_item_id limit 10;
> Explain plan output
> {noformat}
> explain select i_item_id from item order by i_item_id limit 10;
> +--+--+
> | 
> plan  
>   
>  |
> +--+--+
> | == Physical Plan ==
> TakeOrderedAndProject(limit=10, orderBy=[i_item_id#1229 ASC], 
> output=[i_item_id#1229])
> +- WholeStageCodegen
>:  +- Project [i_item_id#1229]
>: +- Scan HadoopFiles[i_item_id#1229] Format: ORC, PushedFilters: [], 
> ReadSchema: struct  |
> +--+--+
> {noformat}
> Exception:
> {noformat}
> TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1791)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> 

[jira] [Created] (SPARK-14752) LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject

2016-04-20 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14752:


 Summary: LazilyGenerateOrdering throws NullPointerException with 
TakeOrderedAndProject
 Key: SPARK-14752
 URL: https://issues.apache.org/jira/browse/SPARK-14752
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Rajesh Balamohan


codebase: spark master

DataSet: TPC-DS

Client: $SPARK_HOME/bin/beeline

Example query to reproduce the issue:  
select i_item_id from item order by i_item_id limit 10;

Explain plan output
{noformat}
explain select i_item_id from item order by i_item_id limit 10;
+--+--+
|   
  plan  

   |
+--+--+
| == Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[i_item_id#1229 ASC], 
output=[i_item_id#1229])
+- WholeStageCodegen
   :  +- Project [i_item_id#1229]
   : +- Scan HadoopFiles[i_item_id#1229] Format: ORC, PushedFilters: [], 
ReadSchema: struct  |
+--+--+
{noformat}

Exception:
{noformat}
TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
underlying (org.apache.spark.util.BoundedPriorityQueue)
at 
com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
at 
org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1791)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
at java.util.PriorityQueue.offer(PriorityQueue.java:344)
at java.util.PriorityQueue.add(PriorityQueue.java:321)
at 
com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
at 
com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at 
com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
{noformat}
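A possible direction, based only on the stack trace above (an assumption, not a confirmed fix): Kryo's FieldSerializer does not restore transient state, so the code-generated comparator inside LazilyGeneratedOrdering can be null after the task result is deserialized on the driver. A minimal sketch of re-generating it on first use (class and field names are illustrative, not the actual patch):

{noformat}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.SortOrder
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering

// Sketch only: rebuild the generated ordering if it was lost during
// (de)serialization instead of assuming it is always present.
class NullSafeLazyOrdering(ordering: Seq[SortOrder]) extends Ordering[InternalRow] {

  @transient private var generated: Ordering[InternalRow] = _

  override def compare(a: InternalRow, b: InternalRow): Int = {
    if (generated == null) {
      // Kryo skips transient fields, so re-generate after deserialization.
      generated = GenerateOrdering.generate(ordering)
    }
    generated.compare(a, b)
  }
}
{noformat}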



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS

2016-04-19 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14521:
-
Summary: StackOverflowError in Kryo when executing TPC-DS  (was: 
StackOverflowError in Kryo when executing TPC-DS Query27)

> StackOverflowError in Kryo when executing TPC-DS
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Blocker
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet:TPC-DS at 200 GB scale in Parq format stored in hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyways)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27

2016-04-19 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249012#comment-15249012
 ] 

Rajesh Balamohan edited comment on SPARK-14521 at 4/20/16 12:31 AM:


Update:
- By default, the spark-thrift server disables "spark.kryo.referenceTracking" (if 
it is not specified in the conf).
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L55

- When "spark.kryo.referenceTracking" is explicitly set to true in 
spark-defaults.conf, the query executes successfully. Alternatively, 
"spark.sql.autoBroadcastJoinThreshold" can be set to a very low value to 
prevent broadcasting (this was done just for verification).

- Recent changes in LongHashedRelation could have introduced loops, which would 
require "spark.kryo.referenceTracking=true" in the spark-thrift server. I will 
create a PR for this.


was (Author: rajesh.balamohan):
Update:
- By default, the spark-thrift server disables "spark.kryo.referenceTracking".
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L55

- When "spark.kryo.referenceTracking" is explicitly set to true in 
spark-defaults.conf, the query executes successfully. Alternatively, 
"spark.sql.autoBroadcastJoinThreshold" can be set to a very low value to 
prevent broadcasting (this was done just for verification).

- Recent changes in LongHashedRelation could have introduced loops, which would 
require "spark.kryo.referenceTracking=true" in the spark-thrift server. I will 
create a PR for this.

> StackOverflowError in Kryo when executing TPC-DS Query27
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Blocker
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet:TPC-DS at 200 GB scale in Parq format stored in hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyways)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> 

[jira] [Commented] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27

2016-04-19 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249012#comment-15249012
 ] 

Rajesh Balamohan commented on SPARK-14521:
--

Update:
- By default, the spark-thrift server disables "spark.kryo.referenceTracking".
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L55

- When "spark.kryo.referenceTracking" is explicitly set to true in 
spark-defaults.conf, the query executes successfully (a minimal SparkConf sketch 
is included below). Alternatively, "spark.sql.autoBroadcastJoinThreshold" can be 
set to a very low value to prevent broadcasting (this was done just for 
verification).

- Recent changes in LongHashedRelation could have introduced loops, which would 
require "spark.kryo.referenceTracking=true" in the spark-thrift server. I will 
create a PR for this.
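A minimal sketch of the configuration described above, expressed as SparkConf settings (the same keys can go into spark-defaults.conf); the threshold line is only for the verification path, since a negative value effectively disables broadcast joins:

{noformat}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Re-enable Kryo reference tracking so object graphs with cycles serialize
  // without blowing the stack.
  .set("spark.kryo.referenceTracking", "true")

// For verification only: prevent broadcasting by lowering the threshold.
// conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
{noformat}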

> StackOverflowError in Kryo when executing TPC-DS Query27
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Blocker
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet:TPC-DS at 200 GB scale in Parq format stored in hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyways)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (SPARK-14588) Consider getting column stats from files (wherever feasible) to get better stats for joins

2016-04-12 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14588:


 Summary: Consider getting column stats from files (wherever 
feasible) to get better stats for joins
 Key: SPARK-14588
 URL: https://issues.apache.org/jira/browse/SPARK-14588
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rajesh Balamohan


Whether a broadcast join is used is determined by 
"spark.sql.autoBroadcastJoinThreshold". The stats for this decision are estimated 
from the files and the projected columns (internally, 20 bytes is assumed per 
string column). However, the estimated stats can be inaccurate if the dataset 
contains string columns wider than 20 bytes; in such cases, the broadcast join 
would not be invoked.

File formats like ORC can provide the raw data size for the projected columns. 
It would be good to consider those (whenever available) to determine more 
accurate stats for the broadcast threshold.
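For context only (this illustrates the knobs involved, it is not this ticket's proposal): today users can work around a bad estimate by raising the threshold or forcing the broadcast explicitly. The DataFrame and column names below are hypothetical, and a SQLContext named sqlContext is assumed (as in the shell):

{noformat}
import org.apache.spark.sql.functions.broadcast

// Raise the size limit under which the smaller join side is broadcast.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50L * 1024 * 1024).toString)

// Or force a broadcast join explicitly when the size estimate is known to be off.
val joined = largeFactDf.join(broadcast(smallDimDf), Seq("item_sk"))
{noformat}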



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14551) Reduce number of NameNode calls in OrcRelation with FileSourceStrategy mode

2016-04-11 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14551:
-
Summary: Reduce number of NameNode calls in OrcRelation with 
FileSourceStrategy mode  (was: Reduce number of NN calls in OrcRelation with 
FileSourceStrategy mode)

> Reduce number of NameNode calls in OrcRelation with FileSourceStrategy mode
> ---
>
> Key: SPARK-14551
> URL: https://issues.apache.org/jira/browse/SPARK-14551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> When FileSourceStrategy is used, a record reader is created, which incurs a 
> NameNode call internally. Later, in OrcRelation.unwrapOrcStructs, it ends up 
> reading the file information again to get the ObjectInspector, which incurs an 
> additional NameNode call. It would be good to avoid this additional NameNode 
> call (specifically for partitioned datasets).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27

2016-04-10 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234380#comment-15234380
 ] 

Rajesh Balamohan commented on SPARK-14521:
--

A build with commit f8c9beca38f1f396eb3220b23db6d77112a50293 does not have this 
issue. Suspecting this to be a Kryo 3.0.3 upgrade issue.

> StackOverflowError in Kryo when executing TPC-DS Query27
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet:TPC-DS at 200 GB scale in Parq format stored in hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyways)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27

2016-04-10 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14521:
-
Summary: StackOverflowError in Kryo when executing TPC-DS Query27  (was: 
StackOverflowError when executing TPC-DS Query27)

> StackOverflowError in Kryo when executing TPC-DS Query27
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet:TPC-DS at 200 GB scale in Parq format stored in hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyways)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14521) StackOverflowError when executing TPC-DS Query27

2016-04-10 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14521:


 Summary: StackOverflowError when executing TPC-DS Query27
 Key: SPARK-14521
 URL: https://issues.apache.org/jira/browse/SPARK-14521
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Rajesh Balamohan


Build details:  Spark build from master branch (Apr-10)

DataSet: TPC-DS at 200 GB scale in Parquet format stored in Hive.

Client: $SPARK_HOME/bin/beeline 

Query:  TPC-DS Query27

spark.sql.sources.fileScan=true (this is the default value anyways)

Exception:
{noformat}
Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14520) ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true

2016-04-10 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14520:
-
Description: 
Build details: Spark build from master branch (Apr-10)

TPC-DS at 200 GB scale stored in Parquet format in Hive.

Ran TPC-DS Query27 via Spark beeline client with 
"spark.sql.sources.fileScan=false".

{noformat}
 java.lang.ClassCastException: 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
 cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}



Creating this JIRA as a placeholder to track this issue.


  was:
Build details: Spark build from master branch (Apr-10)

TPC-DS at 200 GB scale stored in Parq format stored in hive.

Ran TPC-DS Query27 via Spark beeline client.

{noformat}
 java.lang.ClassCastException: 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
 cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}


Creating this JIRA as a placeholder to track this issue.



> ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true
> 
>
> Key: SPARK-14520
> URL: https://issues.apache.org/jira/browse/SPARK-14520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Build details: Spark build from master branch (Apr-10)
> TPC-DS at 200 GB scale stored in Parq format stored in hive.
> Ran TPC-DS Query27 via Spark beeline client with 
> "spark.sql.sources.fileScan=false".
> {noformat}
>  java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
>  cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
> at 
> 

[jira] [Updated] (SPARK-14520) ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true

2016-04-10 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14520:
-
Description: 
Build details: Spark build from master branch (Apr-10)

TPC-DS at 200 GB scale stored in Parquet format in Hive.

Ran TPC-DS Query27 via Spark beeline client.

{noformat}
 java.lang.ClassCastException: 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
 cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}


Creating this JIRA as a placeholder to track this issue.


  was:
Build details: Spark build from master branch (Apr-10)

TPC-DS at 200 GB scale stored in Parq format stored in hive.

Ran TPC-DS Query27 via Spark beeline client.

{noformat}
 java.lang.ClassCastException: 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
 cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}


Creating this JIRA as a placeholder to fix this issue.



> ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true
> 
>
> Key: SPARK-14520
> URL: https://issues.apache.org/jira/browse/SPARK-14520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Build details: Spark build from master branch (Apr-10)
> TPC-DS at 200 GB scale stored in Parq format stored in hive.
> Ran TPC-DS Query27 via Spark beeline client.
> {noformat}
>  java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
>  cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
> at 
> 

[jira] [Created] (SPARK-14520) ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true

2016-04-10 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14520:


 Summary: ClasscastException thrown with 
spark.sql.parquet.enableVectorizedReader=true
 Key: SPARK-14520
 URL: https://issues.apache.org/jira/browse/SPARK-14520
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Rajesh Balamohan


Build details: Spark build from master branch (Apr-10)

TPC-DS at 200 GB scale stored in Parquet format in Hive.

Ran TPC-DS Query27 via Spark beeline client.

{noformat}
 java.lang.ClassCastException: 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
 cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}


Creating this JIRA as a placeholder to fix this issue.
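A possible interim workaround (an assumption, not verified in this report): fall back to the non-vectorized Parquet reader until the cast issue is resolved. A SQLContext named sqlContext is assumed, as in the shell:

{noformat}
// Disable the vectorized Parquet reader for the current session.
sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "false")
{noformat}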




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14387) Exceptions thrown when querying ORC tables

2016-04-04 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14387:


 Summary: Exceptions thrown when querying ORC tables
 Key: SPARK-14387
 URL: https://issues.apache.org/jira/browse/SPARK-14387
 Project: Spark
  Issue Type: Bug
Reporter: Rajesh Balamohan


In the master branch, I tried to run TPC-DS queries (e.g., Query27) at 200 GB scale. 
Initially, I got the following exception (as FileScanRDD has been made the 
default in the master branch):

{noformat}
16/04/04 06:49:55 WARN TaskSetManager: Lost task 0.0 in stage 15.0. 
java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.
at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
at 
org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
at 
org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
at 
org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
at 
org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:157)
at 
org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:146)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:69)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:60)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$6$$anon$1.hasNext(WholeStageCodegen.scala:361)
{noformat}

When running with "spark.sql.sources.fileScan=false", the following exception is 
thrown:

{noformat}
16/04/04 09:02:00 ERROR SparkExecuteStatementOperation: Error executing query, 
currentState RUNNING,
java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist.
at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
at 
org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at 
org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
at 
org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
at 
org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
at 
org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
at 
org.apache.spark.sql.hive.orc.OrcTableScan.execute(OrcRelation.scala:317)
at 
org.apache.spark.sql.hive.orc.DefaultSource.buildInternalScan(OrcRelation.scala:124)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$16.apply(DataSourceStrategy.scala:229)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$16.apply(DataSourceStrategy.scala:228)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:537)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:536)
at 

[jira] [Updated] (SPARK-14321) Reduce date format cost in date functions

2016-04-03 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14321:
-
Summary: Reduce date format cost in date functions  (was: Reduce date 
format cost and string-to-date cost in date functions)

> Reduce date format cost in date functions
> -
>
> Key: SPARK-14321
> URL: https://issues.apache.org/jira/browse/SPARK-14321
> Project: Spark
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Currently the code generated is
> {noformat}
> /* 066 */ UTF8String primitive5 = null;
> /* 067 */ if (!isNull4) {
> /* 068 */   try {
> /* 069 */ primitive5 = UTF8String.fromString(new 
> java.text.SimpleDateFormat("-MM-dd HH:mm:ss").format(
> /* 070 */ new java.util.Date(primitive7 * 1000L)));
> /* 071 */   } catch (java.lang.Throwable e) {
> /* 072 */ isNull4 = true;
> /* 073 */   }
> /* 074 */ }
> {noformat}
> Instantiation of SimpleDateFormat is fairly expensive. It can be created on an 
> as-needed basis instead.
> I will share the patch soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14321) Reduce date format cost and string-to-date cost in date functions

2016-03-31 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14321:
-
Summary: Reduce date format cost and string-to-date cost in date functions  
(was: Reduce DateFormat cost in datetimeExpressions)

> Reduce date format cost and string-to-date cost in date functions
> -
>
> Key: SPARK-14321
> URL: https://issues.apache.org/jira/browse/SPARK-14321
> Project: Spark
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Currently the code generated is
> {noformat}
> /* 066 */ UTF8String primitive5 = null;
> /* 067 */ if (!isNull4) {
> /* 068 */   try {
> /* 069 */ primitive5 = UTF8String.fromString(new 
> java.text.SimpleDateFormat("-MM-dd HH:mm:ss").format(
> /* 070 */ new java.util.Date(primitive7 * 1000L)));
> /* 071 */   } catch (java.lang.Throwable e) {
> /* 072 */ isNull4 = true;
> /* 073 */   }
> /* 074 */ }
> {noformat}
> Instantiation of SimpleDateFormat is fairly expensive. It can be created on an 
> as-needed basis instead.
> I will share the patch soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14321) Reduce DateFormat cost in datetimeExpressions

2016-03-31 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14321:


 Summary: Reduce DateFormat cost in datetimeExpressions
 Key: SPARK-14321
 URL: https://issues.apache.org/jira/browse/SPARK-14321
 Project: Spark
  Issue Type: Bug
Reporter: Rajesh Balamohan
Priority: Minor


Currently the code generated is

{noformat}
/* 066 */ UTF8String primitive5 = null;
/* 067 */ if (!isNull4) {
/* 068 */   try {
/* 069 */ primitive5 = UTF8String.fromString(new 
java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
/* 070 */ new java.util.Date(primitive7 * 1000L)));
/* 071 */   } catch (java.lang.Throwable e) {
/* 072 */ isNull4 = true;
/* 073 */   }
/* 074 */ }
{noformat}

Instantiation of SimpleDateFormat is fairly expensive. It can be created on an 
as-needed basis instead.

I will share the patch soon.
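As a rough illustration of the idea (a sketch, not the actual patch): the formatter can be created once and reused across rows, e.g. via a ThreadLocal, since SimpleDateFormat is not thread-safe. The object and method names below are illustrative only:

{noformat}
import java.text.SimpleDateFormat
import java.util.Date

// Sketch: instantiate the (expensive) formatter once per thread and reuse it
// for every row, instead of creating a new SimpleDateFormat per evaluation.
object CachedDateFormat {
  private val formatter = new ThreadLocal[SimpleDateFormat] {
    override def initialValue(): SimpleDateFormat =
      new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  }

  def formatSeconds(seconds: Long): String =
    formatter.get().format(new Date(seconds * 1000L))
}
{noformat}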



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14286) Empty ORC table join throws exception

2016-03-30 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14286:


 Summary: Empty ORC table join throws exception
 Key: SPARK-14286
 URL: https://issues.apache.org/jira/browse/SPARK-14286
 Project: Spark
  Issue Type: Bug
Reporter: Rajesh Balamohan
Priority: Minor


When joining with an empty ORC table, Spark throws the following exception:

{noformat}
java.sql.SQLException: java.lang.IllegalArgumentException: orcFileOperator: 
path /apps/hive/warehouse/test.db/table does not have valid orc files 
matching the pattern
{noformat}
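A minimal reproduction sketch (table and column names are hypothetical; a HiveContext bound to sqlContext is assumed, as in the spark-shell):

{noformat}
// Create an empty ORC table and join against it; the join fails with the
// "does not have valid orc files matching the pattern" error above.
sqlContext.sql("CREATE TABLE empty_orc (id INT) STORED AS ORC")
sqlContext.sql("SELECT t.id FROM some_table t JOIN empty_orc e ON t.id = e.id").show()
{noformat}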




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14113) Consider marking JobConf closure-cleaning in HadoopRDD as optional

2016-03-24 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210038#comment-15210038
 ] 

Rajesh Balamohan commented on SPARK-14113:
--

[~srowen] - In some cases, queries have 5000+ RDDs, and this cleanup gets called 
whenever a HadoopRDD gets initialized, causing a bottleneck. The overall runtime 
increases by a couple of seconds, which matters when the entire job runtime itself 
is small. For instance, SQLHadoopRDD does not have this kind of check.

> Consider marking JobConf closure-cleaning in HadoopRDD as optional
> --
>
> Key: SPARK-14113
> URL: https://issues.apache.org/jira/browse/SPARK-14113
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> In HadoopRDD, the following code was introduced as a part of SPARK-6943.
> {noformat}
>   if (initLocalJobConfFuncOpt.isDefined) {
> sparkContext.clean(initLocalJobConfFuncOpt.get)
>   }
> {noformat}
> When working on one of the changes in OrcRelation, I tried passing 
> initLocalJobConfFuncOpt to HadoopRDD and that incurred a significant performance 
> penalty (due to closure cleaning) with large RDDs. This would be invoked for 
> every HadoopRDD initialization, causing a bottleneck.
> example threadstack is given below
> {noformat}
> at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.readUTF8(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
> at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:402)
> at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:390)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
> at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:102)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at 
> org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:390)
> at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
> at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:224)
> at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:223)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:223)
> at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2079)
> at 
> org.apache.spark.rdd.HadoopRDD.(HadoopRDD.scala:112){noformat}
> Creating this JIRA to explore the possibility of removing it or marking it 
> optional.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14113) Consider marking JobConf closure-cleaning in HadoopRDD as optional

2016-03-24 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14113:


 Summary: Consider marking JobConf closure-cleaning in HadoopRDD as 
optional
 Key: SPARK-14113
 URL: https://issues.apache.org/jira/browse/SPARK-14113
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Rajesh Balamohan


In HadoopRDD, the following code was introduced as a part of SPARK-6943.

{noformat}
  if (initLocalJobConfFuncOpt.isDefined) {
sparkContext.clean(initLocalJobConfFuncOpt.get)
  }
{noformat}

When working on one of the changes in OrcRelation, I tried passing 
initLocalJobConfFuncOpt to HadoopRDD and that incurred a significant performance 
penalty (due to closure cleaning) with large RDDs. This would be invoked for every 
HadoopRDD initialization, causing a bottleneck.

An example thread stack is given below:

{noformat}
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.readUTF8(Unknown Source)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at 
org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:402)
at 
org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:390)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at 
scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:102)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:390)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at 
org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:224)
at 
org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:223)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:223)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2079)
at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:112){noformat}

Creating this JIRA to explore the possibility of removing it or marking it 
optional.
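
A rough sketch of how the cleaning could be made optional is given below. The 
spark.hadoop.cleanJobConfClosure conf name and the helper object are hypothetical 
and only illustrate the idea; this is not an actual patch.

{noformat}
package org.apache.spark.rdd

import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext

// Hypothetical helper: gate the ASM-based closure cleaning behind an opt-out flag
// so code paths that construct many HadoopRDDs can skip it.
private[spark] object JobConfClosureCleaning {
  def maybeClean(sc: SparkContext, initLocalJobConfFuncOpt: Option[JobConf => Unit]): Unit = {
    val cleanEnabled = sc.getConf.getBoolean("spark.hadoop.cleanJobConfClosure", true)
    if (cleanEnabled && initLocalJobConfFuncOpt.isDefined) {
      // This is the expensive call that shows up in the thread stack above.
      sc.clean(initLocalJobConfFuncOpt.get)
    }
  }
}
{noformat}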



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14091) Consider improving performance of SparkContext.getCallSite()

2016-03-22 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14091:


 Summary: Consider improving performance of 
SparkContext.getCallSite()
 Key: SPARK-14091
 URL: https://issues.apache.org/jira/browse/SPARK-14091
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Rajesh Balamohan


Currently SparkContext.getCallSite() makes a call to Utils.getCallSite().

{noformat}
  private[spark] def getCallSite(): CallSite = {
    val callSite = Utils.getCallSite()
    CallSite(
      Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
      Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
    )
  }
{noformat}

However, in some places Utils.withDummyCallSite(sc) is invoked to avoid the 
expensive thread dumps within getCallSite(). But Utils.getCallSite() is evaluated 
eagerly, so the thread dumps are computed anyway. This has a real impact when lots 
of RDDs are created (e.g. it spends close to 3-7 seconds when 1000+ RDDs are 
present, which is significant when the entire query runtime is on the order of 
10-20 seconds).

Creating this JIRA to consider evaluating getCallSite() only when it is actually needed.
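
A minimal sketch of the lazy variant, shown as it would sit inside SparkContext 
(illustrative only, not necessarily the change that gets merged):

{noformat}
  private[spark] def getCallSite(): CallSite = {
    // Defer the thread-dump walk; it only runs if a local property is missing.
    lazy val callSite = Utils.getCallSite()
    CallSite(
      Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
      Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
    )
  }
{noformat}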



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12925) Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject

2016-03-02 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176734#comment-15176734
 ] 

Rajesh Balamohan commented on SPARK-12925:
--

The earlier fix had a problem when the Text object was reused. Posting a revised 
patch for review that fixes that case.

> Improve HiveInspectors.unwrap for 
> StringObjectInspector.getPrimitiveWritableObject
> --
>
> Key: SPARK-12925
> URL: https://issues.apache.org/jira/browse/SPARK-12925
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Fix For: 2.0.0
>
> Attachments: SPARK-12925_profiler_cpu_samples.png
>
>
> Text is in UTF-8 and converting it via "UTF8String.fromString" incurs 
> decoding and encoding, which turns out to be expensive. (to be specific: 
> https://github.com/apache/spark/blob/0d543b98f3e3da5053f0476f4647a765460861f3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L323)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13542) Fix HiveInspectors.unwrap

2016-02-28 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-13542:
-
Summary: Fix HiveInspectors.unwrap  (was: Fix HiveInspectors.unwrap (Hive 
suite failures))

> Fix HiveInspectors.unwrap
> -
>
> Key: SPARK-13542
> URL: https://issues.apache.org/jira/browse/SPARK-13542
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>
> Related to SPARK-12925. Text object might be reused in higher layers and this 
> could have wrong results in UTF8String. Instead, it should copy the byte 
> array and send it to UTF8String. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13542) Fix HiveInspectors.unwrap (Hive suite failures)

2016-02-28 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-13542:


 Summary: Fix HiveInspectors.unwrap (Hive suite failures)
 Key: SPARK-13542
 URL: https://issues.apache.org/jira/browse/SPARK-13542
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Rajesh Balamohan


Related to SPARK-12925. The Text object might be reused in higher layers, which 
could lead to wrong results in the UTF8String. Instead, the byte array should be 
copied before it is handed to UTF8String.
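
A minimal sketch of the copy-based unwrap, using the standard Text and UTF8String 
APIs (illustrative only, not the merged change):

{noformat}
import org.apache.hadoop.io.Text
import org.apache.spark.unsafe.types.UTF8String

// Text already holds UTF-8 bytes, so avoid the decode/encode round trip of
// UTF8String.fromString(text.toString). Copy only the valid bytes, because Hadoop
// reuses the same Text instance across records.
def unwrapText(text: Text): UTF8String = {
  val copy = new Array[Byte](text.getLength)
  System.arraycopy(text.getBytes, 0, copy, 0, text.getLength)
  UTF8String.fromBytes(copy)
}
{noformat}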



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13059) Sort inputsplits by size in HadoopRDD to avoid long tails

2016-01-28 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-13059:


 Summary: Sort inputsplits by size in HadoopRDD to avoid long tails
 Key: SPARK-13059
 URL: https://issues.apache.org/jira/browse/SPARK-13059
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Rajesh Balamohan


HadoopRDD.getPartitions invokes getSplits on the InputFormat and returns 
HadoopPartitions. The generated input splits are not always of equal size, and some 
splits can be much smaller than others. If the bigger splits are scheduled at the 
end of the job, there is a possibility of a long tail in the job. Sorting the input 
splits by size (in descending order) helps schedule the larger splits upfront. This 
could also help with speculation.
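
A minimal sketch of the ordering, assuming the old-API InputSplit and leaving out 
the partition-index bookkeeping that getPartitions would still need:

{noformat}
import org.apache.hadoop.mapred.InputSplit

// Largest splits first, so they are scheduled upfront and do not form a long tail.
def sortSplitsBySize(splits: Array[InputSplit]): Array[InputSplit] =
  splits.sortBy(split => -split.getLength)
{noformat}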



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13059) Sort inputsplits by size in HadoopRDD to avoid long tails

2016-01-28 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121214#comment-15121214
 ] 

Rajesh Balamohan commented on SPARK-13059:
--

Thanks [~srowen]. The same problem would exist even today, if InputFormats 
choose to change the split calculation (for any reason). 

> Sort inputsplits by size in HadoopRDD to avoid long tails
> -
>
> Key: SPARK-13059
> URL: https://issues.apache.org/jira/browse/SPARK-13059
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Rajesh Balamohan
>
> HadoopRDD.getPartitions invokes getSplits from the inputformat and returns 
> the HadoopPartition.  There are cases where the input splits generated are 
> not  of equal sizes all the time and some splits would be much smaller than 
> others.   If bigger splits are scheduled at the end of the job, there is a 
> possibility of getting long tail in the job.  Sorting the input splits by 
> size (in descending order) can help in scheduling the larger splits upfront. 
> This could also help in speculation as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12998) Enable OrcRelation when connecting via spark thrift server

2016-01-26 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-12998:


 Summary: Enable OrcRelation when connecting via spark thrift server
 Key: SPARK-12998
 URL: https://issues.apache.org/jira/browse/SPARK-12998
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rajesh Balamohan


When a user connects via the spark-thrift server to execute SQL, PPD with ORC is not 
enabled: the query ends up creating a MetastoreRelation, which does not have ORC 
PPD. The purpose of this JIRA is to convert MetastoreRelation to OrcRelation in 
HiveMetastoreCatalog, so that users can benefit from PPD even when connecting to the 
spark-thrift server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12998) Enable OrcRelation when connecting via spark thrift server

2016-01-26 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12998:
-
Description: 
When a user connects via the spark-thrift server to execute SQL, PPD with ORC is not 
enabled: the query ends up creating a MetastoreRelation, which does not have ORC 
PPD. The purpose of this JIRA is to convert MetastoreRelation to OrcRelation in 
HiveMetastoreCatalog, so that users can benefit from PPD even when connecting to the 
spark-thrift server.

For example, "explain select count(1) from  tpch_flat_orc_1000.lineitem where 
l_shipdate = '1990-04-18'", current plan is 

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#17L])
+- Exchange SinglePartition, None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#20L])
      :     +- Project
      :        +- Filter (l_shipdate#11 = 1990-04-18)
      :           +- INPUT
      +- HiveTableScan [l_shipdate#11], MetastoreRelation tpch_1000, lineitem, None

It would be good to change it to an OrcRelation to do PPD with ORC, which reduces 
the runtime by a large margin.
 
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#70L])
+- Exchange SinglePartition, None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#106L])
      :     +- Project
      :        +- Filter (_col10#64 = 1990-04-18)
      :           +- INPUT
      +- Scan OrcRelation[_col10#64] InputPaths: hdfs://nn:8020/apps/hive/warehouse/tpch_1000.db/lineitem, PushedFilters: [EqualTo(_col10,1990-04-18)]

  was:When a user connects via spark-thrift server to execute SQL, it does not 
enable PPD with ORC. It ends up creating MetastoreRelation which does not 

[jira] [Updated] (SPARK-12998) Enable OrcRelation when connecting via spark thrift server

2016-01-26 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12998:
-
Description: 
When a user connects via the spark-thrift server to execute SQL, PPD with ORC is not 
enabled: the query ends up creating a MetastoreRelation, which does not have ORC 
PPD. The purpose of this JIRA is to convert MetastoreRelation to OrcRelation in 
HiveMetastoreCatalog, so that users can benefit from PPD even when connecting to the 
spark-thrift server.

{noformat}
For example, "explain select count(1) from  tpch_flat_orc_1000.lineitem where 
l_shipdate = '1990-04-18'", current plan is 

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#17L])
+- Exchange SinglePartition, None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#20L])
      :     +- Project
      :        +- Filter (l_shipdate#11 = 1990-04-18)
      :           +- INPUT
      +- HiveTableScan [l_shipdate#11], MetastoreRelation tpch_1000, lineitem, None

It would be good to change it to an OrcRelation to do PPD with ORC, which reduces 
the runtime by a large margin.
 
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#70L])
+- Exchange SinglePartition, None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#106L])
      :     +- Project
      :        +- Filter (_col10#64 = 1990-04-18)
      :           +- INPUT
      +- Scan OrcRelation[_col10#64] InputPaths: hdfs://nn:8020/apps/hive/warehouse/tpch_1000.db/lineitem, PushedFilters: [EqualTo(_col10,1990-04-18)]

{noformat}

  was:
When a user connects via spark-thrift server to execute SQL, it does not enable 
PPD with ORC. It ends up creating 

[jira] [Updated] (SPARK-12948) Consider reducing size of broadcasts in OrcRelation

2016-01-24 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12948:
-
Attachment: SPARK-12948.mem.prof.snapshot.png

> Consider reducing size of broadcasts in OrcRelation
> ---
>
> Key: SPARK-12948
> URL: https://issues.apache.org/jira/browse/SPARK-12948
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: SPARK-12948.mem.prof.snapshot.png, 
> SPARK-12948_cpuProf.png
>
>
> Size of broadcasted data in OrcRelation was significantly higher when running 
> query with large number of partitions (e.g TPC-DS). Consider reducing the 
> size of the broadcasted data in OrcRelation, as it has an impact on the job 
> runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12948) Consider reducing size of broadcasts in OrcRelation

2016-01-20 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-12948:


 Summary: Consider reducing size of broadcasts in OrcRelation
 Key: SPARK-12948
 URL: https://issues.apache.org/jira/browse/SPARK-12948
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rajesh Balamohan


The size of the broadcasted data in OrcRelation was significantly higher when 
running queries with a large number of partitions (e.g. TPC-DS). Consider reducing 
the size of the broadcasted data in OrcRelation, as it has an impact on the job 
runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12948) Consider reducing size of broadcasts in OrcRelation

2016-01-20 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12948:
-
Attachment: SPARK-12948_cpuProf.png

> Consider reducing size of broadcasts in OrcRelation
> ---
>
> Key: SPARK-12948
> URL: https://issues.apache.org/jira/browse/SPARK-12948
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: SPARK-12948_cpuProf.png
>
>
> Size of broadcasted data in OrcRelation was significantly higher when running 
> query with large number of partitions (e.g TPC-DS). Consider reducing the 
> size of the broadcasted data in OrcRelation, as it has an impact on the job 
> runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12925) Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject

2016-01-20 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12925:
-
Attachment: SPARK-12925_profiler_cpu_samples.png

> Improve HiveInspectors.unwrap for 
> StringObjectInspector.getPrimitiveWritableObject
> --
>
> Key: SPARK-12925
> URL: https://issues.apache.org/jira/browse/SPARK-12925
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: SPARK-12925_profiler_cpu_samples.png
>
>
> Text is in UTF-8 and converting it via "UTF8String.fromString" incurs 
> decoding and encoding, which turns out to be expensive. (to be specific: 
> https://github.com/apache/spark/blob/0d543b98f3e3da5053f0476f4647a765460861f3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L323)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12925) Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject

2016-01-20 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-12925:


 Summary: Improve HiveInspectors.unwrap for 
StringObjectInspector.getPrimitiveWritableObject
 Key: SPARK-12925
 URL: https://issues.apache.org/jira/browse/SPARK-12925
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rajesh Balamohan


Text is in UTF-8 and converting it via "UTF8String.fromString" incurs decoding 
and encoding, which turns out to be expensive. (to be specific: 
https://github.com/apache/spark/blob/0d543b98f3e3da5053f0476f4647a765460861f3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L323)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12898) Consider having dummyCallSite for HiveTableScan

2016-01-19 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12898:
-
Attachment: callsiteProf.png

> Consider having dummyCallSite for HiveTableScan
> ---
>
> Key: SPARK-12898
> URL: https://issues.apache.org/jira/browse/SPARK-12898
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: callsiteProf.png
>
>
> Currently, it runs with getCallSite which is really expensive and shows up 
> when scanning through large table with partitions (e.g TPC-DS). It would be 
> good to consider having dummyCallSite in HiveTableScan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12898) Consider having dummyCallSite for HiveTableScan

2016-01-19 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12898:
-
Attachment: (was: callsiteProf)

> Consider having dummyCallSite for HiveTableScan
> ---
>
> Key: SPARK-12898
> URL: https://issues.apache.org/jira/browse/SPARK-12898
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Currently, it runs with getCallSite which is really expensive and shows up 
> when scanning through large table with partitions (e.g TPC-DS). It would be 
> good to consider having dummyCallSite in HiveTableScan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12898) Consider having dummyCallSite for HiveTableScan

2016-01-19 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12898:
-
Attachment: callsiteProf

> Consider having dummyCallSite for HiveTableScan
> ---
>
> Key: SPARK-12898
> URL: https://issues.apache.org/jira/browse/SPARK-12898
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: callsiteProf
>
>
> Currently, it runs with getCallSite which is really expensive and shows up 
> when scanning through large table with partitions (e.g TPC-DS). It would be 
> good to consider having dummyCallSite in HiveTableScan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12920) Spark thrift server can run at very high CPU with concurrent users

2016-01-19 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-12920:


 Summary: Spark thrift server can run at very high CPU with 
concurrent users
 Key: SPARK-12920
 URL: https://issues.apache.org/jira/browse/SPARK-12920
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rajesh Balamohan


- Configured with the fair-share scheduler.
- 4-5 users submitting/running jobs concurrently via the spark-thrift-server.
- The Spark thrift server spikes to 1600+% CPU and stays there for a long time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12920) Spark thrift server can run at very high CPU with concurrent users

2016-01-19 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12920:
-
Attachment: SPARK-12920.profiler.png
SPARK-12920.profiler_job_progress_listner.png

- The cached RDDs would be different if users run different queries.
- This puts many objects in the cache, creating a lot of GC pressure (see the 2703 
MB cache size in the attached profile).
- Also, JobProgressListener caches StageInfo and Job objects. Setting 
spark.sql.ui.retainedExecutions=0, spark.ui.enabled=false, 
spark.ui.retainedStages=0, and spark.ui.retainedJobs=0 did not release the objects. 
This is because of 
https://github.com/apache/spark/blob/2b5d11f34d73eb7117c0c4668c1abb27dcc3a403/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L142
 (it releases only one entry when retainedStages=0, which causes memory build-up in 
a multi-user environment).




> Spark thrift server can run at very high CPU with concurrent users
> --
>
> Key: SPARK-12920
> URL: https://issues.apache.org/jira/browse/SPARK-12920
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: SPARK-12920.profiler.png, 
> SPARK-12920.profiler_job_progress_listner.png
>
>
> - Configured with fair-share-scheduler.
> - 4-5 users submitting/running jobs concurrently via spark-thrift-server
> - Spark thrift server spikes to 1600+% CPU and stays there for a long time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12898) Consider having dummyCallSite for HiveTableScan

2016-01-18 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-12898:


 Summary: Consider having dummyCallSite for HiveTableScan
 Key: SPARK-12898
 URL: https://issues.apache.org/jira/browse/SPARK-12898
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rajesh Balamohan


Currently, HiveTableScan runs with getCallSite, which is really expensive and shows 
up when scanning through a large table with many partitions (e.g. TPC-DS). It would 
be good to consider using a dummyCallSite in HiveTableScan.
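
A minimal sketch of the idea is given below. Utils.withDummyCallSite is 
private[spark], so this would live inside Spark; the buildScanRdd parameter is a 
hypothetical stand-in for the actual scan construction in HiveTableScan.

{noformat}
package org.apache.spark.sql.hive.execution

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.Utils

// Wrap RDD construction so the resulting RDDs record a dummy call site instead of
// walking the stack via getCallSite for every partition scan.
private[hive] object DummyCallSiteSketch {
  def scanWithDummyCallSite[T](sc: SparkContext)(buildScanRdd: => RDD[T]): RDD[T] =
    Utils.withDummyCallSite(sc) {
      buildScanRdd
    }
}
{noformat}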



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12803) Consider adding ability to profile specific instances of executors in spark

2016-01-14 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097836#comment-15097836
 ] 

Rajesh Balamohan commented on SPARK-12803:
--

Letting the profiler agent run on all executors and connecting to it works for very 
long-running jobs. When trying to profile a task in a short job (e.g. a 30-second 
job), it would take some time to locate the node and start profiling, and by that 
time the job itself would be over. Another option is to enable offline profiling 
(where the agent dumps profiler snapshots on exit or on a periodic basis), but that 
would also generate too many snapshots. Having the option mentioned in this ticket 
would help enable profiling in specific places on an as-needed basis.

> Consider adding ability to profile specific instances of executors in spark
> ---
>
> Key: SPARK-12803
> URL: https://issues.apache.org/jira/browse/SPARK-12803
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Reporter: Rajesh Balamohan
>
> It would be useful to profile specific instances of executors as opposed to 
> adding profiler details to all executors via 
> "spark.executor.extraJavaOptions".  
> Setting the number of executors to just 1 and profiling wouldn't be much 
> useful (in some cases, most of the time with single executor mode would be 
> spent in terms of reading data from remote node).  At the same time, setting 
> profiling option to all executors could just create too many number of 
> snapshots; making it harder to analyze.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12803) Consider adding ability to profile specific instances of executors in spark

2016-01-13 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-12803:


 Summary: Consider adding ability to profile specific instances of 
executors in spark
 Key: SPARK-12803
 URL: https://issues.apache.org/jira/browse/SPARK-12803
 Project: Spark
  Issue Type: Bug
  Components: Java API
Reporter: Rajesh Balamohan


It would be useful to profile specific instances of executors as opposed to adding 
profiler details to all executors via "spark.executor.extraJavaOptions".

Setting the number of executors to just 1 and profiling wouldn't be very useful (in 
some cases, most of the time in single-executor mode would be spent reading data 
from remote nodes). At the same time, setting the profiling option on all executors 
could create too many snapshots, making them harder to analyze.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12803) Consider adding ability to profile specific instances of executors in spark

2016-01-13 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12803:
-
Issue Type: Improvement  (was: Bug)

> Consider adding ability to profile specific instances of executors in spark
> ---
>
> Key: SPARK-12803
> URL: https://issues.apache.org/jira/browse/SPARK-12803
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Reporter: Rajesh Balamohan
>
> It would be useful to profile specific instances of executors as opposed to 
> adding profiler details to all executors via 
> "spark.executor.extraJavaOptions".  
> Setting the number of executors to just 1 and profiling wouldn't be much 
> useful (in some cases, most of the time with single executor mode would be 
> spent in terms of reading data from remote node).  At the same time, setting 
> profiling option to all executors could just create too many number of 
> snapshots; making it harder to analyze.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12803) Consider adding ability to profile specific instances of executors in spark

2016-01-13 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097091#comment-15097091
 ] 

Rajesh Balamohan commented on SPARK-12803:
--

It is for connecting to the profiler. Adding profiler options with 
"spark.executor.extraJavaOptions" ends up adding them to all executors, which is not 
useful if the cluster has 100 executors; there are scenarios where one wants to 
profile only 1 or 2 executors in the cluster.

Ideally there would be an option to enable profiling only for specific tasks in 
specific stages (e.g. enable profiling on task 10 in stage 5, which is performing 
badly, without enabling profiling on all executors). I am not sure if this can be 
supported at this time.

> Consider adding ability to profile specific instances of executors in spark
> ---
>
> Key: SPARK-12803
> URL: https://issues.apache.org/jira/browse/SPARK-12803
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Reporter: Rajesh Balamohan
>
> It would be useful to profile specific instances of executors as opposed to 
> adding profiler details to all executors via 
> "spark.executor.extraJavaOptions".  
> Setting the number of executors to just 1 and profiling wouldn't be much 
> useful (in some cases, most of the time with single executor mode would be 
> spent in terms of reading data from remote node).  At the same time, setting 
> profiling option to all executors could just create too many number of 
> snapshots; making it harder to analyze.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12417) Orc bloom filter options are not propagated during file write in spark

2015-12-17 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-12417:


 Summary: Orc bloom filter options are not propagated during file 
write in spark
 Key: SPARK-12417
 URL: https://issues.apache.org/jira/browse/SPARK-12417
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rajesh Balamohan


ORC bloom filters are supported by the version of Hive used in Spark 1.5.2. However, 
when trying to create an ORC file with the bloom filter option, Spark does not make 
use of it.

E.g., the following ORC write does not create the bloom filter even though the 
options are specified.
{noformat}
Map<String, String> orcOption = new HashMap<>();
orcOption.put("orc.bloom.filter.columns", "*");
hiveContext.sql("select * from accounts where effective_date='2015-12-30'").write()
    .format("orc").options(orcOption).save("/tmp/accounts");
{noformat}
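
For reference, a Scala sketch of the same write, assuming the same hiveContext as in 
the Java snippet above. "orc.bloom.filter.fpp" is the companion ORC table property 
for the bloom filter false-positive rate and is included only as an illustration.

{noformat}
// Scala equivalent of the Java snippet; whether these options actually reach the
// ORC writer is exactly what this JIRA tracks.
val orcOptions = Map(
  "orc.bloom.filter.columns" -> "*",
  "orc.bloom.filter.fpp" -> "0.05")

hiveContext.sql("select * from accounts where effective_date='2015-12-30'")
  .write
  .format("orc")
  .options(orcOptions)
  .save("/tmp/accounts")
{noformat}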



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12417) Orc bloom filter options are not propagated during file write in spark

2015-12-17 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12417:
-
Attachment: SPARK-12417.1.patch

> Orc bloom filter options are not propagated during file write in spark
> --
>
> Key: SPARK-12417
> URL: https://issues.apache.org/jira/browse/SPARK-12417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: SPARK-12417.1.patch
>
>
> ORC bloom filter is supported by the version of hive used in Spark 1.5.2. 
> However, when trying to create orc file with bloom filter option, it does not 
> make use of it.
> E.g, following orc output does not create the bloom filter even though the 
> options are specified.
> {noformat}
> Map<String, String> orcOption = new HashMap<>();
> orcOption.put("orc.bloom.filter.columns", "*");
> hiveContext.sql("select * from accounts where 
> effective_date='2015-12-30'").write().
> format("orc").options(orcOption).save("/tmp/accounts");
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org