RE: Spark-Locality: Hinting Spark location of the executor does not take effect.

2020-09-18 Thread Nasrulla Khan Haris
I was providing an IP address instead of an FQDN; providing the FQDN helped.

Thanks,

From: Nasrulla Khan Haris
Sent: Wednesday, September 16, 2020 4:11 PM
To: dev@spark.apache.org
Subject: Spark-Locality: Hinting Spark location of the executor does not take 
effect.

HI Spark developers,

I want to hint Spark to use a particular list of hosts to execute tasks on. I
see that getBlockLocations is used to get the list of hosts from HDFS:

https://github.com/apache/spark/blob/7955b3962ac46b89564e0613db7bea98a1478bf2/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L386


Hinting Spark with a custom getBlockLocations that returns an Array of
BlockLocations containing host IP addresses doesn't help; Spark continues to
schedule the tasks on other executor hosts.

Is there something I am doing wrong?

Test:
spark.read.csv()
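
For reference, a minimal sketch of the custom block-location hint described
above, returning FQDNs rather than IP addresses (which is what resolved this
thread). The class name, the port, and the hostsFor mapping are hypothetical
placeholders, not a confirmed implementation.

import org.apache.hadoop.fs.{BlockLocation, FileStatus, RawLocalFileSystem}

// Hypothetical wrapper FS: report FQDNs as block hosts so that Spark's
// DataSourceScanExec can pick them up as preferred task locations.
class LocalityHintFileSystem extends RawLocalFileSystem {

  // Placeholder mapping from a file to the executors' fully qualified names.
  private def hostsFor(file: FileStatus): Array[String] =
    Array("worker1.mycluster.internal", "worker2.mycluster.internal")

  override def getFileBlockLocations(
      file: FileStatus, start: Long, len: Long): Array[BlockLocation] = {
    val hosts = hostsFor(file)
    val names = hosts.map(h => s"$h:9866") // "host:port" form; the port is illustrative
    Array(new BlockLocation(names, hosts, start, len))
  }
}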


Appreciate your inputs 

Thanks,
Nasrulla



Spark-Locality: Hinting Spark location of the executor does not take effect.

2020-09-16 Thread Nasrulla Khan Haris
HI Spark developers,

I want to hint Spark to use a particular list of hosts to execute tasks on. I
see that getBlockLocations is used to get the list of hosts from HDFS:

https://github.com/apache/spark/blob/7955b3962ac46b89564e0613db7bea98a1478bf2/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L386


Hinting Spark with a custom getBlockLocations that returns an Array of
BlockLocations containing host IP addresses doesn't help; Spark continues to
schedule the tasks on other executor hosts.

Is there something I am doing wrong?

Test:
spark.read.csv()


Appreciate your inputs 

Thanks,
Nasrulla




Unable to run bash script when using spark-submit in cluster mode.

2020-07-23 Thread Nasrulla Khan Haris
Hi Spark Users,


I am trying to execute a bash script from my Spark app. I can run the command
below without issues from spark-shell; however, when I use it in the Spark app
and submit it with spark-submit, the container is not able to find the
directories.

val result = "export LD_LIBRARY_PATH=/ binaries/ && /binaries/generatedata 
simulate -rows 1000 -payload 32 -file MyFile1" !!


Any inputs on how to make the script visible to the Spark executors?
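
One approach that may help, sketched under the assumption that the binary is
shipped to the executors with --files /binaries/generatedata (or sc.addFile)
and that the command is invoked from inside a task; the paths and arguments
simply mirror the snippet above.

import scala.sys.process._
import org.apache.spark.SparkFiles

// SparkFiles.get resolves the executor-local copy of a file shipped
// with --files or SparkContext.addFile.
val exe = SparkFiles.get("generatedata")
val workDir = SparkFiles.getRootDirectory()

// Go through bash -c so that the env export and && are interpreted by a shell.
val cmd = Seq("bash", "-c",
  s"export LD_LIBRARY_PATH=$workDir && chmod +x $exe && " +
  s"$exe simulate -rows 1000 -payload 32 -file MyFile1")

val result = cmd.!!  // run inside rdd.mapPartitions { ... } to execute on the executors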


Thanks,
Nasrulla



RE: Unable to run bash script when using spark-submit in cluster mode.

2020-07-23 Thread Nasrulla Khan Haris
Are local paths not exposed in the containers?

Thanks,
Nasrulla

From: Nasrulla Khan Haris
Sent: Thursday, July 23, 2020 6:13 PM
To: user@spark.apache.org
Subject: Unable to run bash script when using spark-submit in cluster mode.
Importance: High

Hi Spark Users,


I am trying to execute a bash script from my Spark app. I can run the command
below without issues from spark-shell; however, when I use it in the Spark app
and submit it with spark-submit, the container is not able to find the
directories.

val result = "export LD_LIBRARY_PATH=/ binaries/ && /binaries/generatedata 
simulate -rows 1000 -payload 32 -file MyFile1" !!


Any inputs on how to make the script visible to the Spark executors?


Thanks,
Nasrulla



RE: UnknownSource NullPointerException in CodeGen. with Custom Strategy

2020-06-28 Thread Nasrulla Khan Haris
Stack trace with WSCG (whole-stage codegen) disabled:


scala> df29.groupBy("LastName").count().show()
20/06/28 06:20:55 WARN TaskSetManager: Lost task 1.0 in stage 2.0 (TID 8, 
wn5-nkhwes.zhqzi2stszlevpekfsrlmpqjah.dx.internal.cloudapp.net, executor 4): 
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:193)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:360)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:112)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:102)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:122)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Thanks,
Nasrulla

From: Nasrulla Khan Haris
Sent: Saturday, June 27, 2020 11:18 PM
To: dev@spark.apache.org
Subject: UnknownSource NullPointerException in CodeGen. with Custom Strategy

HI Spark Developers,

I am encountering this NullPointerException while reading a Parquet file on a
multi-node cluster. However, when running the Spark job locally on a single
node (development environment), I do not encounter this error. Appreciate your
inputs.

Thanks in advance,
NKH

pqjah.dx.internal.cloudapp.net, executor 1): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1$agg_FastHashMap_0.hash$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1$agg_FastHashMap_0.findOrInsert(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doConsume_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:122)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
  at 
org.apache.spark.scheduler

UnknownSource NullPointerException in CodeGen. with Custom Strategy

2020-06-28 Thread Nasrulla Khan Haris
HI Spark Developers,

I am encountering this NullPointerException while reading a Parquet file on a
multi-node cluster. However, when running the Spark job locally on a single
node (development environment), I do not encounter this error. Appreciate your
inputs.

Thanks in advance,
NKH

pqjah.dx.internal.cloudapp.net, executor 1): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1$agg_FastHashMap_0.hash$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1$agg_FastHashMap_0.findOrInsert(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doConsume_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:122)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2065)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2086)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2105)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
  at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:751)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:710)
 

Datasource with ColumnBatchScan support.

2020-06-15 Thread Nasrulla Khan Haris
HI Spark developers,

FileSourceScanExec extends ColumnarBatchScan, which internally converts
ColumnarBatch to InternalRows. If I have a new Datasource/FileFormat that uses
a custom relation instead of HadoopFsRelation, the driver uses
RowDataSourceScanExec, and this causes a cast exception from InternalRow to
ColumnarBatch. Is there a way to provide ColumnarBatchScan support to a custom
relation?
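
For comparison, the Spark 3 DataSource V2 connector API does expose an
explicit columnar path through PartitionReaderFactory. The sketch below is
that V2 route rather than a way to retrofit ColumnarBatchScan onto a V1 custom
relation, and the factory name and makeBatchReader helper are hypothetical.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical factory: makeBatchReader would wrap the vectorized reader.
class ColumnarOnlyReaderFactory(
    makeBatchReader: InputPartition => PartitionReader[ColumnarBatch])
  extends PartitionReaderFactory {

  // Tells Spark this source produces ColumnarBatch, so a columnar scan is planned.
  override def supportColumnarReads(partition: InputPartition): Boolean = true

  override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] =
    makeBatchReader(partition)

  // The row-based path is not used when supportColumnarReads returns true.
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    throw new UnsupportedOperationException("This source only supports columnar reads")
}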

Appreciate your inputs.

Thanks,
NKH



RE: [EXTERNAL] Re: ColumnnarBatch to InternalRow Cast exception with codegen enabled.

2020-06-12 Thread Nasrulla Khan Haris

Thanks, Kris, for your inputs. Yes, I have a new data source which wraps
around the built-in Parquet data source. What I do not understand is that with
WSCG disabled the output is not a columnar batch; if my changes do not handle
columnar support, shouldn't the behavior remain the same with or without WSCG?



From: Kris Mo 
Sent: Friday, June 12, 2020 2:20 AM
To: Nasrulla Khan Haris 
Cc: dev@spark.apache.org
Subject: [EXTERNAL] Re: ColumnnarBatch to InternalRow Cast exception with 
codegen enabled.

Hi Nasrulla,

Not sure what your new code is doing, but the symptom looks like you're 
creating a new data source that wraps around the builtin Parquet data source?

The problem here is that whole-stage codegen generated code for row-based
input, but the actual input is columnar.
In other words, in your setup, the vectorized Parquet reader is enabled (which 
produces columnar output), and you probably wrote a new operator that didn't 
properly interact with the columnar support, so that WSCG thought it should 
generate row-based code instead of columnar code.
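
As a hedged aside to the explanation above: while the custom operator's
columnar handling is being fixed, the row-based fallback can also be forced
through standard Spark SQL configs (assuming spark is the active SparkSession):

// Make the built-in Parquet reader produce InternalRow instead of ColumnarBatch.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Or, as already observed in this thread, disable whole-stage codegen entirely.
spark.conf.set("spark.sql.codegen.wholeStage", "false")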

Hope it helps,
Kris
--

Kris Mok

Software Engineer Databricks Inc.

kris@databricks.com<mailto:kris@databricks.com>

databricks.com<https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fdatabricks.com%2F=02%7C01%7CNasrulla.Khan%40microsoft.com%7C4755c6eb23a245f62f8c08d80eb1da53%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637275504329646919=PkkwqVJgkcR92uhEujWmCpuMFg9gLXNzXLHwZOr%2B1bA%3D=0>

 [Image removed by sender.]
<https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fdatabricks.com%2F=02%7C01%7CNasrulla.Khan%40microsoft.com%7C4755c6eb23a245f62f8c08d80eb1da53%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637275504329656882=h4GnaCoU6Dc8DAx2boaXLKOW089%2BCZtqYsSYid0%2F22g%3D=0>


On Thu, Jun 11, 2020 at 5:41 PM Nasrulla Khan Haris 
 wrote:
HI Spark developer,

I have a new BaseRelation which initializes a ParquetFileFormat object, and
when reading the data I encounter the cast exception below; however, when I
disable codegen support with the config "spark.sql.codegen.wholeStage" =
false, I do not encounter this exception.


20/06/11 17:35:39 INFO FileScanRDD: Reading File path: file:///D:/ 
jvm/src/test/scala/resources/pems_sorted/station=402260/part-r-00245-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet,
 range: 0-50936, partition values: [402260]
20/06/11 17:35:39 INFO CodecPool: Got brand-new decompressor [.snappy]
20/06/11 17:35:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch 
cannot be cast to org.apache.spark.sql.catalyst.InternalRow
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


Appreciate your inputs.

Thanks,
NKH


RE: [EXTERNAL] Re: ColumnnarBatch to InternalRow Cast exception with codegen enabled.

2020-06-12 Thread Nasrulla Khan Haris
Thanks, Kris, for your inputs. Yes, I have a new data source which wraps
around the built-in Parquet data source. What I do not understand is that with
WSCG disabled, the output is not a columnar batch.


From: Kris Mo 
Sent: Friday, June 12, 2020 2:20 AM
To: Nasrulla Khan Haris 
Cc: dev@spark.apache.org
Subject: [EXTERNAL] Re: ColumnnarBatch to InternalRow Cast exception with 
codegen enabled.

Hi Nasrulla,

Not sure what your new code is doing, but the symptom looks like you're 
creating a new data source that wraps around the builtin Parquet data source?

The problem here is that whole-stage codegen generated code for row-based
input, but the actual input is columnar.
In other words, in your setup, the vectorized Parquet reader is enabled (which 
produces columnar output), and you probably wrote a new operator that didn't 
properly interact with the columnar support, so that WSCG thought it should 
generate row-based code instead of columnar code.

Hope it helps,
Kris
--

Kris Mok

Software Engineer Databricks Inc.

kris@databricks.com<mailto:kris@databricks.com>

databricks.com<https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fdatabricks.com%2F=02%7C01%7CNasrulla.Khan%40microsoft.com%7C4755c6eb23a245f62f8c08d80eb1da53%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637275504329646919=PkkwqVJgkcR92uhEujWmCpuMFg9gLXNzXLHwZOr%2B1bA%3D=0>

 [Image removed by sender.]
<https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fdatabricks.com%2F=02%7C01%7CNasrulla.Khan%40microsoft.com%7C4755c6eb23a245f62f8c08d80eb1da53%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637275504329656882=h4GnaCoU6Dc8DAx2boaXLKOW089%2BCZtqYsSYid0%2F22g%3D=0>


On Thu, Jun 11, 2020 at 5:41 PM Nasrulla Khan Haris 
 wrote:
HI Spark developer,

I have a new BaseRelation which initializes a ParquetFileFormat object, and
when reading the data I encounter the cast exception below; however, when I
disable codegen support with the config "spark.sql.codegen.wholeStage" =
false, I do not encounter this exception.


20/06/11 17:35:39 INFO FileScanRDD: Reading File path: file:///D:/ 
jvm/src/test/scala/resources/pems_sorted/station=402260/part-r-00245-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet,
 range: 0-50936, partition values: [402260]
20/06/11 17:35:39 INFO CodecPool: Got brand-new decompressor [.snappy]
20/06/11 17:35:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch 
cannot be cast to org.apache.spark.sql.catalyst.InternalRow
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


Appreciate your inputs.

Thanks,
NKH


ColumnnarBatch to InternalRow Cast exception with codegen enabled.

2020-06-11 Thread Nasrulla Khan Haris
HI Spark developer,

I have a new BaseRelation which initializes a ParquetFileFormat object, and
when reading the data I encounter the cast exception below; however, when I
disable codegen support with the config "spark.sql.codegen.wholeStage" =
false, I do not encounter this exception.


20/06/11 17:35:39 INFO FileScanRDD: Reading File path: file:///D:/ 
jvm/src/test/scala/resources/pems_sorted/station=402260/part-r-00245-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet,
 range: 0-50936, partition values: [402260]
20/06/11 17:35:39 INFO CodecPool: Got brand-new decompressor [.snappy]
20/06/11 17:35:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch 
cannot be cast to org.apache.spark.sql.catalyst.InternalRow
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


Appreciate your inputs.

Thanks,
NKH


RE: Does Spark SQL support GRANT/REVOKE operations on Tables?

2020-06-10 Thread Nasrulla Khan Haris
I did enable the auth-related configs in hive-site.xml as per the document
below.

I tried this on Spark 2.4.4. Is it supported?
https://cwiki.apache.org/confluence/display/Hive/Storage+Based+Authorization+in+the+Metastore+Server



From: Nasrulla Khan Haris
Sent: Wednesday, June 10, 2020 5:55 PM
To: user@spark.apache.org
Subject: Does Spark SQL support GRANT/REVOKE operations on Tables?

HI Spark users,

I see REVOKE/GRANT operations in the list of supported operations, but when I
run them on a table I see:

Error: org.apache.spark.sql.catalyst.parser.ParseException:
Operation not allowed: GRANT(line 1, pos 0)


== SQL ==
GRANT INSERT ON table_priv1 TO USER user2
^^^

  at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:41)
  at 
org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitFailNativeCommand$1.apply(SparkSqlParser.scala:1047)
  at 
org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitFailNativeCommand$1.apply(SparkSqlParser.scala:1038)
  at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:108)
  at 
org.apache.spark.sql.execution.SparkSqlAstBuilder.visitFailNativeCommand(SparkSqlParser.scala:1038)
  at 
org.apache.spark.sql.execution.SparkSqlAstBuilder.visitFailNativeCommand(SparkSqlParser.scala:55)
  at 
org.apache.spark.sql.catalyst.parser.SqlBaseParser$FailNativeCommandContext.accept(SqlBaseParser.java:782)
  at 
org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:18)
  at 
org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:72)
  at 
org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:72)
  at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:108)
  at 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:71)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:70)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:69)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:100)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
  ... 49 elided


Error: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'dir' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC', 'WITH', 
'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'EXPLAIN', 'SHOW', 
'USE', 'DROP', 'ALTER', 'MAP', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 
'REDUCE', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'DFS', 'TRUNCATE', 'ANALYZE', 
'LIST', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 
'LOAD'}(line 1, pos 0)

== SQL ==



Appreciate your response.

Thanks,
NKH


Does Spark SQL support GRANT/REVOKE operations on Tables?

2020-06-10 Thread Nasrulla Khan Haris
HI Spark users,

I see REVOKE/GRANT operations in the list of supported operations, but when I
run them on a table I see:

Error: org.apache.spark.sql.catalyst.parser.ParseException:
Operation not allowed: GRANT(line 1, pos 0)


== SQL ==
GRANT INSERT ON table_priv1 TO USER user2
^^^

  at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:41)
  at 
org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitFailNativeCommand$1.apply(SparkSqlParser.scala:1047)
  at 
org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitFailNativeCommand$1.apply(SparkSqlParser.scala:1038)
  at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:108)
  at 
org.apache.spark.sql.execution.SparkSqlAstBuilder.visitFailNativeCommand(SparkSqlParser.scala:1038)
  at 
org.apache.spark.sql.execution.SparkSqlAstBuilder.visitFailNativeCommand(SparkSqlParser.scala:55)
  at 
org.apache.spark.sql.catalyst.parser.SqlBaseParser$FailNativeCommandContext.accept(SqlBaseParser.java:782)
  at 
org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:18)
  at 
org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:72)
  at 
org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:72)
  at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:108)
  at 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:71)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:70)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:69)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:100)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
  ... 49 elided


Error: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'dir' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC', 'WITH', 
'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'EXPLAIN', 'SHOW', 
'USE', 'DROP', 'ALTER', 'MAP', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 
'REDUCE', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'DFS', 'TRUNCATE', 'ANALYZE', 
'LIST', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 
'LOAD'}(line 1, pos 0)

== SQL ==



Appreciate your response.

Thanks,
NKH


preferredlocations for hadoopfsrelations based baseRelations

2020-06-04 Thread Nasrulla Khan Haris
HI Spark developers,

I have created a new format extending FileFormat. I see that
getPreferredLocations is available if a new custom RDD is created. Since the
file format path is based on FileScanRDD, which uses the readFile method to
read each partitioned file, is there a way to add the desired
preferredLocations?
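
Not a confirmed answer, but one sketch of the custom-RDD route mentioned
above: a thin wrapper RDD that overrides getPreferredLocations. The class name
and the locationsFor mapping are hypothetical; FileScanRDD itself is untouched.

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Wraps an existing RDD and pins the preferred hosts for each partition.
class PreferredLocationRDD[T: ClassTag](
    prev: RDD[T],
    locationsFor: Partition => Seq[String]) extends RDD[T](prev) {

  override protected def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(split, context)

  override protected def getPreferredLocations(split: Partition): Seq[String] =
    locationsFor(split)
}

// Example use: new PreferredLocationRDD(baseRdd, _ => Seq("worker1.mycluster.internal"))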

Appreciate your responses.

Thanks,
NKH



RE: Adding Custom finalize method to RDDs.

2019-06-12 Thread Nasrulla Khan Haris
We have no control over when an RDD goes out of scope in memory, as that is
handled by the JVM, so I am not sure try/finally will help. I want some
mechanism to clean up temporary data created by an RDD as soon as the RDD goes
out of scope.

Any ideas ?

Thanks,
Nasrulla

From: Phillip Henry 
Sent: Tuesday, June 11, 2019 11:28 PM
To: Nasrulla Khan Haris 
Cc: Vinoo Ganesh ; dev@spark.apache.org
Subject: Re: Adding Custom finalize method to RDDs.

That's not the kind of thing a finalize method was ever supposed to do.

Use a try/finally block instead.

Phillip


On Wed, 12 Jun 2019, 00:01 Nasrulla Khan Haris
<nasrulla.k...@microsoft.com.invalid> wrote:
I want to delete some files which I created in my datasource API, as soon as
the RDD is cleaned up.

Thanks,
Nasrulla

From: Vinoo Ganesh <vgan...@palantir.com>
Sent: Monday, June 10, 2019 1:32 PM
To: Nasrulla Khan Haris <nasrulla.k...@microsoft.com.INVALID>;
dev@spark.apache.org
Subject: Re: Adding Custom finalize method to RDDs.

Generally overriding the finalize() method is an antipattern (it was in fact 
deprecated in java 11  
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Object.html#finalize())
 . What’s the use case here?

From: Nasrulla Khan Haris <nasrulla.k...@microsoft.com.INVALID>
Date: Monday, June 10, 2019 at 15:44
To: "dev@spark.apache.org<mailto:dev@spark.apache.org>" 
mailto:dev@spark.apache.org>>
Subject: RE: Adding Custom finalize method to RDDs.

Hello Everyone,
Is there a way  to do it from user-code ?

Thanks,
Nasrulla

From: Nasrulla Khan Haris <nasrulla.k...@microsoft.com.INVALID>
Sent: Sunday, June 9, 2019 5:30 PM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Adding Custom finalize method to RDDs.

Hi All,

Is there a way to add custom finalize method to RDD objects to add custom logic 
when RDDs are destructed by JVM ?

Thanks,
Nasrulla



RE: Adding Custom finalize method to RDDs.

2019-06-11 Thread Nasrulla Khan Haris
I want to delete some files which I created in my datasource API, as soon as
the RDD is cleaned up.

Thanks,
Nasrulla

From: Vinoo Ganesh 
Sent: Monday, June 10, 2019 1:32 PM
To: Nasrulla Khan Haris ; 
dev@spark.apache.org
Subject: Re: Adding Custom finalize method to RDDs.

Generally overriding the finalize() method is an antipattern (it was in fact 
deprecated in java 11  
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Object.html#finalize())
 . What’s the use case here?

From: Nasrulla Khan Haris <nasrulla.k...@microsoft.com.INVALID>
Date: Monday, June 10, 2019 at 15:44
To: "dev@spark.apache.org<mailto:dev@spark.apache.org>" 
mailto:dev@spark.apache.org>>
Subject: RE: Adding Custom finalize method to RDDs.

Hello Everyone,
Is there a way  to do it from user-code ?

Thanks,
Nasrulla

From: Nasrulla Khan Haris <nasrulla.k...@microsoft.com.INVALID>
Sent: Sunday, June 9, 2019 5:30 PM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Adding Custom finalize method to RDDs.

Hi All,

Is there a way to add custom finalize method to RDD objects to add custom logic 
when RDDs are destructed by JVM ?

Thanks,
Nasrulla



RE: Adding Custom finalize method to RDDs.

2019-06-10 Thread Nasrulla Khan Haris
Hello everyone,
Is there a way to do it from user code?

Thanks,
Nasrulla

From: Nasrulla Khan Haris 
Sent: Sunday, June 9, 2019 5:30 PM
To: dev@spark.apache.org
Subject: Adding Custom finalize method to RDDs.

Hi All,

Is there a way to add custom finalize method to RDD objects to add custom logic 
when RDDs are destructed by JVM ?

Thanks,
Nasrulla



Adding Custom finalize method to RDDs.

2019-06-09 Thread Nasrulla Khan Haris
Hi All,

Is there a way to add a custom finalize method to RDD objects, to add custom
logic when RDDs are destroyed by the JVM?

Thanks,
Nasrulla



RE: RDD object Out of scope.

2019-05-21 Thread Nasrulla Khan Haris
Thanks Sean, that makes sense. 

Regards,
Nasrulla

-Original Message-
From: Sean Owen  
Sent: Tuesday, May 21, 2019 6:24 PM
To: Nasrulla Khan Haris 
Cc: dev@spark.apache.org
Subject: Re: RDD object Out of scope.

I'm not clear what you're asking. An RDD itself is just an object in the JVM.
It will be garbage collected if there are no references. What else would there
be to clean up in your case? ContextCleaner handles cleanup of persisted
RDDs, etc.
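
(A small illustration of that point, assuming a live SparkContext named sc: an
uncached RDD needs no cleanup beyond normal GC, while a cached one can be
released deterministically with unpersist instead of waiting for GC plus the
ContextCleaner.)

val rdd = sc.parallelize(1 to 100).cache()
rdd.count()                      // materializes the cached blocks
rdd.unpersist(blocking = true)   // frees them immediately, no finalizer needed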

On Tue, May 21, 2019 at 7:39 PM Nasrulla Khan Haris 
 wrote:
>
> I am trying to find the code that cleans up uncached RDD.
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
> From: Charoes 
> Sent: Tuesday, May 21, 2019 5:10 PM
> To: Nasrulla Khan Haris 
> Cc: Wenchen Fan ; dev@spark.apache.org
> Subject: Re: RDD object Out of scope.
>
>
>
> If you cached a RDD and hold a reference of that RDD in your code, then your 
> RDD will NOT be cleaned up.
>
> There is a ReferenceQueue in ContextCleaner, which is used to keep tracking 
> the reference of RDD, Broadcast, and Accumulator etc.
>
>
>
> On Wed, May 22, 2019 at 1:07 AM Nasrulla Khan Haris 
>  wrote:
>
> Thanks for reply Wenchen, I am curious as what happens when RDD goes out of 
> scope when it is not cached.
>
>
>
> Nasrulla
>
>
>
> From: Wenchen Fan 
> Sent: Tuesday, May 21, 2019 6:28 AM
> To: Nasrulla Khan Haris 
> Cc: dev@spark.apache.org
> Subject: Re: RDD object Out of scope.
>
>
>
> RDD is kind of a pointer to the actual data. Unless it's cached, we don't 
> need to clean up the RDD.
>
>
>
> On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris 
>  wrote:
>
> HI Spark developers,
>
>
>
> Can someone point out the code where RDD objects go out of scope ?. I found 
> the contextcleaner code in which only persisted RDDs are cleaned up in 
> regular intervals if the RDD is registered to cleanup. I have not found where 
> the destructor for RDD object is invoked. I am trying to understand when RDD 
> cleanup happens when the RDD is not persisted.
>
>
>
> Thanks in advance, appreciate your help.
>
> Nasrulla
>
>


RE: RDD object Out of scope.

2019-05-21 Thread Nasrulla Khan Haris
I am trying to find the code that cleans up uncached RDDs.

Thanks,
Nasrulla

From: Charoes 
Sent: Tuesday, May 21, 2019 5:10 PM
To: Nasrulla Khan Haris 
Cc: Wenchen Fan ; dev@spark.apache.org
Subject: Re: RDD object Out of scope.

If you cached an RDD and hold a reference to that RDD in your code, then your
RDD will NOT be cleaned up.
There is a ReferenceQueue in ContextCleaner, which is used to keep track of
the references of RDDs, Broadcasts, Accumulators, etc.

On Wed, May 22, 2019 at 1:07 AM Nasrulla Khan Haris
<nasrulla.k...@microsoft.com.invalid> wrote:
Thanks for the reply, Wenchen. I am curious what happens when an RDD goes out
of scope when it is not cached.

Nasrulla

From: Wenchen Fan <cloud0...@gmail.com>
Sent: Tuesday, May 21, 2019 6:28 AM
To: Nasrulla Khan Haris <nasrulla.k...@microsoft.com.invalid>
Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: RDD object Out of scope.

RDD is kind of a pointer to the actual data. Unless it's cached, we don't need 
to clean up the RDD.

On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris
<nasrulla.k...@microsoft.com.invalid> wrote:
HI Spark developers,

Can someone point out the code where RDD objects go out of scope? I found the
ContextCleaner code
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L178),
in which only persisted RDDs are cleaned up at regular intervals if the RDD is
registered for cleanup. I have not found where the destructor for an RDD
object is invoked. I am trying to understand when RDD cleanup happens when the
RDD is not persisted.

Thanks in advance, appreciate your help.
Nasrulla



RE: RDD object Out of scope.

2019-05-21 Thread Nasrulla Khan Haris
Thanks for the reply, Wenchen. I am curious what happens when an RDD goes out
of scope when it is not cached.

Nasrulla

From: Wenchen Fan 
Sent: Tuesday, May 21, 2019 6:28 AM
To: Nasrulla Khan Haris 
Cc: dev@spark.apache.org
Subject: Re: RDD object Out of scope.

RDD is kind of a pointer to the actual data. Unless it's cached, we don't need 
to clean up the RDD.

On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris
<nasrulla.k...@microsoft.com.invalid> wrote:
HI Spark developers,

Can someone point out the code where RDD objects go out of scope? I found the
ContextCleaner code
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L178),
in which only persisted RDDs are cleaned up at regular intervals if the RDD is
registered for cleanup. I have not found where the destructor for an RDD
object is invoked. I am trying to understand when RDD cleanup happens when the
RDD is not persisted.

Thanks in advance, appreciate your help.
Nasrulla



RDD object Out of scope.

2019-05-20 Thread Nasrulla Khan Haris
HI Spark developers,

Can someone point out the code where RDD objects go out of scope? I found the
ContextCleaner code
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L178),
in which only persisted RDDs are cleaned up at regular intervals if the RDD is
registered for cleanup. I have not found where the destructor for an RDD
object is invoked. I am trying to understand when RDD cleanup happens when the
RDD is not persisted.

Thanks in advance, appreciate your help.
Nasrulla



RE: adding shutdownmanagerhook to spark.

2019-05-13 Thread Nasrulla Khan Haris
Thanks Sean

Nasrulla

-Original Message-
From: Sean Owen  
Sent: Monday, May 13, 2019 4:16 PM
To: Nasrulla Khan Haris 
Cc: dev@spark.apache.org
Subject: Re: adding shutdownmanagerhook to spark.

Spark just adds a hook to the mechanism that Hadoop exposes. You can do the 
same. You shouldn't use Spark's.
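
A minimal sketch of that suggestion, using Hadoop's public hook manager
instead of Spark's package-private one; the priority value and the cleanup
body are placeholders.

import org.apache.hadoop.util.ShutdownHookManager

ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    // delete the connector's temporary files here
  }
}, 50)  // priority; higher-priority hooks run earlier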

On Mon, May 13, 2019 at 6:11 PM Nasrulla Khan Haris 
 wrote:
>
> HI All,
>
>
>
> I am trying to add shutdown hook, but looks like shutdown manager object 
> requires the package to be spark only, is there any other API that can help 
> me to do this ?
>
> https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala
>
>
>
> I can see that hiveserver2 in 
> https://github.com/apache/spark/commit/515708d5f33d5acdb4206c626192d1838f8e691f
>  uses ShutdownHookManager in different package. But It expects me to have the 
> package name “org.apache.spark”
>
>
>
> Thanks,
>
> Nasrulla
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



adding shutdownmanagerhook to spark.

2019-05-13 Thread Nasrulla Khan Haris
HI All,

I am trying to add a shutdown hook, but it looks like Spark's
ShutdownHookManager object requires the calling code to be in a Spark-only
package. Is there any other API that can help me do this?
https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala

I can see that HiveServer2 in
https://github.com/apache/spark/commit/515708d5f33d5acdb4206c626192d1838f8e691f
uses ShutdownHookManager in a different package, but it expects me to have the
package name "org.apache.spark".

Thanks,
Nasrulla



API for SparkContext ?

2019-05-13 Thread Nasrulla Khan Haris
HI All,

Is there an API on SparkContext where we can add our custom code before the
SparkContext is stopped?
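
One option that may fit, sketched with the public addSparkListener API; the
cleanup body is a placeholder.

import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

sc.addSparkListener(new SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    // custom code that runs as the application ends (i.e. when the
    // SparkContext is being stopped), e.g. deleting temporary files
  }
})
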
Appreciate your help.
Thanks,
Nasrulla



Need to clean up temporary files created post dataframe deletion.

2019-05-12 Thread Nasrulla Khan Haris
HI All,

I am new to Apache Spark core. I have a custom DataSource V2 connector for
Spark, and I am looking for suggestions on deleting temp files created by the
connector.

Any ideas would helpful.

Thanks for your help in advance.
Nasrulla


RE: Need guidance on Spark Session Termination.

2019-05-08 Thread Nasrulla Khan Haris
HI All,

Any Inputs here ?

Thanks,
Nasrulla

From: Nasrulla Khan Haris 
Sent: Tuesday, May 7, 2019 3:58 PM
To: dev@spark.apache.org
Subject: Need guidance on Spark Session Termination.

Hi fellow Spark-devs,

I am pretty new to Spark core and I am looking for some answers for my use
case. I have a DataSource V2 API connector; in my connector we create
temporary files on blob storage. Can you please suggest places where I can
look if I want to delete the temporary files on storage at the end of the
Spark session?

Thanks,
Nasrulla



Need guidance on Spark Session Termination.

2019-05-07 Thread Nasrulla Khan Haris
Hi fellow Spark-devs,

I am pretty new to Spark core and I am looking for some answers for my use
case. I have a DataSource V2 API connector; in my connector we create
temporary files on blob storage. Can you please suggest places where I can
look if I want to delete the temporary files on storage at the end of the
Spark session?

Thanks,
Nasrulla



[jira] [Created] (HIVE-20866) Result Caching doesnt work if order of the columns is changed in the query.

2018-11-04 Thread Nasrulla Khan Haris (JIRA)
Nasrulla Khan Haris created HIVE-20866:
--

 Summary: Result Caching doesnt work if order of the columns is 
changed in the query.
 Key: HIVE-20866
 URL: https://issues.apache.org/jira/browse/HIVE-20866
 Project: Hive
  Issue Type: New Feature
Reporter: Nasrulla Khan Haris


1. Run the query "select country,count from hivesampletable group by country;"
2. Make sure the result of the query is cached.
3. Run the query with the column order changed: "select count,country from
hivesampletable group by country;"
4. The query in step 3 should not spawn cluster tasks.

cc: [~ashitg]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20865) Query Result caching for queries which contain subqueries which are cached.

2018-11-04 Thread Nasrulla Khan Haris (JIRA)
Nasrulla Khan Haris created HIVE-20865:
--

 Summary: Query Result caching for queries which contain subqueries 
which are cached.
 Key: HIVE-20865
 URL: https://issues.apache.org/jira/browse/HIVE-20865
 Project: Hive
  Issue Type: New Feature
  Components: Hive
Reporter: Nasrulla Khan Haris


Steps:
1. Execute query select count from hivesampletable where country='United 
States'
2. Make sure the query results for the query in step 1 are cached.
3. Execute select count from hivesampletable where country='United States' or 
Country="United Kingdom";
4. Query should not spawn cluster tasks to execute select count from 
hivesampletable where country='United States'

 

cc : [~ashitg]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)