Re: Spark 3.3 + parquet 1.10

2023-07-24 Thread Pralabh Kumar
Spark 3.3 in OSS is built with parquet 1.12. Simply compiling it with
parquet 1.10 results in a build failure, so I am wondering whether anyone has
built and compiled Spark 3.3 with parquet 1.10.

Regards
Pralabh Kumar

On Mon, Jul 24, 2023 at 3:04 PM Mich Talebzadeh 
wrote:

> Hi,
>
> Where is this limitation coming from (using parquet 1.10)? That is a 2018 build.
>
> Have you tried Spark 3.3 with parquet writes as is? Just a small PoC will
> prove it.
>
> HTH
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 24 Jul 2023 at 10:15, Pralabh Kumar 
> wrote:
>
>> Hi Dev community.
>>
>> I have a quick question with respect to Spark 3.3. Currently Spark 3.3 is
>> built with parquet 1.12.
>> However, has anyone tried Spark 3.3 with parquet 1.10?
>>
>> We at Uber are planning to migrate to Spark 3.3, but we are limited to
>> using parquet 1.10. Has anyone tried building Spark 3.3 with parquet 1.10?
>> What are the dos/don'ts for it?
>>
>>
>> Regards
>> Pralabh Kumar
>>
>
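
A minimal PoC of the kind suggested above could be as small as the sketch
below (the scratch path is illustrative and a local SparkSession is assumed):

import org.apache.spark.sql.SparkSession

object ParquetPoC {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-poc")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val path = "/tmp/parquet-poc"  // illustrative scratch location
    Seq((1, "a"), (2, "b")).toDF("id", "value")
      .write.mode("overwrite").parquet(path)
    spark.read.parquet(path).show()

    spark.stop()
  }
}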


Spark 3.3 + parquet 1.10

2023-07-24 Thread Pralabh Kumar
Hi Dev community.

I have a quick question with respect to Spark 3.3. Currently Spark 3.3 is
built with parquet 1.12.
However, has anyone tried Spark 3.3 with parquet 1.10?

We at Uber are planning to migrate to Spark 3.3, but we are limited to using
parquet 1.10. Has anyone tried building Spark 3.3 with parquet 1.10? What are
the dos/don'ts for it?


Regards
Pralabh Kumar


Spark 3.0.0 EOL

2023-07-24 Thread Pralabh Kumar
Hi Dev Team

If possible, can you please provide the Spark 3.0.0 EOL timelines?

Regards
Pralabh Kumar


SPARK-43235

2023-05-03 Thread Pralabh Kumar
Hi Dev

Please find some time to review the Jira.

Regards
Pralabh Kumar


Re: Setting spark.kubernetes.driver.connectionTimeout, spark.kubernetes.submission.connectionTimeout to default spark.network.timeout

2022-08-02 Thread Pralabh Kumar
allows K8s API server-side caching via SPARK-36334.
> You may want to try the following configuration if you have very
> limited control plane resources.
>
> spark.kubernetes.executor.enablePollingWithResourceVersion=true
>
> Dongjoon.
>
> On Mon, Aug 1, 2022 at 7:52 AM Pralabh Kumar 
> wrote:
> >
> > Hi Dev team
> >
> >
> >
> > Since spark.network.timeout is the default for all network transactions,
> > shouldn't spark.kubernetes.driver.connectionTimeout and
> > spark.kubernetes.submission.connectionTimeout default to
> > spark.network.timeout?
> >
> > Users migrating from Yarn to K8s are familiar with spark.network.timeout,
> > and if a timeout occurs on K8s they need to set the above two properties
> > explicitly. If those properties defaulted to spark.network.timeout, users
> > would not need to set them explicitly and could rely on
> > spark.network.timeout alone.
> >
> >
> >
> > Please let me know if my understanding is correct
> >
> >
> >
> > Regards
> >
> > Pralabh Kumar
>


Setting spark.kubernetes.driver.connectionTimeout, spark.kubernetes.submission.connectionTimeout to default spark.network.timeout

2022-08-01 Thread Pralabh Kumar
Hi Dev team



Since spark.network.timeout is the default for all network transactions,
shouldn't spark.kubernetes.driver.connectionTimeout and
spark.kubernetes.submission.connectionTimeout default to
spark.network.timeout?

Users migrating from Yarn to K8s are familiar with spark.network.timeout, and
if a timeout occurs on K8s they need to set the above two properties
explicitly. If those properties defaulted to spark.network.timeout, users
would not need to set them explicitly and could rely on spark.network.timeout
alone.
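
Until such a default exists, the explicit workaround is a few lines of
configuration; a sketch (values are illustrative, and note that the two
Kubernetes timeouts take milliseconds while spark.network.timeout takes a
duration string):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.network.timeout", "120s")
  .set("spark.kubernetes.driver.connectionTimeout", "120000")     // milliseconds
  .set("spark.kubernetes.submission.connectionTimeout", "120000") // milliseconds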



Please let me know if my understanding is correct



Regards

Pralabh Kumar


Spark-39755 Review/comment

2022-07-14 Thread Pralabh Kumar
Hi Dev community

Please review/comment

https://issues.apache.org/jira/browse/SPARK-39755

Regards
Pralabh kumar


Re: Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'

2022-06-14 Thread Pralabh Kumar
Thx for the reply @Steve Loughran @Martin. It helps. However, just a minor
suggestion:

   - Should we update the documentation at
   https://spark.apache.org/docs/latest/#downloading, which talks about
   java.nio.DirectByteBuffer? We could add another case where users will hit
   the same error for Spark 3.2 on K8s running on a Hadoop version < 3.2
   (since the default in the Docker file for Spark 3.2 is Java 11).

Please let me know if it makes sense to you.

Regards
Pralabh Kumar

On Tue, Jun 14, 2022 at 4:21 PM Steve Loughran  wrote:

> hadoop 3.2.x is the oldest of the hadoop branch 3 branches which gets
> active security patches, as was done last month. I would strongly recommend
> using it unless there are other compatibility issues (hive?)
>
> On Tue, 14 Jun 2022 at 05:31, Pralabh Kumar 
> wrote:
>
>> Hi Steve / Dev team
>>
>> Thx for the help . Have a quick question ,  How can we fix the above
>> error in Hadoop 3.1 .
>>
>>- Spark docker file have (Java 11)
>>
>> https://github.com/apache/spark/blob/branch-3.2/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
>>
>>- Now if we build Spark32  , Spark image will be having Java 11 .  If
>>we run on a Hadoop version less than 3.2 , it will throw an exception.
>>
>>
>>
>>- Should there be a separate docker file for Spark32 for Java 8 for
>>Hadoop version < 3.2 .  Spark 3.0.1 have Java 8 in docker file which works
>>fine in our environment (with Hadoop3.1)
>>
>>
>> Regards
>> Pralabh Kumar
>>
>>
>>
>> On Mon, Jun 13, 2022 at 3:25 PM Steve Loughran 
>> wrote:
>>
>>>
>>>
>>> On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar 
>>> wrote:
>>>
>>>> Hi Dev team
>>>>
>>>> I have a spark32 image with Java 11 (Running Spark on K8s) .  While
>>>> reading a huge parquet file via  spark.read.parquet("") .  I am
>>>> getting the following error . The same error is mentioned in Spark docs
>>>> https://spark.apache.org/docs/latest/#downloading but w.r.t to apache
>>>> arrow.
>>>>
>>>>
>>>>- IMHO , I think the error is coming from Parquet 1.12.1  which is
>>>>based on Hadoop 2.10 which is not java 11 compatible.
>>>>
>>>>
>>> correct. see https://issues.apache.org/jira/browse/HADOOP-12760
>>>
>>>
>>> Please let me know if this understanding is correct and is there a way
>>>> to fix it.
>>>>
>>>
>>>
>>>
>>> upgrade to a version of hadoop with the fix. That's any version >=
>>> hadoop 3.2.0 which shipped since 2018
>>>
>>>>
>>>>
>>>> java.lang.NoSuchMethodError: 'sun.misc.Cleaner
>>>> sun.nio.ch.DirectBuffer.cleaner()'
>>>>
>>>> at
>>>> org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:41)
>>>>
>>>> at
>>>> org.apache.hadoop.crypto.CryptoInputStream.freeBuffers(CryptoInputStream.java:687)
>>>>
>>>> at
>>>> org.apache.hadoop.crypto.CryptoInputStream.close(CryptoInputStream.java:320)
>>>>
>>>> at java.base/java.io.FilterInputStream.close(Unknown
>>>> Source)
>>>>
>>>> at
>>>> org.apache.parquet.hadoop.util.H2SeekableInputStream.close(H2SeekableInputStream.java:50)
>>>>
>>>> at
>>>> org.apache.parquet.hadoop.ParquetFileReader.close(ParquetFileReader.java:1299)
>>>>
>>>> at
>>>> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:54)
>>>>
>>>> at
>>>> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
>>>>
>>>> at
>>>> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:467)
>>>>
>>>> at
>>>> org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)
>>>>
>>>> at
>>>> scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
>>>>
>>>> at scala.util.Success.$anonfun$map$1(Try.scala:255)
>>>>
>>>> at scala.util.Success.map(Try.scala:213)
>>>>
>>>> at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
>>>>
>>>> at
>>>> scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
>>>>
>>>> at
>>>> scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
>>>>
>>>> at
>>>> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
>>>>
>>>> at
>>>> java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown
>>>> Source)
>>>>
>>>> at
>>>> java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
>>>>
>>>> at
>>>> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown
>>>> Source)
>>>>
>>>> at
>>>> java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
>>>>
>>>> at
>>>> java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
>>>>
>>>> at
>>>> java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
>>>>
>>>


Re: Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'

2022-06-13 Thread Pralabh Kumar
Hi Steve / Dev team

Thx for the help. A quick question: how can we fix the above error on Hadoop 3.1?

   - The Spark Docker file uses Java 11:
   https://github.com/apache/spark/blob/branch-3.2/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile

   - So if we build Spark 3.2, the Spark image will have Java 11. If we then
   run on a Hadoop version less than 3.2, it will throw an exception.

   - Should there be a separate Docker file for Spark 3.2 with Java 8 for
   Hadoop versions < 3.2? Spark 3.0.1 has Java 8 in its Docker file, which
   works fine in our environment (with Hadoop 3.1).


Regards
Pralabh Kumar



On Mon, Jun 13, 2022 at 3:25 PM Steve Loughran  wrote:

>
>
> On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar 
> wrote:
>
>> Hi Dev team
>>
>> I have a spark32 image with Java 11 (Running Spark on K8s) .  While
>> reading a huge parquet file via  spark.read.parquet("") .  I am getting
>> the following error . The same error is mentioned in Spark docs
>> https://spark.apache.org/docs/latest/#downloading but w.r.t to apache
>> arrow.
>>
>>
>>- IMHO , I think the error is coming from Parquet 1.12.1  which is
>>based on Hadoop 2.10 which is not java 11 compatible.
>>
>>
> correct. see https://issues.apache.org/jira/browse/HADOOP-12760
>
>
> Please let me know if this understanding is correct and is there a way to
>> fix it.
>>
>
>
>
> upgrade to a version of hadoop with the fix. That's any version >= hadoop
> 3.2.0 which shipped since 2018
>
>>
>>
>> java.lang.NoSuchMethodError: 'sun.misc.Cleaner
>> sun.nio.ch.DirectBuffer.cleaner()'
>>
>> at
>> org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:41)
>>
>> at
>> org.apache.hadoop.crypto.CryptoInputStream.freeBuffers(CryptoInputStream.java:687)
>>
>> at
>> org.apache.hadoop.crypto.CryptoInputStream.close(CryptoInputStream.java:320)
>>
>> at java.base/java.io.FilterInputStream.close(Unknown Source)
>>
>> at
>> org.apache.parquet.hadoop.util.H2SeekableInputStream.close(H2SeekableInputStream.java:50)
>>
>> at
>> org.apache.parquet.hadoop.ParquetFileReader.close(ParquetFileReader.java:1299)
>>
>> at
>> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:54)
>>
>> at
>> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
>>
>> at
>> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:467)
>>
>> at
>> org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)
>>
>> at
>> scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
>>
>> at scala.util.Success.$anonfun$map$1(Try.scala:255)
>>
>> at scala.util.Success.map(Try.scala:213)
>>
>> at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
>>
>> at
>> scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
>>
>> at
>> scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
>>
>> at
>> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
>>
>> at
>> java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown
>> Source)
>>
>> at
>> java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
>>
>> at
>> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown
>> Source)
>>
>> at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown
>> Source)
>>
>> at
>> java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
>>
>> at
>> java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
>>
>


Re: Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'

2022-06-13 Thread Pralabh Kumar
Steve, thx for your help; please ignore the last comment.

Regards
Pralabh Kumar

On Mon, 13 Jun 2022, 15:43 Pralabh Kumar,  wrote:

> Hi steve
>
> Thx for help . We are on Hadoop3.2 ,however we are building Hadoop3.2 with
> Java 8 .
>
> Do you suggest to build Hadoop with Java 11
>
> Regards
> Pralabh kumar
>
> On Mon, 13 Jun 2022, 15:25 Steve Loughran,  wrote:
>
>>
>>
>> On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar 
>> wrote:
>>
>>> Hi Dev team
>>>
>>> I have a spark32 image with Java 11 (Running Spark on K8s) .  While
>>> reading a huge parquet file via  spark.read.parquet("") .  I am getting
>>> the following error . The same error is mentioned in Spark docs
>>> https://spark.apache.org/docs/latest/#downloading but w.r.t to apache
>>> arrow.
>>>
>>>
>>>- IMHO , I think the error is coming from Parquet 1.12.1  which is
>>>based on Hadoop 2.10 which is not java 11 compatible.
>>>
>>>
>> correct. see https://issues.apache.org/jira/browse/HADOOP-12760
>>
>>
>> Please let me know if this understanding is correct and is there a way to
>>> fix it.
>>>
>>
>>
>>
>> upgrade to a version of hadoop with the fix. That's any version >= hadoop
>> 3.2.0 which shipped since 2018
>>
>>>
>>>
>>> java.lang.NoSuchMethodError: 'sun.misc.Cleaner
>>> sun.nio.ch.DirectBuffer.cleaner()'
>>>
>>> at
>>> org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:41)
>>>
>>> at
>>> org.apache.hadoop.crypto.CryptoInputStream.freeBuffers(CryptoInputStream.java:687)
>>>
>>> at
>>> org.apache.hadoop.crypto.CryptoInputStream.close(CryptoInputStream.java:320)
>>>
>>> at java.base/java.io.FilterInputStream.close(Unknown Source)
>>>
>>> at
>>> org.apache.parquet.hadoop.util.H2SeekableInputStream.close(H2SeekableInputStream.java:50)
>>>
>>> at
>>> org.apache.parquet.hadoop.ParquetFileReader.close(ParquetFileReader.java:1299)
>>>
>>> at
>>> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:54)
>>>
>>> at
>>> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
>>>
>>> at
>>> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:467)
>>>
>>> at
>>> org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)
>>>
>>> at
>>> scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
>>>
>>> at scala.util.Success.$anonfun$map$1(Try.scala:255)
>>>
>>> at scala.util.Success.map(Try.scala:213)
>>>
>>> at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
>>>
>>> at
>>> scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
>>>
>>> at
>>> scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
>>>
>>> at
>>> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
>>>
>>> at
>>> java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown
>>> Source)
>>>
>>> at
>>> java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
>>>
>>> at
>>> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown
>>> Source)
>>>
>>> at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown
>>> Source)
>>>
>>> at
>>> java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
>>>
>>> at
>>> java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
>>>
>>


Re: Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'

2022-06-13 Thread Pralabh Kumar
Hi steve

Thx for the help. We are on Hadoop 3.2; however, we are building Hadoop 3.2
with Java 8.

Do you suggest building Hadoop with Java 11?

Regards
Pralabh kumar

On Mon, 13 Jun 2022, 15:25 Steve Loughran,  wrote:

>
>
> On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar 
> wrote:
>
>> Hi Dev team
>>
>> I have a spark32 image with Java 11 (Running Spark on K8s) .  While
>> reading a huge parquet file via  spark.read.parquet("") .  I am getting
>> the following error . The same error is mentioned in Spark docs
>> https://spark.apache.org/docs/latest/#downloading but w.r.t to apache
>> arrow.
>>
>>
>>- IMHO , I think the error is coming from Parquet 1.12.1  which is
>>based on Hadoop 2.10 which is not java 11 compatible.
>>
>>
> correct. see https://issues.apache.org/jira/browse/HADOOP-12760
>
>
> Please let me know if this understanding is correct and is there a way to
>> fix it.
>>
>
>
>
> upgrade to a version of hadoop with the fix. That's any version >= hadoop
> 3.2.0 which shipped since 2018
>
>>
>>
>> java.lang.NoSuchMethodError: 'sun.misc.Cleaner
>> sun.nio.ch.DirectBuffer.cleaner()'
>>
>> at
>> org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:41)
>>
>> at
>> org.apache.hadoop.crypto.CryptoInputStream.freeBuffers(CryptoInputStream.java:687)
>>
>> at
>> org.apache.hadoop.crypto.CryptoInputStream.close(CryptoInputStream.java:320)
>>
>> at java.base/java.io.FilterInputStream.close(Unknown Source)
>>
>> at
>> org.apache.parquet.hadoop.util.H2SeekableInputStream.close(H2SeekableInputStream.java:50)
>>
>> at
>> org.apache.parquet.hadoop.ParquetFileReader.close(ParquetFileReader.java:1299)
>>
>> at
>> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:54)
>>
>> at
>> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
>>
>> at
>> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:467)
>>
>> at
>> org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)
>>
>> at
>> scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
>>
>> at scala.util.Success.$anonfun$map$1(Try.scala:255)
>>
>> at scala.util.Success.map(Try.scala:213)
>>
>> at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
>>
>> at
>> scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
>>
>> at
>> scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
>>
>> at
>> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
>>
>> at
>> java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown
>> Source)
>>
>> at
>> java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
>>
>> at
>> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown
>> Source)
>>
>> at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown
>> Source)
>>
>> at
>> java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
>>
>> at
>> java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
>>
>


Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'

2022-06-13 Thread Pralabh Kumar
Hi Dev team

I have a spark32 image with Java 11 (running Spark on K8s). While reading a
huge parquet file via spark.read.parquet(""), I am getting the following
error. The same error is mentioned in the Spark docs at
https://spark.apache.org/docs/latest/#downloading, but with respect to Apache
Arrow.

   - IMHO, the error is coming from Parquet 1.12.1, which is built against
   Hadoop 2.10, which is not Java 11 compatible.

Please let me know if this understanding is correct and whether there is a way
to fix it.


java.lang.NoSuchMethodError: 'sun.misc.Cleaner
sun.nio.ch.DirectBuffer.cleaner()'

at
org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:41)

at
org.apache.hadoop.crypto.CryptoInputStream.freeBuffers(CryptoInputStream.java:687)

at
org.apache.hadoop.crypto.CryptoInputStream.close(CryptoInputStream.java:320)

at java.base/java.io.FilterInputStream.close(Unknown Source)

at
org.apache.parquet.hadoop.util.H2SeekableInputStream.close(H2SeekableInputStream.java:50)

at
org.apache.parquet.hadoop.ParquetFileReader.close(ParquetFileReader.java:1299)

at
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:54)

at
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)

at
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:467)

at
org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)

at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)

at scala.util.Success.$anonfun$map$1(Try.scala:255)

at scala.util.Success.map(Try.scala:213)

at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)

at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)

at
scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)

at
java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown
Source)

at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown
Source)

at
java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown
Source)

at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown
Source)

at
java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)

at
java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
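
A quick way to confirm which Hadoop build and Java runtime an image actually
ships is the spark-shell sketch below (it only reports versions; it does not
fix the error):

println("Hadoop: " + org.apache.hadoop.util.VersionInfo.getVersion())
println("Java:   " + System.getProperty("java.version"))
println("Spark:  " + spark.version)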


CVE-2020-13936

2022-05-05 Thread Pralabh Kumar
Hi Dev Team

Please let me know if there is a jira to track the changes for this CVE with
respect to Spark. I searched jira but couldn't find anything.

Please help

Regards
Pralabh Kumar


CVE-2021-22569

2022-05-04 Thread Pralabh Kumar
Hi Dev Team

Spark is using protobuf 2.5.0, which is vulnerable to CVE-2021-22569. The CVE
recommends using protobuf 3.19.2.

Please let me know if there is a jira to track the update w.r.t. this CVE and
Spark, or should I create one?

Regards
Pralabh Kumar


Re: Issue on Spark on K8s with Proxy user on Kerberized HDFS : Spark-25355

2022-05-03 Thread Pralabh Kumar
 java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Unknown Source)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:163)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)



Again Thx for the input and advice regarding documentation and
apologies for putting the wrong error stack earlier.

Regards
Pralabh Kumar
On Tue, May 3, 2022 at 7:39 PM Steve Loughran  wrote:

>
> Pralabh, did you follow the URL provided in the exception message? I put a
> lot of effort into improving the diagnostics, where the wiki articles are
> part of the troubleshooting process:
> https://issues.apache.org/jira/browse/HADOOP-7469
>
> it's really disappointing when people escalate the problem to open source
> developers before trying to fix the problem themselves, in this case, read
> the error message.
>
> now, if there is some k8s related issue which makes this more common, you
> are encouraged to update the wiki entry with a new cause. documentation is
> an important contribution to open source projects, and if you have
> discovered a new way to recreate the failure, it would be welcome. which
> reminds me, i have to add something to connection reset and docker which
> comes down to "turn off http keepalive in maven builds"
>
> -Steve
>
>
>
>
>
> On Sat, 30 Apr 2022 at 10:45, Gabor Somogyi 
> wrote:
>
>> Hi,
>>
>> Please be aware that ConnectionRefused exception is has nothing to do w/
>> authentication. See the description from Hadoop wiki:
>> "You get a ConnectionRefused
>> <https://cwiki.apache.org/confluence/display/HADOOP2/ConnectionRefused> 
>> Exception
>> when there is a machine at the address specified, but there is no program
>> listening on the specific TCP port the client is using -and there is no
>> firewall in the way silently dropping TCP connection requests. If you do
>> not know what a TCP connection request is, please consult the
>> specification <http://www.ietf.org/rfc/rfc793.txt>."
>>
>> This means the namenode on host:port is not reachable in the TCP layer.
>> Maybe there are multiple issues but I'm pretty sure that something is wrong
>> in the K8S net config.
>>
>> BR,
>> G
>>
>>
>> On Fri, Apr 29, 2022 at 6:23 PM Pralabh Kumar 
>> wrote:
>>
>>> Hi dev Team
>>>
>>> Spark-25355 added the functionality of the proxy user on K8s . However
>>> proxy user on K8s with Kerberized HDFS is not working .  It is throwing
>>> exception and
>>>
>>> 22/04/21 17:50:30 WARN Client: Exception encountered while connecting to
>>> the server : org.apache.hadoop.security.AccessControlException: Client
>>> cannot authenticate via:[TOKEN, KERBEROS]
>>>
>>>
>>> Exception in thread "main" java.net.ConnectException: Call From
>>>  to  failed on connection exception:
>>> java.net.ConnectException: Connection refused; For more details see:  http:
>>> //wiki.apache.org/hadoop/ConnectionRefused
>>>
>>> at
>>> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> Method)
>>>
>>> at
>>> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
>>> Source)
>>>
>>> at
>>> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
>>> Source)
>>>
>>> at java.base/java.lang.reflect.Constructor.newInstance(Unknown
>>> Source)
>>>
>>> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
>>>
>>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755)
>>>
>>> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501)
>>>
>>> at org.apache.hadoop.ipc.Client.call(Client.java:1443)
>>>
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1353)
>>>
>>> at
>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>>>
>>> at
>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>>>
>>> at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
>>>
>>> at
>>>
>>>
>>>
>>> On debugging deep , we found the proxy user doesn't have access to
>>> delegation tokens in case of K8s .SparkSubmit.submit explicitly creating
>>> the proxy user and this user doesn't have delegation token.
>>>
>>>
>>> Please help me with the same.
>>>
>>>
>>> Regards
>>>
>>> Pralabh Kumar
>>>
>>>
>>>
>>>


Issue on Spark on K8s with Proxy user on Kerberized HDFS : Spark-25355

2022-04-29 Thread Pralabh Kumar
Hi dev Team

SPARK-25355 added proxy user functionality on K8s. However, a proxy user on
K8s with Kerberized HDFS is not working; it throws the exception below:

22/04/21 17:50:30 WARN Client: Exception encountered while connecting to
the server : org.apache.hadoop.security.AccessControlException: Client
cannot authenticate via:[TOKEN, KERBEROS]


Exception in thread "main" java.net.ConnectException: Call From 
to  failed on connection exception: java.net.ConnectException:
Connection refused; For more details see:  http://
wiki.apache.org/hadoop/ConnectionRefused

at
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)

at
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
Source)

at
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
Source)

at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)

at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)

at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501)

at org.apache.hadoop.ipc.Client.call(Client.java:1443)

at org.apache.hadoop.ipc.Client.call(Client.java:1353)

at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)

at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)

at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)

at



On debugging deeper, we found that the proxy user doesn't have access to
delegation tokens in the K8s case: SparkSubmit.submit explicitly creates the
proxy user, and this user doesn't have a delegation token.
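
Roughly, the pattern involved is the one sketched below (illustrative only;
"alice" is a made-up proxy user, and whether tokens are present depends on the
submission path):

import java.security.PrivilegedExceptionAction

import org.apache.hadoop.security.UserGroupInformation

// SparkSubmit wraps the user class in proxyUser.doAs(...). Unless HDFS delegation
// tokens are obtained for that UGI (for example via addCredentials), filesystem
// calls made inside doAs cannot authenticate via TOKEN or KERBEROS on K8s.
val realUser  = UserGroupInformation.getLoginUser()                      // kinit/keytab identity
val proxyUser = UserGroupInformation.createProxyUser("alice", realUser)  // "alice" is illustrative
proxyUser.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // HDFS access here fails with "Client cannot authenticate via:[TOKEN, KERBEROS]"
    // when the proxy UGI carries no delegation tokens.
  }
})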


Please help me with the same.


Regards

Pralabh Kumar


Spark3.2 on K8s with proxy-user kerberized environment

2022-04-25 Thread Pralabh Kumar
Hi dev team
Please help me on the below problem

I have a Kerberized cluster and am also doing the kinit. The problem only
occurs when a proxy user is being used.

>
> Running Spark 3.2 on K8s with --proxy-user and getting below error and
> then the job fails . However when running without a proxy user job is
> running fine . Can anyone please help me with the same .
>
>
> 22/04/21 17:50:30 WARN Client: Exception encountered while connecting to
> the server : org.apache.hadoop.security.AccessControlException: Client
> cannot authenticate via:[TOKEN, KERBEROS]
>
> 22/04/21 17:50:31 WARN Client: Exception encountered while connecting to
> the server : org.apache.hadoop.security.AccessControlException: Client
> cannot authenticate via:[TOKEN, KERBEROS]
>
> 22/04/21 17:50:37 WARN Client: Exception encountered while connecting to
> the server : org.apache.hadoop.security.AccessControlException: Client
> cannot authenticate via:[TOKEN, KERBEROS]
>
> 22/04/21 17:50:53 WARN Client: Exception encountered while connecting to
> the server : org.apache.hadoop.security.AccessControlException: Client
> cannot authenticate via:[TOKEN, KERBEROS]
>
> 22/04/21 17:51:32 WARN Client: Exception encountered while connecting to
> the server : org.apache.hadoop.security.AccessControlException: Client
> cannot authenticate via:[TOKEN, KERBEROS]
>
> 22/04/21 17:52:07 WARN Client: Exception encountered while connecting to
> the server : org.apache.hadoop.security.AccessControlException: Client
> cannot authenticate via:[TOKEN, KERBEROS]
>
> 22/04/21 17:52:27 WARN Client: Exception encountered while connecting to
> the server : org.apache.hadoop.security.AccessControlException: Client
> cannot authenticate via:[TOKEN, KERBEROS]
>
> 22/04/21 17:52:53 WARN Client: Exception encountered while connecting to
> the server : org.apache.hadoop.security.AccessControlException: Client
> cannot authenticate via:[TOKEN, KERBEROS]
>
>


Re: CVE -2020-28458, How to upgrade datatables dependency

2022-04-17 Thread Pralabh Kumar
Thx for the update. ( I was about to create the PR).

Thx for looking into it.

On Sun, 17 Apr 2022, 00:26 Sean Owen,  wrote:

> FWIW here's an update to 1.10.25:
> https://github.com/apache/spark/pull/36226
>
>
> On Wed, Apr 13, 2022 at 8:28 AM Sean Owen  wrote:
>
>> You can see the files in
>> core/src/main/resources/org/apache/spark/ui/static - you can try dropping
>> in the new minified versions and see if the UI is OK.
>> You can open a pull request if it works to update it, in case this
>> affects Spark.
>> It looks like the smaller upgrade to 1.10.22 is also sufficient.
>>
>> On Wed, Apr 13, 2022 at 7:43 AM Pralabh Kumar 
>> wrote:
>>
>>> Hi Dev Team
>>>
>>> Spark 3.2 (and 3.3 might also) have CVE 2020-28458.  Therefore  in my
>>> local repo of Spark I would like to update DataTables to 1.11.5.
>>>
>>> Can you please help me to point out where I should upgrade DataTables
>>> dependency ?.
>>>
>>> Regards
>>> Pralabh Kumar
>>>
>>


CVE -2020-28458, How to upgrade datatables dependency

2022-04-13 Thread Pralabh Kumar
Hi Dev Team

Spark 3.2 (and possibly 3.3) has CVE-2020-28458. Therefore, in my local repo
of Spark I would like to update DataTables to 1.11.5.

Can you please point out where I should upgrade the DataTables dependency?

Regards
Pralabh Kumar


Spark 3.0.1 and spark 3.2 compatibility

2022-04-07 Thread Pralabh Kumar
Hi spark community

I have a quick question. I am planning to migrate from Spark 3.0.1 to Spark
3.2.

Do I need to recompile my application with 3.2 dependencies, or will an
application compiled against 3.0.1 work fine on 3.2?
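
Applications built against the stable APIs generally keep working across 3.x
minor releases, but recompiling against the target version is the safer path.
A minimal build.sbt sketch (version numbers are illustrative; Spark is marked
provided, as is usual for cluster deployments):

ThisBuild / scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.2.0" % "provided"
)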


Regards
Pralabh kumar


Spark on K8s , some applications ended ungracefully

2022-03-31 Thread Pralabh Kumar
Hi Spark Team

Some of my Spark applications on K8s ended with the error below. Although
these applications completed successfully (the event log has a
SparkListenerApplicationEnd event at the end), they still have event log files
with the .inprogress suffix. This causes the applications to be shown as in
progress in the SHS.

Spark v : 3.0.1



22/03/31 08:33:34 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout,
java.util.concurrent.TimeoutException

java.util.concurrent.TimeoutException

at java.util.concurrent.FutureTask.get(FutureTask.java:205)

at
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:68)

22/03/31 08:33:34 WARN SparkContext: Ignoring Exception while stopping
SparkContext from shutdown hook

java.lang.InterruptedException

at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)

at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)

at
java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)

at
org.apache.spark.util.ThreadUtils$.shutdown(ThreadUtils.scala:348)





Please let me know if there is a solution for it ..

Regards

Pralabh Kumar


Skip single integration test case in Spark on K8s

2022-03-16 Thread Pralabh Kumar
Hi Spark team

I am running Spark kubernetes integration test suite on cloud.

build/mvn install \

-f  pom.xml \

-pl resource-managers/kubernetes/integration-tests -am -Pscala-2.12
-Phadoop-3.1.1 -Phive -Phive-thriftserver -Pyarn -Pkubernetes
-Pkubernetes-integration-tests \

-Djava.version=8 \

-Dspark.kubernetes.test.sparkTgz= \

-Dspark.kubernetes.test.imageTag=<> \

-Dspark.kubernetes.test.imageRepo=<repo> \

-Dspark.kubernetes.test.deployMode=cloud \

-Dtest.include.tags=k8s \

-Dspark.kubernetes.test.javaImageTag= \

-Dspark.kubernetes.test.namespace= \

-Dspark.kubernetes.test.serviceAccountName=spark \

-Dspark.kubernetes.test.kubeConfigContext=<> \

-Dspark.kubernetes.test.master=<> \

-Dspark.kubernetes.test.jvmImage=<> \

-Dspark.kubernetes.test.pythonImage=<> \

-Dlog4j.logger.org.apache.spark=DEBUG



I am able to run some test cases successfully, while some are failing. For
example, "Run SparkRemoteFileTest using a Remote data file" in KubernetesSuite
is failing.


Is there a way to skip running some of the test cases?



Please help me on the same.


Regards

Pralabh Kumar


Spark on K8s : property simillar to yarn.max.application.attempt

2022-02-04 Thread Pralabh Kumar
Hi Spark Team

I am running Spark on K8s and am looking for a property/mechanism similar to
yarn.max.application.attempt. I know this is not really a Spark question, but
I thought someone might have faced a similar issue.

Basically, if my driver pod fails, I want it to be retried on a different
machine. Is there a way to do the same?

Regards
Pralabh Kumar


Re: Spark on k8s : spark 3.0.1 spark.kubernetes.executor.deleteontermination issue

2022-01-18 Thread Pralabh Kumar
Does the property spark.kubernetes.executor.deleteOnTermination check whether
the executor being deleted has shuffle data or not?

On Tue, 18 Jan 2022, 11:20 Pralabh Kumar,  wrote:

> Hi spark team
>
> Have the cluster-wide property spark.kubernetes.executor.deleteOnTermination
> set to true.
> During a long-running job, some of the executors that held shuffle data got
> deleted. Because of this, in the subsequent stage we get a lot of shuffle
> fetch failure exceptions.
>
>
> Please let me know if there is a way to fix it. Note that if I set the above
> property to false, I see no shuffle fetch exceptions.
>
>
> Regards
> Pralabh
>


Spark on k8s : spark 3.0.1 spark.kubernetes.executor.deleteontermination issue

2022-01-17 Thread Pralabh Kumar
Hi spark team

We have the cluster-wide property spark.kubernetes.executor.deleteOnTermination
set to true.
During a long-running job, some of the executors that held shuffle data got
deleted. Because of this, in the subsequent stage we get a lot of shuffle
fetch failure exceptions.


Please let me know if there is a way to fix it. Note that if I set the above
property to false, I see no shuffle fetch exceptions.


Regards
Pralabh


Difference in behavior for Spark 3.0 vs Spark 3.1 "create database "

2022-01-10 Thread Pralabh Kumar
Hi Spark Team

When creating a database via Spark 3.0 on Hive

1) spark.sql("create database test location '/user/hive'").  It creates the
database location on hdfs . As expected

2) When running the same command on 3.1 the database is created on the
local file system by default. I have to prefix with hdfs to create db on
hdfs.
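
In other words, spelling out the scheme removes the ambiguity (a sketch; the
path is illustrative):

spark.sql("CREATE DATABASE test LOCATION 'hdfs:///user/hive'")   // explicit scheme
// spark.sql("CREATE DATABASE test LOCATION '/user/hive'")       // resolves differently on 3.1, as described above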

Why is there a difference in behavior? Can you please point me to the jira
which caused this change?

Note: spark.sql.warehouse.dir and hive.metastore.warehouse.dir both have
default values (not explicitly set).

Regards
Pralabh Kumar


ivy unit test case filing for Spark

2021-12-21 Thread Pralabh Kumar
Hi Spark Team

I am building Spark inside a VPN, but the unit test case below is failing.
It points to an ivy location which cannot be reached from within the VPN. Any
help would be appreciated.

test("SPARK-33084: Add jar support Ivy URI -- default transitive = true") {
  *sc *= new SparkContext(new
SparkConf().setAppName("test").setMaster("local-cluster[3,
1, 1024]"))
  *sc*.addJar("*ivy://org.apache.hive:hive-storage-api:2.7.0*")
  assert(*sc*.listJars().exists(_.contains(
"org.apache.hive_hive-storage-api-2.7.0.jar")))
  assert(*sc*.listJars().exists(_.contains(
"commons-lang_commons-lang-2.6.jar")))
}

Error

- SPARK-33084: Add jar support Ivy URI -- default transitive = true ***
FAILED ***
java.lang.RuntimeException: [unresolved dependency:
org.apache.hive#hive-storage-api;2.7.0: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(
SparkSubmit.scala:1447)
at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(
DependencyUtils.scala:185)
at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(
DependencyUtils.scala:159)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1996)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928)
at org.apache.spark.SparkContextSuite.$anonfun$new$115(SparkContextSuite.
scala:1041)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
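
If Maven Central is unreachable from inside the VPN, one possible workaround
(a sketch; the settings file path and the internal mirror it would declare are
placeholders, and whether it reaches this particular test path depends on how
the properties are propagated) is to point jar resolution at an internal
repository via spark.jars.ivySettings:

val conf = new org.apache.spark.SparkConf()
  .setAppName("test")
  .setMaster("local-cluster[3, 1, 1024]")
  .set("spark.jars.ivySettings", "/path/to/internal-ivysettings.xml")  // placeholder path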

Regards
Pralabh Kumar


Log4j 1.2.17 spark CVE

2021-12-12 Thread Pralabh Kumar
Hi developers,  users

Spark is built using log4j 1.2.17. Is there a plan to upgrade based on the
recently detected CVE?


Regards
Pralabh kumar


https://issues.apache.org/jira/browse/SPARK-36622

2021-09-01 Thread Pralabh Kumar
Hi Spark dev  Community


Please let me know your opinion about
https://issues.apache.org/jira/browse/SPARK-36622

Regards
Pralabh Kumar


Spark Thriftserver is failing for when submitting command from beeline

2021-08-20 Thread Pralabh Kumar
Hi Dev

Environment details

Hadoop 3.2
Hive 3.1
Spark 3.0.3

Cluster : Kerborized .

1) Hive server is running fine
2) Spark sql , sparkshell, spark submit everything is working as expected.
3) Connecting Hive through beeline is working fine (after kinit)
beeline -u "jdbc:hive2://:/default;principal=

Now I launched the Spark Thrift Server and tried to connect to it through
beeline.

The beeline client connects to STS perfectly.

4) beeline -u "jdbc:hive2://:/default;principal=
   a) The log says connected to
   Spark SQL
   Driver: Hive JDBC


Now when I run any command ("show tables") it fails. The log in STS says

21/08/19 19:30:12 DEBUG UserGroupInformation: PrivilegedAction as:
(auth:PROXY) via  (auth:KERBEROS)
from:org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Client.createClientTransport(HadoopThriftAuthBridge.java:208)
21/08/19 19:30:12 DEBUG UserGroupInformation: PrivilegedAction as:
(auth:PROXY) via  (auth:KERBEROS)
from:org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
21/08/19 19:30:12 DEBUG TSaslTransport: opening transport
org.apache.thrift.transport.TSaslClientTransport@f43fd2f
21/08/19 19:30:12 ERROR TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]
at
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at
org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:95)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:38)
at
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
at
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:480)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:247)
at
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1707)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:83)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:133)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3600)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3652)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3632)
at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1556)
at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1545)
at
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$databaseExists$1(HiveClientImpl.scala:384)




My guess is authorization through proxy is not working .



Please help


Regards
Pralabh Kumar


Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Pralabh Kumar
Hi mich

Thx for replying. Your answer really helps. The comparison was done in 2016;
I would like to know the latest comparison with Spark 3.0.

Also, what you are suggesting is to migrate the queries to Spark, i.e.
HiveContext rather than Hive on Spark, which is what Facebook also did. Is
that understanding correct?

Regards
Pralabh

On Thu, 1 Jul 2021, 15:44 Mich Talebzadeh, 
wrote:

> Hi Prahabh,
>
> This question has been asked before :)
>
> Few years ago (late 2016),  I made a presentation on running Hive Queries
> on the Spark execution engine for Hortonworks.
>
>
> https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations
>
> The issue you will face will be compatibility problems with versions of
> Hive and Spark.
>
> My suggestion would be to use Spark as a massive parallel processing and
> Hive as a storage layer. However, you need to test what can be migrated or
> not.
>
> HTH
>
>
> Mich
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 1 Jul 2021 at 10:52, Pralabh Kumar  wrote:
>
>> Hi Dev
>>
>> I am having thousands of legacy hive queries .  As a plan to move to
>> Spark , we are planning to migrate Hive queries on Spark .  Now there are
>> two approaches
>>
>>
>>1.  One is Hive on Spark , which is similar to changing the execution
>>engine in hive queries like TEZ.
>>2. Another one is migrating hive queries to Hivecontext/sparksql , an
>>approach used by Facebook and presented in Spark conference.
>>
>> https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention
>>.
>>
>>
>> Can you please guide me which option to go for . I am personally inclined
>> to go for option 2 . It also allows the use of the latest spark .
>>
>> Please help me on the same , as there are not much comparisons online
>> available keeping Spark 3.0 in perspective.
>>
>> Regards
>> Pralabh Kumar
>>
>>
>>


Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Pralabh Kumar
Hi Dev

We have thousands of legacy Hive queries. As part of a plan to move to Spark,
we are planning to migrate the Hive queries to Spark. There are two approaches:


   1. Hive on Spark, which is similar to changing the execution engine in
   Hive queries, like Tez.
   2. Migrating the Hive queries to HiveContext/Spark SQL, an approach used
   by Facebook and presented at a Spark conference:
   https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention


Can you please guide me on which option to go for? I am personally inclined
to go for option 2; it also allows the use of the latest Spark.

Please help me on the same, as there are not many comparisons available
online that keep Spark 3.0 in perspective.

Regards
Pralabh Kumar


Unable to pickle pySpark PipelineModel

2020-12-10 Thread Pralabh Kumar
Hi Dev , User

I want to store Spark ML models in a database so that I can reuse them later
on. I am unable to pickle them. However, while using Scala I am able to
convert them into a byte array stream.

So, for e.g., I am able to do something like the below in Scala but not in
Python:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import com.datastax.spark.connector._  // saveToCassandra and SomeColumns

val modelToByteArray = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(modelToByteArray)
oos.writeObject(model)
oos.flush()
oos.close()

spark.sparkContext
  .parallelize(Seq((model.uid, "my-neural-network-model", modelToByteArray.toByteArray)))
  .saveToCassandra("dfsdfs", "models", SomeColumns("uid", "name", "model"))


But pickle.dumps(model) in pyspark throws error

cannot pickle '_thread.RLock' object


Please help on the same


Regards

Pralabh


Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-02 Thread Pralabh Kumar
I am performing a join operation. If I convert the reduce-side join to a
map-side join (so no shuffle happens), I assume this error shouldn't occur.
Let me know if this understanding is correct.
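
For reference, a broadcast (map-side) join can be forced explicitly as in the
sketch below (table and column names are placeholders); it avoids shuffling
the joined data only as long as the small side actually fits in executor
memory:

import org.apache.spark.sql.functions.broadcast

// Placeholders standing in for the real tables in the failing join.
val largeDf = spark.range(1000000L).withColumnRenamed("id", "join_key")
val smallDf = spark.range(100L).withColumnRenamed("id", "join_key")

// Broadcasting the small side turns the reduce-side (shuffle) join into a map-side join.
val joined = largeDf.join(broadcast(smallDf), Seq("join_key"))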

On Tue, May 1, 2018 at 9:37 PM, Ryan Blue  wrote:

> This is usually caused by skew. Sometimes you can work around it by in
> creasing the number of partitions like you tried, but when that doesn’t
> work you need to change the partitioning that you’re using.
>
> If you’re aggregating, try adding an intermediate aggregation. For
> example, if your query is select sum(x), a from t group by a, then try select
> sum(partial), a from (select sum(x) as partial, a, b from t group by a, b)
> group by a.
>
> rb
> ​
>
> On Tue, May 1, 2018 at 4:21 AM, Pralabh Kumar 
> wrote:
>
>> Hi
>>
>> I am getting the above error in Spark SQL . I have increase (using 5000 )
>> number of partitions but still getting the same error .
>>
>> My data most probably is skew.
>>
>>
>>
>> org.apache.spark.shuffle.FetchFailedException: Too large frame: 4247124829
>>  at 
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419)
>>  at 
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349)
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Pralabh Kumar
Hi

I am getting the above error in Spark SQL. I have increased the number of
partitions (using 5000) but am still getting the same error.

My data is most probably skewed.



org.apache.spark.shuffle.FetchFailedException: Too large frame: 4247124829
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349)


Re: Best way to Hive to Spark migration

2018-04-04 Thread Pralabh Kumar
Hi

I have a lot of ETL jobs (complex ones); since they are SLA critical, I am
planning to migrate them to Spark.

On Thu, Apr 5, 2018 at 10:46 AM, Jörn Franke  wrote:

> You need to provide more context on what you do currently in Hive and what
> do you expect from the migration.
>
> On 5. Apr 2018, at 05:43, Pralabh Kumar  wrote:
>
> Hi Spark group
>
> What's the best way to Migrate Hive to Spark
>
> 1) Use HiveContext of Spark
> 2) Use Hive on Spark (https://cwiki.apache.org/
> confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
> 3) Migrate Hive to Calcite to Spark SQL
>
>
> Regards
>
>


Best way to Hive to Spark migration

2018-04-04 Thread Pralabh Kumar
Hi Spark group

What's the best way to Migrate Hive to Spark

1) Use HiveContext of Spark
2) Use Hive on Spark (
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
)
3) Migrate Hive to Calcite to Spark SQL


Regards


Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it

2018-02-07 Thread Pralabh Kumar
Hi

Spark 2.0 doesn't support the STORED BY clause. Is there an alternative to
achieve the same?


Re: Kryo serialization failed: Buffer overflow : Broadcast Join

2018-02-02 Thread Pralabh Kumar
I am using spark 2.1.0

On Fri, Feb 2, 2018 at 5:08 PM, Pralabh Kumar 
wrote:

> Hi
>
> I am performing broadcast join where my small table is 1 gb .  I am
> getting following error .
>
> I am using
>
>
> org.apache.spark.SparkException:
> . Available: 0, required: 28869232. To avoid this, increase
> spark.kryoserializer.buffer.max value
>
>
>
> I increase the value to
>
> spark.conf.set("spark.kryoserializer.buffer.max","2g")
>
>
> But I am still getting the error .
>
> Please help
>
> Thx
>


Kryo serialization failed: Buffer overflow : Broadcast Join

2018-02-02 Thread Pralabh Kumar
Hi

I am performing a broadcast join where my small table is 1 GB. I am getting
the following error.

I am using


org.apache.spark.SparkException:
. Available: 0, required: 28869232. To avoid this, increase
spark.kryoserializer.buffer.max value



I increased the value to

spark.conf.set("spark.kryoserializer.buffer.max","2g")


But I am still getting the error .
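
One thing worth checking (a sketch, not a confirmed diagnosis):
spark.kryoserializer.buffer.max is read when the serializer is created, so
setting it on an already-running session may have no effect, and the accepted
value has to stay below 2048m. Passing it at application startup looks like
this (1g is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("broadcast-join")
  .set("spark.kryoserializer.buffer.max", "1g")  // must be below 2048m
val spark = SparkSession.builder().config(conf).getOrCreate()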

Please help

Thx


Does Spark and Hive use Same SQL parser : ANTLR

2018-01-18 Thread Pralabh Kumar
Hi


Do Hive and Spark use the same SQL parser provided by ANTLR? Do they
generate the same logical plan?

Please help on the same.


Regards
Pralabh Kumar


Spark build is failing in amplab Jenkins

2017-11-02 Thread Pralabh Kumar
Hi Dev

Spark build is failing in Jenkins


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83353/consoleFull


Python versions prior to 2.7 are not supported.
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
ERROR: Step 'Publish JUnit test result report' failed: No test report
files were found. Configuration error?


Please help



Regards

Pralabh Kumar


[SPARK-20199][ML] : Provided featureSubsetStrategy to GBTClassifier and GBTRegressor

2017-09-11 Thread Pralabh Kumar
Hi Developers

Can somebody look into this pull request? It is being reviewed by MLnick
<https://github.com/apache/spark/pull/18118/files/79702933d321051222073057b25305831df84c6d>,
sethah
<https://github.com/apache/spark/pull/18118/files/12d83aaa5f24e17304de760c5664b8ede55d678a>,
and mpjlu
<https://github.com/apache/spark/pull/18118/files/16ccbdfd8862c528c90fdde94c8ec20d6631126e>.


Please review it .


Regards
Pralabh Kumar


Re: How to tune the performance of Tpch query5 within Spark

2017-07-17 Thread Pralabh Kumar
Hi

To read the files in parallel, you can follow the code below.


import java.util.concurrent.{Callable, Executors, Future, TimeUnit}

import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Each table read is submitted as a separate task so the six reads run concurrently.
case class ReadData(fileName: String, spark: SparkSession) extends Callable[Dataset[Row]] {
  override def call(): Dataset[Row] = {
    spark.read.parquet(fileName)
    // spark.read.csv(fileName)
  }
}

val spark = SparkSession.builder()
  .appName("practice")
  .config("spark.scheduler.mode", "FAIR")
  .enableHiveSupport()
  .getOrCreate()

val pool = Executors.newFixedThreadPool(6)
val futures = new java.util.ArrayList[Future[Dataset[Row]]]()

for (fileName <- "orders,lineitem,customer,supplier,region,nation".split(",")) {
  futures.add(pool.submit(ReadData(fileName, spark)))
}

// Future.get() blocks until the corresponding read has finished.
val dfList = new ArrayBuffer[Dataset[Row]]()
for (result <- futures.asScala) {
  dfList += result.get()
}

pool.shutdown()
pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)

for (finalData <- dfList) {
  finalData.show()
}


This will read the data in parallel; the sequential reads are, I think, your
main bottleneck.

Regards
Pralabh Kumar



On Mon, Jul 17, 2017 at 6:25 PM, vaquar khan  wrote:

> Could you please let us know your Spark version?
>
>
> Regards,
> vaquar khan
>
>
> On Jul 17, 2017 12:18 AM, "163"  wrote:
>
>> I change the UDF but the performance seems still slow. What can I do else?
>>
>>
>> 在 2017年7月14日,下午8:34,Wenchen Fan  写道:
>>
>> Try to replace your UDF with Spark built-in expressions, it should be as
>> simple as `$"x" * (lit(1) - $"y")`.
>>
>> On 14 Jul 2017, at 5:46 PM, 163  wrote:
>>
>> I modify the tech query5 to DataFrame:
>>
>> val forders = 
>> spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/orders*”*).filter("o_orderdate
>>  < 1995-01-01 and o_orderdate >= 1994-01-01").select("o_custkey", 
>> "o_orderkey")
>> val flineitem = 
>> spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/lineitem")
>> val fcustomer = 
>> spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/customer")
>> val fsupplier = 
>> spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/supplier")
>> val fregion = 
>> spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/region*”*).where("r_name
>>  = 'ASIA'").select($"r_regionkey")
>> val fnation = 
>> spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/nation*”*)
>>
>> val decrease = udf { (x: Double, y: Double) => x * (1 - y) }
>>
>> val res =   flineitem.join(forders, $"l_orderkey" === forders("o_orderkey"))
>>  .join(fcustomer, $"o_custkey" === fcustomer("c_custkey"))
>>  .join(fsupplier, $"l_suppkey" === fsupplier("s_suppkey") && 
>> $"c_nationkey" === fsupplier("s_nationkey"))
>>  .join(fnation, $"s_nationkey" === fnation("n_nationkey"))
>>  .join(fregion, $"n_regionkey" === fregion("r_regionkey"))
>>  .select($"n_name", decrease($"l_extendedprice", 
>> $"l_discount").as("value"))
>>  .groupBy($"n_name")
>>  .agg(sum($"value").as("revenue"))
>>  .sort($"revenue".desc).show()
>>
>>
>> My environment is one master(Hdfs-namenode), four workers(HDFS-datanode), 
>> each with 40 cores and 128GB memory.  TPCH 100G stored on HDFS using parquet 
>> format.
>>
>> It executed in about 1.5m. I found that reading these 6 tables using
>> spark.read.parquet is sequential. How can I make this run in parallel?
>>
>> I've already set data locality and spark.default.parallelism,
>> spark.serializer, and I'm using G1, but the runtime is still not reduced.
>>
>> And is there any advice for tuning this performance?
>>
>> Thank you.
>>
>> Wenting He
>>
>>
>>
>>
>>


Re: Memory issue in pyspark for 1.6 mb file

2017-06-17 Thread Pralabh Kumar
Hi Naga

Is it failing because the driver memory is full or the executor memory is
full?

Can you please try setting the property spark.cleaner.ttl, so that older
RDDs/metadata also get cleared automatically?

Can you please provide the complete error stacktrace and a code snippet?


Regards
Pralabh Kumar



On Sun, Jun 18, 2017 at 12:06 AM, Naga Guduru  wrote:

> Hi,
>
> I am trying to load 1.6 mb excel file which has 16 tabs. We converted
> excel to csv and loaded 16 csv files to 8 tables. Job was running
> successful in 1st run in pyspark. When trying to run the same job 2 time,
> container getting killed due to memory issues.
>
> I am using unpersist and clearcache on all rdds and dataframes after each
> file loaded into table. Each csv file is loaded in sequence process ( for
> loop) as some of the files should go to same table. Job will run 15 min if
> it was success and 12-15 min if it was failed. If i increase the driver
> memory and executor memory to more than 5 gb, its getting success.
>
> My assumption is driver memory full, and unpersist clear cache not working.
>
> Error: physical memory of 2 gb used and virtual memory of 4.6 gb used.
>
> Spark 1.6 version running in Cloudera Enterprise .
>
> Please let me know, if you need any info.
>
>
> Thanks
>
>


Re: featureSubsetStrategy parameter for GradientBoostedTreesModel

2017-06-15 Thread Pralabh Kumar
Hi everyone

Currently GBT doesn't expose featureSubsetStrategy as exposed by Random
Forest.

GradientBoostedTrees in Spark has the feature subset strategy hardcoded to
"all" when calling random forest in DecisionTreeRegressor.scala:

val trees = RandomForest.run(data, oldStrategy, numTrees = 1,
featureSubsetStrategy = "all",


It should give the user the ability to set featureSubsetStrategy ("auto",
"all", "sqrt", "log2", "onethird"), the way random forest does.

This will help GBT to have randomness at feature level.

Jira SPARK-20199 <https://issues.apache.org/jira/browse/SPARK-20199>

Please let me know , if my understanding is correct.

Regards
Pralabh Kumar

On Fri, Jun 16, 2017 at 7:53 AM, Pralabh Kumar 
wrote:

> Hi everyone
>
> Currently GBT doesn't expose featureSubsetStrategy as exposed by Random
> Forest.
>
> .
> GradientBoostedTrees in Spark have hardcoded feature subset strategy to
> "all" while calling random forest in  DecisionTreeRegressor.scala
>
> val trees = RandomForest.run(data, oldStrategy, numTrees = 1,
> featureSubsetStrategy = "all",
>
>
> It should provide functionality to the user to set featureSubsetStrategy
> ("auto", "all" ,"sqrt" , "log2" , "onethird") ,
> the way random forest does.
>
> This will help GBT to have randomness at feature level.
>
> Jira SPARK-20199 <https://issues.apache.org/jira/browse/SPARK-20199>
>
> Please let me know , if my understanding is correct.
>
> Regards
> Pralabh Kumar
>


featureSubsetStrategy parameter for GradientBoostedTreesModel

2017-06-15 Thread Pralabh Kumar
Hi everyone

Currently GBT doesn't expose featureSubsetStrategy as exposed by Random
Forest.

GradientBoostedTrees in Spark has the feature subset strategy hardcoded to
"all" when calling random forest in DecisionTreeRegressor.scala:

val trees = RandomForest.run(data, oldStrategy, numTrees = 1,
featureSubsetStrategy = "all",


It should give the user the ability to set featureSubsetStrategy ("auto",
"all", "sqrt", "log2", "onethird"), the way random forest does.

This will help GBT to have randomness at feature level.
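
Once exposed, the user-facing API would presumably mirror
RandomForestClassifier, along these lines (a sketch; parameter values are
illustrative, and the setter is only available once SPARK-20199 lands):

import org.apache.spark.ml.classification.GBTClassifier

val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(50)
  .setFeatureSubsetStrategy("sqrt")  // one of "auto", "all", "sqrt", "log2", "onethird"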

Jira SPARK-20199 <https://issues.apache.org/jira/browse/SPARK-20199>

Please let me know , if my understanding is correct.

Regards
Pralabh Kumar