Re: Spark standalone - reading kerberos hdfs

2021-01-21 Thread Sudhir Babu Pothineni
Any other insights into this issue? I tried multiple ways to supply the keytab to the executor. Does Spark standalone not support Kerberos? > On Jan 8, 2021, at 1:53 PM, Sudhir Babu Pothineni wrote: > In case of Spark on YARN, the Application Master shares the token.
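
One pattern that comes up for supplying a keytab to executors is shipping it with the job and logging in inside the tasks. A minimal sketch, assuming a hypothetical principal, keytab, and HDFS path; standalone mode does not distribute delegation tokens the way the YARN Application Master does, so this is one of the workaround attempts discussed in the thread, not a supported mechanism:

    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.spark.SparkFiles
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("spark://spark-standalone:7077")
      .getOrCreate()

    // Ship the keytab to every executor's working directory.
    spark.sparkContext.addFile("/local/path/user.keytab")

    // The driver needs its own login to compute the input splits.
    UserGroupInformation.loginUserFromKeytab(
      "user@EXAMPLE.COM", "/local/path/user.keytab")

    val rows = spark.sparkContext
      .textFile("hdfs://kerberized-nn:8020/data/input")  // hypothetical path
      .mapPartitions { iter =>
        // Log in on the executor before this partition touches HDFS.
        UserGroupInformation.loginUserFromKeytab(
          "user@EXAMPLE.COM", SparkFiles.get("user.keytab"))
        iter
      }
      .count()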

Re: Spark standalone - reading kerberos hdfs

2021-01-08 Thread Sudhir Babu Pothineni
…can be obtained by Spark. > Please check the logs... > On Fri, 8 Jan 2021, 18:51 Sudhir Babu Pothineni wrote: >> I spin up a Spark standalone cluster (spark.authenticate=false) and submit a job which reads remote kerberized HDFS, …

Spark standalone - reading kerberos hdfs

2021-01-08 Thread Sudhir Babu Pothineni
I spin up a Spark standalone cluster (spark.authenticate=false) and submit a job which reads remote kerberized HDFS:

    val spark = SparkSession.builder()
      .master("spark://spark-standalone:7077")
      .getOrCreate()
    UserGroupInformation.loginUserFromKeytab(principal, …
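
On the driver side, the keytab login generally has to happen before the first filesystem access, i.e. before the session initializes the Hadoop FileSystem. A sketch assuming hypothetical principal, keytab, and path values:

    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.spark.sql.SparkSession

    // Placeholders; real values come from the cluster's Kerberos setup.
    val principal = "user@EXAMPLE.COM"
    val keytab = "/local/path/user.keytab"

    // Log in before the session (and hence the Hadoop FileSystem) is created.
    UserGroupInformation.loginUserFromKeytab(principal, keytab)

    val spark = SparkSession.builder()
      .master("spark://spark-standalone:7077")
      .getOrCreate()

    // Authenticates as the driver's logged-in user; executors in standalone
    // mode still need credentials of their own.
    spark.read.textFile("hdfs://kerberized-nn:8020/data/input").count()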

Re: spark-sklearn

2019-04-08 Thread Sudhir Babu Pothineni
…if you want np.unique you have to convert the output:

    udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))

> On Mon, 8 Apr 2019 at 11:43, Sudhir Babu Pothineni wrote: >> …

Re: spark-sklearn

2019-04-08 Thread Sudhir Babu Pothineni
> Trying to run tests in spark-sklearn; has anybody seen the exception below?
>
> pip freeze:
>
> nose==1.3.7
> numpy==1.16.1
> pandas==0.19.2
> python-dateutil==2.7.5
> pytz==2018.9
> scikit-learn==0.19.2
> scipy==1.2.0
> six==1.12.0
> spark-sklearn==0.3.0
>
> Spark version: …

ORC file stripe statistics in Spark

2016-09-27 Thread Sudhir Babu Pothineni
I am trying to get the number of rows in each stripe of an ORC file; hiveContext.orcFile doesn't exist anymore? I am using Spark 1.6.0:

    scala> val hiveSqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveSqlContext: org.apache.spark.sql.hive.HiveContext = …
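
The old hiveContext.orcFile shortcut aside, per-stripe row counts can be read directly with the Hive ORC reader API. A sketch assuming a hypothetical file path and the hive-exec classes available on the Spark 1.6 classpath:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hive.ql.io.orc.OrcFile
    import scala.collection.JavaConverters._

    val conf = new Configuration()
    val reader = OrcFile.createReader(
      new Path("hdfs:///path/to/file.orc"),  // hypothetical path
      OrcFile.readerOptions(conf))

    // Each StripeInformation entry carries per-stripe metadata,
    // including the row count.
    reader.getStripes.asScala.zipWithIndex.foreach { case (stripe, i) =>
      println(s"stripe $i: ${stripe.getNumberOfRows} rows")
    }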

Re: Zero Data Loss in Spark with Kafka

2016-08-23 Thread Sudhir Babu Pothineni
Saving offsets to ZooKeeper is the old approach; checkpointing internally saves the offsets to HDFS (or wherever the checkpoint location points). More details here: http://spark.apache.org/docs/latest/streaming-kafka-integration.html On Tue, Aug 23, 2016 at 10:30 AM, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
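
With the direct stream (the Spark 1.6-era Kafka 0.8 integration), the usual pattern is to create the StreamingContext through the checkpoint directory so offsets are recovered after a restart. A sketch with hypothetical broker, topic, and checkpoint path:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///user/app/checkpoints"  // hypothetical

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-direct")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)  // offsets land here with the DAG metadata
      val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
      val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("events"))
      stream.foreachRDD(rdd => println(s"batch of ${rdd.count()} records"))
      ssc
    }

    // Recover from the checkpoint if one exists, otherwise build fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()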

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Sudhir Babu Pothineni
>>>> …that someone might just start saying that Kudu has a difficult lineage as well. After all, dynastic rules dictate. Personally, I feel that if something stores my data compressed and makes me access it faster, I do…

Re: Limit pyspark.daemon threads

2016-06-15 Thread Sudhir Babu Pothineni
Hi Ken, it may also be related to Grid Engine job scheduling. If it is a 16-core machine (virtual cores?), Grid Engine allocates 16 slots; if you use 'max' scheduling it will send 16 processes to the same machine one after another, and on top of that each Spark job has its own executors. Limit the number of jobs…
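
One way to keep a single node from being oversubscribed, independent of what the Grid Engine scheduler does, is to cap what each Spark application may take. A sketch of the relevant standalone-mode settings; the values here are illustrative, not recommendations:

    import org.apache.spark.SparkConf

    // Cap the application's total cores and the cores per executor so
    // several jobs landing on one 16-core node cannot each spawn 16
    // worker processes.
    val conf = new SparkConf()
      .setAppName("bounded-job")
      .set("spark.cores.max", "8")       // total cores for this application
      .set("spark.executor.cores", "4")  // cores per executor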