Re: spark driver pod stuck in Waiting: PodInitializing state in Kubernetes

2018-08-17 Thread purna pradeep
Resurfacing the question to get more attention.

Hello,
>
> I'm running a Spark 2.3 job on a Kubernetes cluster
>>
>> kubectl version
>>
>> Client Version: version.Info{Major:"1", Minor:"9",
>> GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b",
>> GitTreeState:"clean", BuildDate:"2018-02-09T21:51:06Z",
>> GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
>>
>> Server Version: version.Info{Major:"1", Minor:"8",
>> GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd",
>> GitTreeState:"clean", BuildDate:"2017-11-08T18:27:48Z",
>> GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
>>
>>
>>
>> When I run spark-submit against the k8s master, the driver pod gets stuck in the
>> Waiting: PodInitializing state.
>> In this case I have to manually kill the driver pod and submit a new job,
>> and then it works. How can this be handled in production?
>>
> This happens with executor pods as well
>

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25128
>
>
>>
>> This happens if I submit the jobs almost in parallel, i.e. submitting 5 jobs
>> one after the other nearly simultaneously.
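As a stopgap for production, one hedged approach is to automate the manual cleanup described above: periodically look for driver pods that have been stuck in PodInitializing longer than some timeout and delete them so the job can be resubmitted. Below is a minimal sketch using the Kubernetes Python client; the namespace, label selector, and timeout are assumptions, not details from the original report.

from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()                     # or config.load_incluster_config()
v1 = client.CoreV1Api()
timeout = timedelta(minutes=10)               # placeholder timeout

# Spark on Kubernetes labels its driver pods with spark-role=driver
# (assuming the default labels are in use).
pods = v1.list_namespaced_pod("default", label_selector="spark-role=driver")
for pod in pods.items:
    age = datetime.now(timezone.utc) - pod.metadata.creation_timestamp
    stuck = [
        cs for cs in (pod.status.container_statuses or [])
        if cs.state.waiting and cs.state.waiting.reason == "PodInitializing"
    ]
    if stuck and age > timeout:
        print("deleting stuck driver pod", pod.metadata.name)
        v1.delete_namespaced_pod(pod.metadata.name, "default")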
>>
>> I'm running Spark jobs on 20 nodes, each with the configuration below.
>>
>> I tried kubectl describe node on the node where the driver pod is
>> running, and this is what I got. I do see that resources are overcommitted, but I
>> expected the Kubernetes scheduler not to schedule pods if the node's resources are
>> overcommitted or the node is in the Not Ready state. In this case the node is in the Ready
>> state, but I observe the same behaviour when a node is in the "Not Ready" state.
>>
>>
>>
>> Name:   **
>>
>> Roles:  worker
>>
>> Labels: beta.kubernetes.io/arch=amd64
>>
>> beta.kubernetes.io/os=linux
>>
>> kubernetes.io/hostname=
>>
>> node-role.kubernetes.io/worker=true
>>
>> Annotations:node.alpha.kubernetes.io/ttl=0
>>
>>
>> volumes.kubernetes.io/controller-managed-attach-detach=true
>>
>> Taints: 
>>
>> CreationTimestamp:  Tue, 31 Jul 2018 09:59:24 -0400
>>
>> Conditions:
>>
>>   Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
>>   ----             ------  -----------------                 ------------------                ------                       -------
>>   OutOfDisk        False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
>>   MemoryPressure   False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
>>   DiskPressure     False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
>>   Ready            True    Tue, 14 Aug 2018 09:31:20 -0400   Sat, 11 Aug 2018 00:41:27 -0400   KubeletReady                 kubelet is posting ready status. AppArmor enabled
>>
>> Addresses:
>>
>>   InternalIP:  *
>>
>>   Hostname:**
>>
>> Capacity:
>>
>>  cpu: 16
>>
>>  memory:  125827288Ki
>>
>>  pods:110
>>
>> Allocatable:
>>
>>  cpu: 16
>>
>>  memory:  125724888Ki
>>
>>  pods:110
>>
>> System Info:
>>
>>  Machine ID: *
>>
>>  System UUID:**
>>
>>  Boot ID:1493028d-0a80-4f2f-b0f1-48d9b8910e9f
>>
>>  Kernel Version: 4.4.0-1062-aws
>>
>>  OS Image:   Ubuntu 16.04.4 LTS
>>
>>  Operating System:   linux
>>
>>  Architecture:   amd64
>>
>>  Container Runtime Version:  docker://Unknown
>>
>>  Kubelet Version:v1.8.3
>>
>>  Kube-Proxy Version: v1.8.3
>>
>> PodCIDR: **
>>
>> ExternalID:  **
>>
>> Non-terminated Pods: (11 in total)
>>
>>   Namespace    Name                                                CPU Requests  CPU Limits  Memory Requests  Memory Limits
>>   ---------    ----                                                ------------  ----------  ---------------  -------------
>>   kube-system  calico-node-gj5mb                                   250m (1%)     0 (0%)      0 (0%)           0 (0%)
>>   kube-system  kube-proxy-                                         100m (0%)     0 (0%)      0 (0%)           0 (0%)
>>   kube-system  prometheus-prometheus-node-exporter-9cntq           100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
>>   logging      elasticsearch-elasticsearch-data-69df997486-gqcwg   400m (2%)     1 (6%)      8Gi (6%)         16Gi (13%)
>>   logging

Pyspark error when converting string to timestamp in map function

2018-08-17 Thread Keith Chapman
Hi all,

I'm trying to create a dataframe enforcing a schema so that I can write it
to a parquet file. The schema has timestamps and I get an error with
pyspark. The following is a snippet of code that exhibits the problem,

df = sqlctx.range(1000)
schema = StructType([StructField('a', TimestampType(), True)])
df1 = sqlctx.createDataFrame(df.rdd.map(row_gen_func), schema)

row_gen_func is a function that returns timestamp strings of the form
"2018-03-21 11:09:44"

When I compile this with Spark 2.2 I get the following error,

raise TypeError("%s can not accept object %r in type %s" % (dataType, obj,
type(obj)))
TypeError: TimestampType can not accept object '2018-03-21 08:06:17' in
type 
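For what it's worth, TimestampType expects Python datetime objects rather than strings, so one possible fix (a minimal sketch, assuming row_gen_func currently returns timestamp strings like the one above) is to parse the string before building the Row:

from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField('a', TimestampType(), True)])

# Hypothetical version of row_gen_func: convert the timestamp string into a
# datetime object, which is what TimestampType accepts.
def row_gen_func(row):
    ts_string = "2018-03-21 11:09:44"   # placeholder for the real generated value
    return Row(a=datetime.strptime(ts_string, "%Y-%m-%d %H:%M:%S"))

# df and sqlctx as defined in the snippet above
df1 = sqlctx.createDataFrame(df.rdd.map(row_gen_func), schema)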

Regards,
Keith.

http://keith-chapman.com


Re: Two different Hive instances running

2018-08-17 Thread Patrick Alwell
You probably need to take a look at your hive-site.xml and see what the 
location of the Hive metastore is. As for beeline, you can explicitly use a 
particular HiveServer2 instance by passing its JDBC URL when you 
launch the client; e.g. beeline -u "jdbc:hive2://example.com:10000"

Try taking a look at this 
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-hive-metastore.html

There should be conf settings you can update to make sure you are using the 
same metastore as the instance of HiveServer.

Hive Wiki is a great resource as well ☺
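To illustrate the suggestion above, here is a hedged PySpark sketch of building a session that talks to the same metastore as HiveServer2. The thrift URI is a placeholder; take the real value from the hive.metastore.uris property in the hive-site.xml that HiveServer2 uses, or simply drop that hive-site.xml into Spark's conf directory. The same builder calls exist on the Java SparkSession API.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shared-metastore-example")
         .config("hive.metastore.uris", "thrift://metastore-host:9083")  # placeholder URI
         .enableHiveSupport()   # use the external Hive metastore rather than a local Derby one
         .getOrCreate())

spark.sql("SHOW DATABASES").show()   # should now match what beeline reports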

From: Fabio Wada 
Date: Friday, August 17, 2018 at 11:22 AM
To: "user@spark.apache.org" 
Subject: Two different Hive instances running

Hi,

I am executing an insert into a Hive table using SparkSession in Java. When I 
execute a select via beeline, I don't see the inserted data. And when I insert 
data using beeline, I don't see it via my program using SparkSession.

It looks like there are two different Hive instances running.

How can I point both SparkSession and beeline to the same Hive instance?

Thanks


Two different Hive instances running

2018-08-17 Thread Fabio Wada
Hi,

I am executing an insert into a Hive table using SparkSession in Java. When I
execute a select via beeline, I don't see the inserted data. And when I
insert data using beeline, I don't see it via my program using SparkSession.

It looks like there are two different Hive instances running.

How can I point both SparkSession and beeline to the same Hive instance?

Thanks


Re: [Spark Streaming] [ML]: Exception handling for the transform method of Spark ML pipeline model

2018-08-17 Thread sudododo
Hi,

Any help on this?

Thanks,






Re: java.nio.file.FileSystemException: /tmp/spark- .._cache : No space left on device

2018-08-17 Thread Jeevan K. Srivatsa
Hi Venkata,

On a quick glance, this looks like a file-related issue rather than an
executor issue. If the logs are not that important, I would clear the
/tmp/spark-events/ directory, assign suitable permissions (e.g., chmod
755) to it, and rerun the application.

chmod 755 /tmp/spark-events/
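If the real problem is simply that the device backing /tmp is full, another hedged option is to point Spark's scratch space (where the fetched jar and *_cache files from the stack trace live) at a volume with more room. A minimal sketch follows; the path is a placeholder that must exist on every node, and on YARN the node manager's local dirs take precedence over this setting.

from pyspark.sql import SparkSession

# Redirect Spark's local scratch directory away from the small /tmp volume.
spark = (SparkSession.builder
         .appName("local-dir-example")
         .config("spark.local.dir", "/data/spark-scratch")   # placeholder path
         .getOrCreate())

The same setting can also be passed on the command line with --conf spark.local.dir=/data/spark-scratch, or via SPARK_LOCAL_DIRS in spark-env.sh.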

Thanks and regards,
Jeevan K. Srivatsa


On Fri, 17 Aug 2018 at 15:20, Polisetti, Venkata Siva Rama Gopala Krishna <
vpolise...@spglobal.com> wrote:

> Hi
>
> I am getting the below exception when I run spark-submit on a Linux machine; can
> someone suggest a quick solution?
>
> Driver stacktrace:
>
> - Job 0 failed: count at DailyGainersAndLosersPublisher.scala:145, took
> 5.749450 s
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4
> in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage
> 0.0 (TID 6, 172.29.62.145, executor 0): java.nio.file.FileSystemException:
> /tmp/spark-523d5331-3884-440c-ac0d-f46838c2029f/executor-390c9cd7-217e-42f3-97cb-fa2734405585/spark-206d92c0-f0d3-443c-97b2-39494e2c5fdd/-4230744641534510169119_cache
> -> ./PublishGainersandLosers-1.0-SNAPSHOT-shaded-Gopal.jar: No space left
> on device
>
> at
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
>
> at
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>
> at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:253)
>
> at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581)
>
> at
> sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
>
> at java.nio.file.Files.copy(Files.java:1274)
>
> at
> org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:625)
>
> at org.apache.spark.util.Utils$.copyFile(Utils.scala:596)
>
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:473)
>
> at
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:696)
>
> at
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:688)
>
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>
> at org.apache.spark.executor.Executor.org
> $apache$spark$executor$Executor$$updateDependencies(Executor.scala:688)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:308)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>


Re: Use Spark extension points to implement row-level security

2018-08-17 Thread Maximiliano Patricio Méndez
Hi,

I've added table-level security using Spark extensions based on the ongoing
work proposed for Ranger in RANGER-2128. Following the same logic, you
could mask columns and work on the logical plan, but not filter or
skip rows, as those are not present in these hooks.

The only difficulty I found was integrating extensions with pyspark, since
in Python the SparkContext is always created through the constructor and
not via the Scala getOrCreate() method (I've sent an email regarding
this). But other than that, it works.
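For reference, a hedged sketch of wiring such an extension in from PySpark purely through configuration; the extension class name is hypothetical (it must be a Scala class implementing Function1[SparkSessionExtensions, Unit] available on the driver classpath), and whether an already-created Python session picks it up is exactly the getOrCreate() issue described above.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("row-level-security-example")
         # Hypothetical extensions class that injects the masking/filtering rules.
         .config("spark.sql.extensions", "com.example.RowFilterExtensions")
         .getOrCreate())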


On Fri, Aug 17, 2018, 03:56 Richard Siebeling  wrote:

> Hi,
>
> I'd like to implement some kind of row-level security and am thinking of
> adding additional filters to the logical plan possibly using the Spark
> extensions.
> Would this be feasible, for example using the injectResolutionRule?
>
> thanks in advance,
> Richard
>


java.nio.file.FileSystemException: /tmp/spark- .._cache : No space left on device

2018-08-17 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hi
I am getting the below exception when I run spark-submit on a Linux machine; can
someone suggest a quick solution?
Driver stacktrace:
- Job 0 failed: count at DailyGainersAndLosersPublisher.scala:145, took 
5.749450 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 
6, 172.29.62.145, executor 0): java.nio.file.FileSystemException: 
/tmp/spark-523d5331-3884-440c-ac0d-f46838c2029f/executor-390c9cd7-217e-42f3-97cb-fa2734405585/spark-206d92c0-f0d3-443c-97b2-39494e2c5fdd/-4230744641534510169119_cache
 -> ./PublishGainersandLosers-1.0-SNAPSHOT-shaded-Gopal.jar: No space left on 
device
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:253)
at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581)
at 
sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
at java.nio.file.Files.copy(Files.java:1274)
at 
org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:625)
at org.apache.spark.util.Utils$.copyFile(Utils.scala:596)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:473)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:696)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:688)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:688)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:308)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)






Re: Pass config file through spark-submit

2018-08-17 Thread James Starks
I accidentally got it working, though I don't thoroughly understand why (as far
as I know, the point is to let the executors refer to the conf file after it is
copied to their working directory). Basically it's a combination of the parameters
--conf, --files, and --driver-class-path, rather than any single parameter.

spark-submit --class pkg.to.MyApp --master local[*] --conf 
"spark.executor.extraClassPath=-Dconfig.file=" --files 
 --driver-class-path ""

--conf passes the conf file name, e.g. myfile.conf, as part of the Spark
executor class path directive.

--files passes the conf file path relative to the context root, e.g. when executing
under a directory that contains folders such as conf, logs,
work and so on; the conf file, i.e. myfile.conf, lives under the conf folder.

--driver-class-path points to the conf directory with an absolute path.
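Incidentally, regarding the quoted question below about reading the file on the executor side: anything shipped with --files can be resolved through SparkFiles. A minimal, hedged PySpark sketch (myfile.conf is a placeholder name):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def read_conf(_):
    # Files distributed with --files (or sc.addFile) resolve via SparkFiles.
    path = SparkFiles.get("myfile.conf")   # placeholder file name
    with open(path) as f:
        return [f.read()]

# Run the read on an executor and bring the contents back to the driver.
print(spark.sparkContext.parallelize([0], 1).flatMap(read_conf).first())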


‐‐‐ Original Message ‐‐‐
On August 17, 2018 3:00 AM, yujhe.li  wrote:

> So can you read the file on executor side?
> I think the file passed by --files my.app.conf would be added under
> classpath, and you can use it directly.






Re: Unable to see completed application in Spark 2 history web UI

2018-08-17 Thread Fawze Abujaber
Thanks Manu for your response.

I already checked the logs and didn't see anything that could help me
understand the issue.

The weirder thing is that I have a small CI cluster which runs on a single
NameNode, and there I do see the Spark2 jobs in the UI. I'm still not sure whether it
is related to the NameNode HA. I tried to change the log dir from the NameNode HA
to the active NameNode, like this:
http://server:8020/user/spark/spark2historyapplication in the spark2
default conf, but the UI still shows the path with the HA NameNode
even after a restart of Spark2.

The issue becomes more interesting :)

On Fri, Aug 17, 2018 at 2:01 AM Manu Zhang  wrote:

> Hi Fawze,
>
> Sorry but I'm not familiar with CM. Maybe you can look into the logs (or
> turn on DEBUG log).
>
> On Thu, Aug 16, 2018 at 3:05 PM Fawze Abujaber  wrote:
>
>> Hi Manu,
>>
>> I'm using Cloudera Manager in single user mode and every process runs as the
>> cloudera-scm user; cloudera-scm is a superuser, which is why I was confused
>> that this worked in Spark 1.6 but not in Spark 2.3
>>
>>
>> On Thu, Aug 16, 2018 at 5:34 AM Manu Zhang 
>> wrote:
>>
>>> If you are able to log onto the node where the UI has been launched, then
>>> try `ps -aux | grep HistoryServer` and the first column of output should be
>>> the user.
>>>
>>> On Wed, Aug 15, 2018 at 10:26 PM Fawze Abujaber 
>>> wrote:
>>>
 Thanks Manu. Do you know how I can see which user the UI is running as?
 I'm using Cloudera Manager and I created a user for Cloudera Manager
 called spark, but this didn't solve my issue, and here I'm
 trying to find out the user for the Spark history UI.

 On Wed, Aug 15, 2018 at 5:11 PM Manu Zhang 
 wrote:

> Hi Fawze,
>
> A) The file permission is currently hard coded to 770 (
> https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L287
> ).
> B) I think adding all users (including the UI user) to a group like spark will
> do.
>
>
> On Wed, Aug 15, 2018 at 6:38 PM Fawze Abujaber 
> wrote:
>
>> Hi Manu,
>>
>> Thanks for your response.
>>
>> Yes, I see, but it would still be interesting to know how I can see these
>> applications from the Spark history UI.
>>
>> How can I know which user I'm logged in as when I'm navigating the
>> Spark history UI?
>>
>> The Spark process is running as cloudera-scm, and the events written to
>> the spark2history folder in HDFS are written with the user name of whoever is
>> running the application and the group spark (770 permissions).
>>
>> I'm interested to see if I can force these logs to be written with
>> 774 or 775 permissions, or find another solution that enables R&D or
>> anyone else to investigate their application logs using the UI.
>>
>> For example: can I use a Spark conf such as
>> spark.eventLog.permissions=755?
>>
>> The 2 options I see here:
>>
>> A) Find a way to force these logs to be written with other
>> permissions.
>>
>> B) Find the user that the UI runs as, and create LDAP groups and a
>> user that can handle this.
>>
>> For example, create a group called spark, create the user that the
>> UI runs as, and add this user to the spark group.
>> I'm not sure whether this option will work, as I don't know if these steps
>> authenticate against LDAP.
>>
>

 --
 Take Care
 Fawze Abujaber

>>>
>>
>> --
>> Take Care
>> Fawze Abujaber
>>
>

-- 
Take Care
Fawze Abujaber


Use Spark extension points to implement row-level security

2018-08-17 Thread Richard Siebeling
Hi,

I'd like to implement some kind of row-level security and am thinking of
adding additional filters to the logical plan possibly using the Spark
extensions.
Would this be feasible, for example using the injectResolutionRule?

thanks in advance,
Richard