How to start two Workers connected to two different masters

2019-02-27 Thread onmstester onmstester
I have two Java applications sharing the same Spark cluster; the applications
run on different servers.

Based on my experience, if the Spark driver (inside the Java application)
connects remotely to the Spark master (which is running on a different node),
then the time to submit a job increases significantly compared with the
scenario where the app, driver, and master all run on the same node.

So I need a Spark master alongside each of my applications, i.e. two masters.

On the other hand, I don't want to waste resources by dividing the Spark
cluster into two physically separated groups (two separate clusters), so I
want to start two slaves on each node of the cluster, each slave connecting to
one of the masters and each able to access all of the cores. The problem is
that Spark won't let me: first it complained that a worker is already running
("stop it first"). I then set SPARK_WORKER_INSTANCES to 2, but it starts both
slaves against one of the masters on the first try!





Sent using https://www.zoho.com/mail/

Faster Spark ML training using accelerators

2019-02-27 Thread inaccel
If you want to run your Spark ML applications much faster (up to 14x), you can
now use the Accelerated ML suite from inaccel on AWS. The Accelerated ML suite
lets you offload the most computationally intensive parts of the ML tasks to
FPGA hardware accelerators without having to change your code at all (it
supports Scala, Java and Python). Instead of running "spark-submit" you only
have to use "spark-submit --inaccel". The Accelerated ML suite is packaged as
a Docker image for seamless integration with Spark. You can test it on AWS
using the F1 instances: https://aws.amazon.com/marketplace/pp/B07KXQLTJ3 and
you can achieve up to 14x faster training of Spark ML applications and up to
4x lower cost.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: to_avro and from_avro not working with struct type in spark 2.4

2019-02-27 Thread Gabor Somogyi
Hi,

I was dealing with Avro lately and most of the time such issues have something
to do with the schema.
One thing I've pinpointed quickly (where I was struggling too) is that the
name field should be nullable, but the result is still not correct, so further
digging is needed...

scala> val expectedSchema = StructType(Seq(StructField("name",
StringType,true),StructField("age", IntegerType, false)))
expectedSchema: org.apache.spark.sql.types.StructType =
StructType(StructField(name,StringType,true),
StructField(age,IntegerType,false))

scala> val avroTypeStruct =
SchemaConverters.toAvroType(expectedSchema).toString
avroTypeStruct: String =
{"type":"record","name":"topLevelRecord","fields":[{"name":"name","type":["string","null"]},{"name":"age","type":"int"}]}

scala> dfKV.select(from_avro('value, avroTypeStruct)).show
+------------------------+
|from_avro(value, struct)|
+------------------------+
|         [Mary Jane, 25]|
|         [Mary Jane, 25]|
+------------------------+

BR,
G


On Wed, Feb 27, 2019 at 7:43 AM Hien Luu  wrote:

> Hi,
>
> I ran into a pretty weird issue with to_avro and from_avro where it was not
> able to parse the data in a struct correctly.  Please see the simple and
> self contained example below. I am using Spark 2.4.  I am not sure if I
> missed something.
>
> This is how I start the spark-shell on my Mac:
>
> ./bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
>
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
>
>
> spark.version
>
> val df = Seq((1, "John Doe",  30), (2, "Mary Jane", 25)).toDF("id", "name",
> "age")
>
> val dfStruct = df.withColumn("value", struct("name","age"))
>
> dfStruct.show
> dfStruct.printSchema
>
> val dfKV = dfStruct.select(to_avro('id).as("key"),
> to_avro('value).as("value"))
>
> val expectedSchema = StructType(Seq(StructField("name", StringType,
> false),StructField("age", IntegerType, false)))
>
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
>
> val avroTypeStr = s"""
>   |{
>   |  "type": "int",
>   |  "name": "key"
>   |}
> """.stripMargin
>
>
> dfKV.select(from_avro('key, avroTypeStr)).show
>
> // output
> +-------------------+
> |from_avro(key, int)|
> +-------------------+
> |                  1|
> |                  2|
> +-------------------+
>
> dfKV.select(from_avro('value, avroTypeStruct)).show
>
> // output
> +------------------------+
> |from_avro(value, struct)|
> +------------------------+
> |                   [, 9]|
> |                   [, 9]|
> +------------------------+
>
> Please help and thanks in advance.
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark Streaming - Problem to manage offset Kafka and starts from the beginning.

2019-02-27 Thread Gabor Somogyi
Hi Akshay,

The feature you've mentioned has a default value of 7 days...

BR,
G


On Wed, Feb 27, 2019 at 7:38 AM Akshay Bhardwaj <
akshay.bhardwaj1...@gmail.com> wrote:

> Hi Guillermo,
>
> What was the interval between restarting the Spark job? As a feature in
> Kafka, a broker deletes offsets for a consumer group after 24 hours of
> inactivity.
> In such a case, the newly started Spark Streaming job will read offsets
> from the beginning for the same groupId.
>
> Akshay Bhardwaj
> +91-97111-33849
>
>
> On Thu, Feb 21, 2019 at 9:08 PM Gabor Somogyi 
> wrote:
>
>> From the info you've provided not much to say.
>> Maybe you could collect sample app, logs etc, open a jira and we can take
>> a deeper look at it...
>>
>> BR,
>> G
>>
>>
>> On Thu, Feb 21, 2019 at 4:14 PM Guillermo Ortiz 
>> wrote:
>>
>>> I'm working with Spark Streaming 2.0.2 and Kafka 1.0.0 using the Direct
>>> Stream connector. I consume data from Kafka and autosave the offsets.
>>> I can see Spark committing the last processed offsets in the logs.
>>> Sometimes, when I restart Spark with the same groupId, it starts from the
>>> beginning.
>>>
>>> Why could this happen? It only happens rarely.
>>>
>>
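For reference, the commit pattern under discussion (a spark-streaming-kafka-0-10
direct stream that commits offsets back to Kafka only after each batch has been
processed) looks roughly like the sketch below. This is a generic sketch, not
Guillermo's code: the broker address, topic, group id and batch interval are
placeholders. Offsets committed to Kafka are still purged by the broker after
offsets.retention.minutes of group inactivity, at which point auto.offset.reset
decides where a restarted job begins, so checkpointing or an external offset
store is needed if the job can stay down longer than that.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object CommitAfterProcessing {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("commit-after-processing")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",               // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "my-group",                  // placeholder
      "auto.offset.reset"  -> "latest",
      // commit manually so stored offsets reflect processed data, not just polled data
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreach(record => println(record.value))   // stand-in for the real processing
      // commit only after the batch has been handled
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}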


Re: Spark Streaming - Problem to manage offset Kafka and starts from the beginning.

2019-02-27 Thread Akshay Bhardwaj
Hi Gabor,

I am talking about offset.retention.minutes which is set default as 1440
(or 24 hours)

Akshay Bhardwaj
+91-97111-33849


On Wed, Feb 27, 2019 at 4:47 PM Gabor Somogyi 
wrote:

> Hi Akshay,
>
> The feature you've mentioned has a default value of 7 days...
>
> BR,
> G
>
>
> On Wed, Feb 27, 2019 at 7:38 AM Akshay Bhardwaj <
> akshay.bhardwaj1...@gmail.com> wrote:
>
>> Hi Guillermo,
>>
>> What was the interval between restarting the Spark job? As a feature in
>> Kafka, a broker deletes offsets for a consumer group after 24 hours of
>> inactivity.
>> In such a case, the newly started Spark Streaming job will read offsets
>> from the beginning for the same groupId.
>>
>> Akshay Bhardwaj
>> +91-97111-33849
>>
>>
>> On Thu, Feb 21, 2019 at 9:08 PM Gabor Somogyi 
>> wrote:
>>
>>> From the info you've provided not much to say.
>>> Maybe you could collect sample app, logs etc, open a jira and we can
>>> take a deeper look at it...
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Thu, Feb 21, 2019 at 4:14 PM Guillermo Ortiz 
>>> wrote:
>>>
 I'm working with Spark Streaming 2.0.2 and Kafka 1.0.0 using the Direct
 Stream connector. I consume data from Kafka and autosave the offsets.
 I can see Spark committing the last processed offsets in the logs.
 Sometimes, when I restart Spark with the same groupId, it starts from the
 beginning.

 Why could this happen? It only happens rarely.

>>>


Spark on k8s - map persistentStorage for data spilling

2019-02-27 Thread Tomasz Krol
Hey Guys,

I hope someone will be able to help me, as I've been stuck on this for a
while :) Basically I am running some jobs on Kubernetes as per the
documentation:

https://spark.apache.org/docs/latest/running-on-kubernetes.html

All works fine; however, if I run queries on a bigger data volume, the jobs
fail because there is not enough space in the /var/data/spark-1xxx directory.

Obviously the reason for this is that the mounted emptyDir doesn't have enough
space.

I also mounted a PVC to the driver and executor pods, which I can see during
the runtime. I am wondering if someone knows how to make the data spill to a
different directory (i.e. my persistent storage directory) instead of the
emptyDir with its limited space, or whether I can somehow back the emptyDir
with my PVC. Basically, at the moment I can't run any jobs as they all fail
due to insufficient space in that /var/data directory.

Thanks
-- 
Tomasz Krol
patric...@gmail.com


Re: Spark Streaming - Problem to manage offset Kafka and starts from the beginning.

2019-02-27 Thread Guillermo Ortiz
I'm going to check the value, but I didn't change it. Normally the process is
always running, but sometimes I have to restart it to apply some changes.
Sometimes it starts from the beginning and other times it continues from the
last offset.

El mié., 27 feb. 2019 a las 12:25, Akshay Bhardwaj (<
akshay.bhardwaj1...@gmail.com>) escribió:

> Hi Gabor,
>
> I am talking about offset.retention.minutes which is set default as 1440
> (or 24 hours)
>
> Akshay Bhardwaj
> +91-97111-33849
>
>
> On Wed, Feb 27, 2019 at 4:47 PM Gabor Somogyi 
> wrote:
>
>> Hi Akshay,
>>
>> The feature you've mentioned has a default value of 7 days...
>>
>> BR,
>> G
>>
>>
>> On Wed, Feb 27, 2019 at 7:38 AM Akshay Bhardwaj <
>> akshay.bhardwaj1...@gmail.com> wrote:
>>
>>> Hi Guillermo,
>>>
>>> What was the interval between restarting the Spark job? As a feature in
>>> Kafka, a broker deletes offsets for a consumer group after 24 hours of
>>> inactivity.
>>> In such a case, the newly started Spark Streaming job will read offsets
>>> from the beginning for the same groupId.
>>>
>>> Akshay Bhardwaj
>>> +91-97111-33849
>>>
>>>
>>> On Thu, Feb 21, 2019 at 9:08 PM Gabor Somogyi 
>>> wrote:
>>>
 From the info you've provided not much to say.
 Maybe you could collect sample app, logs etc, open a jira and we can
 take a deeper look at it...

 BR,
 G


 On Thu, Feb 21, 2019 at 4:14 PM Guillermo Ortiz 
 wrote:

> I'm working with Spark Streaming 2.0.2 and Kafka 1.0.0 using the Direct
> Stream connector. I consume data from Kafka and autosave the offsets.
> I can see Spark committing the last processed offsets in the logs.
> Sometimes, when I restart Spark with the same groupId, it starts from the
> beginning.
>
> Why could this happen? It only happens rarely.
>



Re: Spark Streaming - Problem to manage offset Kafka and starts from the beginning.

2019-02-27 Thread Akshay Bhardwaj
Hi Gabor,

I guess you are looking at Kafka 2.1 but Guillermo mentioned initially that
they are working with Kafka 1.0

Akshay Bhardwaj
+91-97111-33849


On Wed, Feb 27, 2019 at 5:41 PM Gabor Somogyi 
wrote:

> Where exactly? In Kafka broker configuration section here it's 10080:
> https://kafka.apache.org/documentation/
>
> offsets.retention.minutes After a consumer group loses all its consumers
> (i.e. becomes empty) its offsets will be kept for this retention period
> before getting discarded. For standalone consumers (using manual
> assignment), offsets will be expired after the time of last commit plus
> this retention period. int 10080 [1,...] high read-only
>
> On Wed, Feb 27, 2019 at 1:04 PM Guillermo Ortiz 
> wrote:
>
>> I'm going to check the value, but I didn't change it. Normally the process is
>> always running, but sometimes I have to restart it to apply some changes.
>> Sometimes it starts from the beginning and other times it continues from the
>> last offset.
>>
>> El mié., 27 feb. 2019 a las 12:25, Akshay Bhardwaj (<
>> akshay.bhardwaj1...@gmail.com>) escribió:
>>
>>> Hi Gabor,
>>>
>>> I am talking about offset.retention.minutes which is set default as
>>> 1440 (or 24 hours)
>>>
>>> Akshay Bhardwaj
>>> +91-97111-33849
>>>
>>>
>>> On Wed, Feb 27, 2019 at 4:47 PM Gabor Somogyi 
>>> wrote:
>>>
 Hi Akshay,

 The feature you've mentioned has a default value of 7 days...

 BR,
 G


 On Wed, Feb 27, 2019 at 7:38 AM Akshay Bhardwaj <
 akshay.bhardwaj1...@gmail.com> wrote:

> Hi Guillermo,
>
> What was the interval between restarting the Spark job? As a feature in
> Kafka, a broker deletes offsets for a consumer group after 24 hours of
> inactivity.
> In such a case, the newly started Spark Streaming job will read offsets
> from the beginning for the same groupId.
>
> Akshay Bhardwaj
> +91-97111-33849
>
>
> On Thu, Feb 21, 2019 at 9:08 PM Gabor Somogyi <
> gabor.g.somo...@gmail.com> wrote:
>
>> From the info you've provided not much to say.
>> Maybe you could collect sample app, logs etc, open a jira and we can
>> take a deeper look at it...
>>
>> BR,
>> G
>>
>>
>> On Thu, Feb 21, 2019 at 4:14 PM Guillermo Ortiz 
>> wrote:
>>
>>> I'm working with Spark Streaming 2.0.2 and Kafka 1.0.0 using the Direct
>>> Stream connector. I consume data from Kafka and autosave the offsets.
>>> I can see Spark committing the last processed offsets in the logs.
>>> Sometimes, when I restart Spark with the same groupId, it starts from the
>>> beginning.
>>>
>>> Why could this happen? It only happens rarely.
>>>
>>


Re: Spark Streaming - Problem to manage offset Kafka and starts from the beginning.

2019-02-27 Thread Gabor Somogyi
Where exactly? In Kafka broker configuration section here it's 10080:
https://kafka.apache.org/documentation/

offsets.retention.minutes After a consumer group loses all its consumers
(i.e. becomes empty) its offsets will be kept for this retention period
before getting discarded. For standalone consumers (using manual
assignment), offsets will be expired after the time of last commit plus
this retention period. int 10080 [1,...] high read-only

On Wed, Feb 27, 2019 at 1:04 PM Guillermo Ortiz 
wrote:

> I'm going to check the value, but I didn't change it. Normally the process is
> always running, but sometimes I have to restart it to apply some changes.
> Sometimes it starts from the beginning and other times it continues from the
> last offset.
>
> El mié., 27 feb. 2019 a las 12:25, Akshay Bhardwaj (<
> akshay.bhardwaj1...@gmail.com>) escribió:
>
>> Hi Gabor,
>>
>> I am talking about offset.retention.minutes which is set default as 1440
>> (or 24 hours)
>>
>> Akshay Bhardwaj
>> +91-97111-33849
>>
>>
>> On Wed, Feb 27, 2019 at 4:47 PM Gabor Somogyi 
>> wrote:
>>
>>> Hi Akshay,
>>>
>>> The feature you've mentioned has a default value of 7 days...
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Wed, Feb 27, 2019 at 7:38 AM Akshay Bhardwaj <
>>> akshay.bhardwaj1...@gmail.com> wrote:
>>>
 Hi Guillermo,

 What was the interval between restarting the Spark job? As a feature in
 Kafka, a broker deletes offsets for a consumer group after 24 hours of
 inactivity.
 In such a case, the newly started Spark Streaming job will read offsets
 from the beginning for the same groupId.

 Akshay Bhardwaj
 +91-97111-33849


 On Thu, Feb 21, 2019 at 9:08 PM Gabor Somogyi <
 gabor.g.somo...@gmail.com> wrote:

> From the info you've provided not much to say.
> Maybe you could collect sample app, logs etc, open a jira and we can
> take a deeper look at it...
>
> BR,
> G
>
>
> On Thu, Feb 21, 2019 at 4:14 PM Guillermo Ortiz 
> wrote:
>
>> I'm working with Spark Streaming 2.0.2 and Kafka 1.0.0 using the Direct
>> Stream connector. I consume data from Kafka and autosave the offsets.
>> I can see Spark committing the last processed offsets in the logs.
>> Sometimes, when I restart Spark with the same groupId, it starts from the
>> beginning.
>>
>> Why could this happen? It only happens rarely.
>>
>


Re: Spark Streaming - Problem to manage offset Kafka and starts from the beginning.

2019-02-27 Thread Gabor Somogyi
I mixed it up with the Spark version. It seems the issue is different, based
on Guillermo's last mail.

On Wed, Feb 27, 2019 at 1:16 PM Akshay Bhardwaj <
akshay.bhardwaj1...@gmail.com> wrote:

> Hi Gabor,
>
> I guess you are looking at Kafka 2.1 but Guillermo mentioned initially
> that they are working with Kafka 1.0
>
> Akshay Bhardwaj
> +91-97111-33849
>
>
> On Wed, Feb 27, 2019 at 5:41 PM Gabor Somogyi 
> wrote:
>
>> Where exactly? In Kafka broker configuration section here it's 10080:
>> https://kafka.apache.org/documentation/
>>
>> offsets.retention.minutes After a consumer group loses all its consumers
>> (i.e. becomes empty) its offsets will be kept for this retention period
>> before getting discarded. For standalone consumers (using manual
>> assignment), offsets will be expired after the time of last commit plus
>> this retention period. int 10080 [1,...] high read-only
>>
>> On Wed, Feb 27, 2019 at 1:04 PM Guillermo Ortiz 
>> wrote:
>>
>>> I'm going to check the value, but I didn't change it. Normally the process is
>>> always running, but sometimes I have to restart it to apply some changes.
>>> Sometimes it starts from the beginning and other times it continues from the
>>> last offset.
>>>
>>> El mié., 27 feb. 2019 a las 12:25, Akshay Bhardwaj (<
>>> akshay.bhardwaj1...@gmail.com>) escribió:
>>>
 Hi Gabor,

 I am talking about offset.retention.minutes which is set default as
 1440 (or 24 hours)

 Akshay Bhardwaj
 +91-97111-33849


 On Wed, Feb 27, 2019 at 4:47 PM Gabor Somogyi <
 gabor.g.somo...@gmail.com> wrote:

> Hi Akshay,
>
> The feature you've mentioned has a default value of 7 days...
>
> BR,
> G
>
>
> On Wed, Feb 27, 2019 at 7:38 AM Akshay Bhardwaj <
> akshay.bhardwaj1...@gmail.com> wrote:
>
>> Hi Guillermo,
>>
>> What was the interval between restarting the Spark job? As a feature in
>> Kafka, a broker deletes offsets for a consumer group after 24 hours of
>> inactivity.
>> In such a case, the newly started Spark Streaming job will read offsets
>> from the beginning for the same groupId.
>>
>> Akshay Bhardwaj
>> +91-97111-33849
>>
>>
>> On Thu, Feb 21, 2019 at 9:08 PM Gabor Somogyi <
>> gabor.g.somo...@gmail.com> wrote:
>>
>>> From the info you've provided not much to say.
>>> Maybe you could collect sample app, logs etc, open a jira and we can
>>> take a deeper look at it...
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Thu, Feb 21, 2019 at 4:14 PM Guillermo Ortiz <
>>> konstt2...@gmail.com> wrote:
>>>
 I'm working with Spark Streaming 2.0.2 and Kafka 1.0.0 using the Direct
 Stream connector. I consume data from Kafka and autosave the offsets.
 I can see Spark committing the last processed offsets in the logs.
 Sometimes, when I restart Spark with the same groupId, it starts from the
 beginning.

 Why could this happen? It only happens rarely.

>>>


Spark 2.4.0 Master going down

2019-02-27 Thread lokeshkumar
Hi All

We are running Spark version 2.4.0 and run a few Spark Streaming jobs
listening on Kafka topics. We receive an average of 10-20 messages per second,
and the Spark master has been going down after 1-2 hours of running; the
exception is given below. Along with that, the Spark executors also get killed.

This was not happening with Spark 2.1.1; it started happening with Spark
2.4.0. Any help/suggestion is appreciated.

The exception that we see is:

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any
reply from 192.168.43.167:40007 in 120 seconds. This timeout is controlled
by spark.rpc.askTimeout
at
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Failure.recover(Try.scala:216)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at
org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at
scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
at
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at 
scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:157)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
at
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
at
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at 
scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at 
scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
at
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
at
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at
scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:157)
at
org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:206)
at
org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:243)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply
from 192.168.43.167:40007 in 120 seconds
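A common first mitigation for this kind of RpcTimeoutException, while the real
cause (GC pauses, an overloaded master, or network problems) is being
investigated, is to raise the RPC timeouts. A minimal sketch with placeholder
values; 300s is only an example, not a recommendation from this thread:

import org.apache.spark.SparkConf

// spark.rpc.askTimeout falls back to spark.network.timeout (default 120s),
// which is the 120-second limit mentioned in the stack trace above.
val conf = new SparkConf()
  .setAppName("streaming-job")            // hypothetical app name
  .set("spark.network.timeout", "300s")
  .set("spark.rpc.askTimeout", "300s")

Raising the timeout only hides the symptom if the master or executors are
genuinely overloaded, so the master's GC behaviour and resource usage are
worth checking as well.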



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Hadoop free spark on kubernetes => NoClassDefFound

2019-02-27 Thread Sommer Tobias
Hi,

we are having problems using a custom Hadoop lib in a Spark image when running
it on a Kubernetes cluster while following the steps of the documentation.
Details in the description below.

Has anyone else had similar problems? Is there something missing in the setup
below? Or is this a bug?







Hadoop free spark on kubernetes

Using custom hadoop libraries in a spark image does not work when following
the steps of the documentation (*) for running SparkPi on a kubernetes
cluster.

* Usage of the hadoop free build:
https://spark.apache.org/docs/2.4.0/hadoop-provided.html

Steps:
1. Download hadoop free spark: spark-2.4.0-bin-without-hadoop.tgz
2. Build a spark image without hadoop from this with docker-image-tool.sh
3. Create a Dockerfile that adds an image layer with a custom hadoop on top of
   the spark-without-hadoop image (see: Dockerfile and conf/spark-env.sh below)
4. Use the custom hadoop spark image to run the spark examples
   (see: k8s submit below)
5. This produces a JNI error (see message below); expected instead is the
   computation of pi.









Dockerfile

# some spark base image built via:
#  $SPARK_HOME/bin/docker-image-tool.sh  -r  -t sometag build
#  $SPARK_HOME/bin/docker-image-tool.sh  -r  -t sometag push
#
# docker build this to: >> docker build -t reg../...:1.0.0 .
#
# use spark 2.4.0 without hadoop as base image
#
FROM registry/spark-without-hadoop:2.4.0

ENV SPARK_HOME /opt/spark

# setup custom hadoop
COPY libs/hadoop-2.9.2 /opt/hadoop
ENV HADOOP_HOME /opt/hadoop

COPY conf/spark-env.sh ${SPARK_HOME}/conf/spark-env.sh

WORKDIR /opt/spark/work-dir







conf/spark-env.sh

#!/usr/bin/env bash

# echo commands to the terminal output
set -ex

# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)

echo $SPARK_DIST_CLASSPATH





submit command to kubernetes

$SPARK_HOME/bin/spark-submit \
  --master k8s://... \
  --name sparkpiexample-custom-hadoop-original \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:-ResizePLAB" \
  --conf spark.kubernetes.memoryOverheadFactor=0.2 \
  --conf spark.kubernetes.container.image=registry/spark-custom-hadoop-original:1.0.0 \
  --conf spark.kubernetes.container.image.pullSecrets=... \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar





Retrieved error message:

++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ SPARK_K8S_CMD=driver
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=100.96.6.123 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 7 more




Regards,

Tobias Sommer
M.Sc. (Uni)
Team eso-IN-Swarm
Software Engineer


e.solutions GmbH
Despag-Str. 4a, 85055 Ingolstadt, Germany

Phone +49-8458-3332-1219
Fax +49-8458-3332-2219
tobias.som...@esolutions.de

Registered Office:
Despag-Str. 4a, 85055 Ingolstadt, Germany

e.solutions GmbH
Managing Directors Uwe R

Re: to_avro and from_avro not working with struct type in spark 2.4

2019-02-27 Thread Hien Luu
Thanks for looking into this. Does this mean string fields should always be
nullable?

You are right that the result is not yet correct and further digging is
needed :(
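To make the nullability point concrete, here is a small spark-shell sketch
comparing what SchemaConverters.toAvroType produces for a non-nullable versus a
nullable name field. The second expected output is the one already posted above
by Gabor; the first is what the same call is expected to produce when name is
declared non-nullable.

import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

// name declared non-nullable: the Avro field type comes out as a plain "string"
val strictSchema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)))
println(SchemaConverters.toAvroType(strictSchema).toString)
// ... "fields":[{"name":"name","type":"string"},{"name":"age","type":"int"}]

// name declared nullable: the Avro field type becomes the union ["string","null"],
// matching the writer schema that to_avro derived from the DataFrame
// (toDF marks String columns as nullable)
val nullableSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = false)))
println(SchemaConverters.toAvroType(nullableSchema).toString)
// ... "fields":[{"name":"name","type":["string","null"]},{"name":"age","type":"int"}]

Since from_avro is given only this one schema to decode the bytes that to_avro
wrote, a reader schema that does not match the writer schema (plain "string"
instead of the union) misreads the binary layout, which is presumably where
results like [, 9] come from.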

On Wed, Feb 27, 2019 at 1:19 AM Gabor Somogyi 
wrote:

> Hi,
>
> I was dealing with Avro lately and most of the time such issues have
> something to do with the schema.
> One thing I've pinpointed quickly (where I was struggling too) is that the
> name field should be nullable, but the result is still not correct, so
> further digging is needed...
>
> scala> val expectedSchema = StructType(Seq(StructField("name",
> StringType,true),StructField("age", IntegerType, false)))
> expectedSchema: org.apache.spark.sql.types.StructType =
> StructType(StructField(name,StringType,true),
> StructField(age,IntegerType,false))
>
> scala> val avroTypeStruct =
> SchemaConverters.toAvroType(expectedSchema).toString
> avroTypeStruct: String =
> {"type":"record","name":"topLevelRecord","fields":[{"name":"name","type":["string","null"]},{"name":"age","type":"int"}]}
>
> scala> dfKV.select(from_avro('value, avroTypeStruct)).show
> +------------------------+
> |from_avro(value, struct)|
> +------------------------+
> |         [Mary Jane, 25]|
> |         [Mary Jane, 25]|
> +------------------------+
>
> BR,
> G
>
>
> On Wed, Feb 27, 2019 at 7:43 AM Hien Luu  wrote:
>
>> Hi,
>>
>> I ran into a pretty weird issue with to_avro and from_avro where it was
>> not
>> able to parse the data in a struct correctly.  Please see the simple and
>> self contained example below. I am using Spark 2.4.  I am not sure if I
>> missed something.
>>
>> This is how I start the spark-shell on my Mac:
>>
>> ./bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
>>
>> import org.apache.spark.sql.types._
>> import org.apache.spark.sql.avro._
>> import org.apache.spark.sql.functions._
>>
>>
>> spark.version
>>
>> val df = Seq((1, "John Doe",  30), (2, "Mary Jane", 25)).toDF("id",
>> "name",
>> "age")
>>
>> val dfStruct = df.withColumn("value", struct("name","age"))
>>
>> dfStruct.show
>> dfStruct.printSchema
>>
>> val dfKV = dfStruct.select(to_avro('id).as("key"),
>> to_avro('value).as("value"))
>>
>> val expectedSchema = StructType(Seq(StructField("name", StringType,
>> false),StructField("age", IntegerType, false)))
>>
>> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
>>
>> val avroTypeStr = s"""
>>   |{
>>   |  "type": "int",
>>   |  "name": "key"
>>   |}
>> """.stripMargin
>>
>>
>> dfKV.select(from_avro('key, avroTypeStr)).show
>>
>> // output
>> +-------------------+
>> |from_avro(key, int)|
>> +-------------------+
>> |                  1|
>> |                  2|
>> +-------------------+
>>
>> dfKV.select(from_avro('value, avroTypeStruct)).show
>>
>> // output
>> +------------------------+
>> |from_avro(value, struct)|
>> +------------------------+
>> |                   [, 9]|
>> |                   [, 9]|
>> +------------------------+
>>
>> Please help and thanks in advance.
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>

-- 
Regards,


Issue with file names writeStream in Structured Streaming

2019-02-27 Thread SRK


Hi,

We are using something like the following to write data to files in
Structured Streaming and we seem to get file names as part* as mentioned in
https://stackoverflow.com/questions/51056764/how-to-define-a-spark-structured-streaming-file-sink-file-path-or-file-name.
 

How can we get file names of our choice for each row in the dataframe? For
example, /day/month/id/log.txt?


df.writeStream 
  .format("parquet") // can be "orc", "json", "csv", etc.
  .option("path", "/path/to/save/") 
  .partitionBy("year", "month", "day", "hour") 
  .start()

Thanks for the help!!!
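As far as I know, the file sink only lets you control the directory layout, not
the part-* file names themselves, so a per-row file name such as
/day/month/id/log.txt is not something writeStream can produce directly. What
it can do is build the directory structure from derived partition columns, as
in the minimal sketch below (the rate source, column names and paths are
placeholders, not taken from the original job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("partitioned-file-sink").getOrCreate()

// Placeholder streaming source; in the real job this would be the input stream,
// with an event-time column ("ts") and an identifier column ("id").
val df = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "ts")
  .withColumn("id", col("value") % 10)

// Derive the partition columns explicitly, then let partitionBy build the directories.
val query = df
  .withColumn("year", year(col("ts")))
  .withColumn("month", month(col("ts")))
  .withColumn("day", dayofmonth(col("ts")))
  .writeStream
  .format("parquet")
  .option("path", "/path/to/save/")
  .option("checkpointLocation", "/path/to/checkpoint/")
  .partitionBy("year", "month", "day", "id")
  .start()

// Files land under /path/to/save/year=2019/month=2/day=27/id=3/part-....parquet;
// the part-* names themselves are generated by Spark and are not configurable here.
query.awaitTermination()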



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



dummy coding in sparklyr

2019-02-27 Thread ya
Dear list,

I am trying to run some regression models on a big data set using sparklyr.
Some of the explanatory variables (Xs) in my model are categorical variables,
so they have to be converted into dummy codes before the analysis. I
understand that in Spark such columns need to be treated as string type and
passed through ft_one_hot_encoder to get the dummy coding. There are some
discussions online; however, I could not figure out how to properly write the
code. Could you give me some suggestions please? Thank you very much.

The code looks as below:

> sc_mtcars%>%ft_string_indexer("gear","gear1")%>%ft_one_hot_encoder("gear1","gear2")%>%ml_linear_regression(hp~gear1+wt)
>  
Formula: hp ~ gear1 + wt

Coefficients:
(Intercept)        gear1           wt 
  -78.38285     36.41416     62.17596 

As you can see, it seems ft_one_hot_encoder("gear1", "gear2") didn't work;
otherwise there should be two coefficients for gear2. Any idea what went wrong?
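The underlying Spark ML chain that ft_string_indexer and ft_one_hot_encoder
wrap looks like the Scala sketch below (made-up data and hypothetical column
names, not from this thread). The point it illustrates is that the
one-hot-encoded output column, gear2 in the code above, is the column that has
to reach the regression's feature vector; putting gear1 in the formula gives
the model only the single numeric index, hence the single coefficient.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dummy-coding-sketch").getOrCreate()
import spark.implicits._

// Made-up mtcars-like rows: (hp, gear, wt). Not real data.
val cars = Seq(
  (110.0, "4", 2.62), (110.0, "4", 2.88), (93.0, "4", 2.32),
  (175.0, "3", 3.44), (105.0, "3", 3.46), (245.0, "3", 3.57),
  (264.0, "5", 3.17), (335.0, "5", 3.57)
).toDF("hp", "gear", "wt")

val indexer = new StringIndexer()            // gear -> numeric index (the gear1 step)
  .setInputCol("gear").setOutputCol("gearIdx")
val encoder = new OneHotEncoderEstimator()   // index -> dummy/one-hot vector (the gear2 step)
  .setInputCols(Array("gearIdx")).setOutputCols(Array("gearVec"))
val assembler = new VectorAssembler()        // the *encoded* column goes into the features
  .setInputCols(Array("gearVec", "wt")).setOutputCol("features")
val lr = new LinearRegression().setLabelCol("hp").setFeaturesCol("features")

val model = new Pipeline().setStages(Array(indexer, encoder, assembler, lr)).fit(cars)
val lrModel = model.stages.last.asInstanceOf[LinearRegressionModel]
println(lrModel.coefficients)                // coefficients for the dummy-coded gear levels plus wt

If sparklyr's formula interface does not accept the encoded vector column
directly, combining ft_vector_assembler with ml_linear_regression's
features_col and label_col arguments is the usual workaround, though I have not
verified that against the latest sparklyr release.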

One more thing: there are some earlier posts online showing regression results
with significance-test info (standard errors and p-values). Is there any way
to extract this info with the latest release of sparklyr (the standard errors,
maybe)?

Thank you very much.

Best regards,

YA. 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Issue with file names writeStream in Structured Streaming

2019-02-27 Thread Gourav Sengupta
Should that not cause more problems?

Regards,
Gourav Sengupta

On Wed, Feb 27, 2019 at 7:36 PM SRK  wrote:

>
> Hi,
>
> We are using something like the following to write data to files in
> Structured Streaming and we seem to get file names as part* as mentioned in
>
> https://stackoverflow.com/questions/51056764/how-to-define-a-spark-structured-streaming-file-sink-file-path-or-file-name.
>
>
> How can we get file names of our choice for each row in the dataframe? For
> example, /day/month/id/log.txt?
>
>
> df.writeStream
>   .format("parquet") // can be "orc", "json", "csv", etc.
>   .option("path", "/path/to/save/")
>   .partitionBy("year", "month", "day", "hour")
>   .start()
>
> Thanks for the help!!!
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>