Re: find failed test

2020-03-06 Thread Wim Van Leuven
Srsly?
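
One trick that usually works: grep for the failure markers instead of
scrolling. Untested, and adjust the paths to your run, but something like:

    grep -n "FAILED" log.txt
    grep -rl "FAILED" sql/core/target/surefire-reports/

ScalaTest prints "*** FAILED ***" next to each failed test, so that usually
narrows it down quickly.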

On Sat, 7 Mar 2020 at 03:28, Koert Kuipers  wrote:

> i just ran:
> mvn test -fae > log.txt
>
> at the end of log.txt i find it says there are failures:
> [INFO] Spark Project SQL .. FAILURE [47:55 min]
>
> that is not very helpful. what tests failed?
>
> i could go scroll up but the file has 21,517 lines. ok let's skip that.
>
> so i figure there are test reports in sql/core/target. i was right! it's
> sql/core/target/surefire-reports. but it has 276 files, so that's still a bit
> much to go through. i assume there is some nice summary that shows me the
> failed tests... maybe SparkTestSuite.txt? it's 2,687 lines, so again a bit
> much, but i do go through it and find nothing useful.
>
> so... how do i quickly find out which test failed exactly?
> there must be some maven trick here?
>
> thanks!
>


Re: Spark driver thread

2020-03-06 Thread Enrico Minack

James,

If you have multithreaded code in your driver, then you should
allocate multiple cores. In cluster mode you share the node with other
jobs. If you allocate fewer cores than you are using in your driver,
then that node gets over-allocated and you are stealing other
applications' resources. Be nice: limit the parallelism of your driver
and allocate as many Spark cores as you actually use
(spark.driver.cores, see
https://spark.apache.org/docs/latest/configuration.html#application-properties).
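
For example, a driver that runs two Spark actions concurrently keeps two
driver threads busy, so it should get at least two cores. A rough sketch
(the SparkSession named spark and the input paths are made up):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    // both actions are submitted at once; each blocks a driver thread
    // while it runs, so spark.driver.cores should be at least 2 here
    val countA = Future { spark.read.parquet("/data/a").count() }
    val countB = Future { spark.read.parquet("/data/b").count() }
    val total = Await.result(countA, 1.hour) + Await.result(countB, 1.hour)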


Enrico


On 06.03.20 at 18:36, James Yu wrote:

Pol, thanks for your reply.

Actually I am running Spark apps in CLUSTER mode. Is what you said 
still applicable in cluster mode? Thanks in advance for your further 
clarification.



*From:* Pol Santamaria 
*Sent:* Friday, March 6, 2020 12:59 AM
*To:* James Yu 
*Cc:* user@spark.apache.org 
*Subject:* Re: Spark driver thread
Hi James,

You can configure the Spark Driver to use more than a single thread. 
It is something that depends on the application, but the Spark driver 
can take advantage of multiple threads in many situations. For 
instance, when the driver program gathers or sends data to the workers.


So yes, if you do computation or I/O on the driver side, you should 
explore using multiple threads and more than 1 vCPU.


Bests,
Pol Santamaria

On Fri, Mar 6, 2020 at 1:28 AM James Yu wrote:


Hi,

Does a Spark driver always work single-threaded?

If yes, does it mean asking for more than one vCPU for the driver
is wasteful?


Thanks,
James





Spark not able to read from an Embedded Kafka Topic

2020-03-06 Thread Something Something
I am trying to write an integration test using Embedded Kafka, but I keep
getting a NullPointerException. My test case is very simple. It has the
following steps:

   1. Read a JSON file & write messages to an inputTopic.
   2. Perform a 'readStream' operation.
   3. Do a 'select' on the Stream. This throws a NullPointerException.

What am I doing wrong? Code is given below:


"My Test which runs with Embedded Kafka" should "Generate correct Result" in {

implicit val config: EmbeddedKafkaConfig =
  EmbeddedKafkaConfig(
kafkaPort = 9066,
zooKeeperPort = 2066,
Map("log.dir" -> "./src/test/resources/")
  )

withRunningKafka {
  createCustomTopic(inputTopic)
  val source = Source.fromFile("src/test/resources/test1.json")
  source.getLines.toList.filterNot(_.isEmpty).foreach(
line => publishStringMessageToKafka(inputTopic, line)
  )
  source.close()
  implicit val deserializer: StringDeserializer = new StringDeserializer

  createCustomTopic(outputTopic)
  import spark2.implicits._

  val schema = spark.read.json("my.json").schema
  val myStream = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9066")
.option("subscribe", inputTopic)
.load()

  // Schema looks good
  myStream.printSchema()

  // Following line throws NULLPointerException! Why?
  val df = myStream.select(from_json($"value".cast("string"),
schema).alias("value"))

  // There's more code... but let's not worry about that for now.
}

  }


find failed test

2020-03-06 Thread Koert Kuipers
i just ran:
mvn test -fae > log.txt

at the end of log.txt i find it says there are failures:
[INFO] Spark Project SQL .. FAILURE [47:55 min]

that is not very helpful. what tests failed?

i could go scroll up but the file has 21,517 lines. ok let's skip that.

so i figure there are test reports in sql/core/target. i was right! it's
sql/core/target/surefire-reports. but it has 276 files, so that's still a bit
much to go through. i assume there is some nice summary that shows me the
failed tests... maybe SparkTestSuite.txt? it's 2,687 lines, so again a bit
much, but i do go through it and find nothing useful.

so... how do i quickly find out which test failed exactly?
there must be some maven trick here?

thanks!


Re: Spark driver thread

2020-03-06 Thread Pol Santamaria
I totally agree with Russell.

In my opinion, the best way is to experiment and take measurements. There
are different chips (some with simultaneous multithreading, some without)
and different system setups, so I'd recommend playing with the
'spark.driver.cores' option.
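
For example, in cluster mode something like this (the class and jar names
are made up):

    spark-submit \
      --deploy-mode cluster \
      --conf spark.driver.cores=2 \
      --class com.example.MyApp \
      my-app.jar

Then measure your job with 1, 2, 4 driver cores and keep whatever is
fastest without over-allocating the node.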

Bests,
Pol Santamaria

On Fri, Mar 6, 2020 at 6:50 PM Russell Spitzer 
wrote:

> So one thing to know here is that all Java applications are going to use
> many threads; just because your particular main method doesn't spawn
> additional threads doesn't mean a library you access or use won't spawn
> additional threads. The other important note is that Spark doesn't actually
> equate cores with threads: when you request a core, Spark doesn't do
> anything special to actually make sure only a single physical core is in
> use.
>
> That said, would allocating more vCPUs to a driver make a difference?
> Probably not. This is very dependent on your own code and whether a lot of
> work is being done on the driver vs on the executors. For example, are you
> loading up and processing some data which is used to spawn remote work? If
> so, having more CPUs locally may help. So look into your app: is almost all
> the work inside DataFrames or RDDs? Then more resources for the driver
> won't help.
>
>
> TL;DR: for most use cases, 1 core is sufficient regardless of
> client/cluster mode.
>
> On Fri, Mar 6, 2020 at 11:36 AM James Yu  wrote:
>
>> Pol, thanks for your reply.
>>
>> Actually I am running Spark apps in CLUSTER mode. Is what you said still
>> applicable in cluster mode? Thanks in advance for your further
>> clarification.
>>
>> --
>> *From:* Pol Santamaria 
>> *Sent:* Friday, March 6, 2020 12:59 AM
>> *To:* James Yu 
>> *Cc:* user@spark.apache.org 
>> *Subject:* Re: Spark driver thread
>>
>> Hi James,
>>
>> You can configure the Spark Driver to use more than a single thread. It
>> is something that depends on the application, but the Spark driver can take
>> advantage of multiple threads in many situations. For instance, when the
>> driver program gathers or sends data to the workers.
>>
>> So yes, if you do computation or I/O on the driver side, you should
>> explore using multiple threads and more than 1 vCPU.
>>
>> Bests,
>> Pol Santamaria
>>
>> On Fri, Mar 6, 2020 at 1:28 AM James Yu  wrote:
>>
>> Hi,
>>
>> Does a Spark driver always work single-threaded?
>>
>> If yes, does it mean asking for more than one vCPU for the driver is
>> wasteful?
>>
>>
>> Thanks,
>> James
>>
>>


Re: Spark driver thread

2020-03-06 Thread Russell Spitzer
So one thing to know here is that all Java applications are going to use
many threads; just because your particular main method doesn't spawn
additional threads doesn't mean a library you access or use won't spawn
additional threads. The other important note is that Spark doesn't actually
equate cores with threads: when you request a core, Spark doesn't do
anything special to actually make sure only a single physical core is in
use.

That said, would allocating more vCPUs to a driver make a difference?
Probably not. This is very dependent on your own code and whether a lot of
work is being done on the driver vs on the executors. For example, are you
loading up and processing some data which is used to spawn remote work? If
so, having more CPUs locally may help. So look into your app: is almost all
the work inside DataFrames or RDDs? Then more resources for the driver
won't help.
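
A made-up sketch of that split (assumes a SparkSession named spark and a
Scala 2.12 build, where .par parallel collections are in the standard
library):

    import spark.implicits._

    // executor-side work: extra driver cores won't speed this up
    val squares = spark.range(1000000).map(n => n * n)

    // driver-side work: collect() pulls everything to the driver, and
    // the parallel collection below really uses multiple driver threads
    val local = squares.collect()
    val checksum = local.par.map(_ % 97).sum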


TL;DR: for most use cases, 1 core is sufficient regardless of
client/cluster mode.

On Fri, Mar 6, 2020 at 11:36 AM James Yu  wrote:

> Pol, thanks for your reply.
>
> Actually I am running Spark apps in CLUSTER mode. Is what you said still
> applicable in cluster mode? Thanks in advance for your further
> clarification.
>
> --
> *From:* Pol Santamaria 
> *Sent:* Friday, March 6, 2020 12:59 AM
> *To:* James Yu 
> *Cc:* user@spark.apache.org 
> *Subject:* Re: Spark driver thread
>
> Hi James,
>
> You can configure the Spark Driver to use more than a single thread. It is
> something that depends on the application, but the Spark driver can take
> advantage of multiple threads in many situations. For instance, when the
> driver program gathers or sends data to the workers.
>
> So yes, if you do computation or I/O on the driver side, you should
> explore using multiple threads and more than 1 vCPU.
>
> Bests,
> Pol Santamaria
>
> On Fri, Mar 6, 2020 at 1:28 AM James Yu  wrote:
>
> Hi,
>
> Does a Spark driver always work single-threaded?
>
> If yes, does it mean asking for more than one vCPU for the driver is
> wasteful?
>
>
> Thanks,
> James
>
>


Re: Spark driver thread

2020-03-06 Thread James Yu
Pol, thanks for your reply.

Actually I am running Spark apps in CLUSTER mode. Is what you said still 
applicable in cluster mode? Thanks in advance for your further clarification.


From: Pol Santamaria 
Sent: Friday, March 6, 2020 12:59 AM
To: James Yu 
Cc: user@spark.apache.org 
Subject: Re: Spark driver thread

Hi James,

You can configure the Spark Driver to use more than a single thread. It is 
something that depends on the application, but the Spark driver can take 
advantage of multiple threads in many situations. For instance, when the driver 
program gathers or sends data to the workers.

So yes, if you do computation or I/O on the driver side, you should explore 
using multiple threads and more than 1 vCPU.

Bests,
Pol Santamaria

On Fri, Mar 6, 2020 at 1:28 AM James Yu <ja...@ispot.tv> wrote:
Hi,

Does a Spark driver always work single-threaded?

If yes, does it mean asking for more than one vCPU for the driver is wasteful?


Thanks,
James




Re: Spark driver thread

2020-03-06 Thread Pol Santamaria
Hi James,

You can configure the Spark Driver to use more than a single thread. It is
something that depends on the application, but the Spark driver can take
advantage of multiple threads in many situations. For instance, when the
driver program gathers or sends data to the workers.

So yes, if you do computation or I/O on the driver side, you should explore
using multiple threads and more than 1 vCPU.
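
A hypothetical sketch of that second case (df and sendBatch are made up;
the point is the driver-side thread pool doing the I/O):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration._

    // gather results on the driver, then push them out on 4 threads;
    // driver-side I/O like this benefits from more than 1 vCPU
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

    val rows = df.collect()            // df: some DataFrame
    val uploads = rows.grouped(1000).map(batch =>
      Future(sendBatch(batch))         // sendBatch: made-up client call
    ).toList
    uploads.foreach(Await.result(_, 10.minutes))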

Bests,
Pol Santamaria

On Fri, Mar 6, 2020 at 1:28 AM James Yu  wrote:

> Hi,
>
> Does a Spark driver always work single-threaded?
>
> If yes, does it mean asking for more than one vCPU for the driver is
> wasteful?
>
>
> Thanks,
> James
>