Re: Spark with Hadoop 3.3 distribution release in Download page

2021-01-19 Thread Chao Sun
Hi Gabriel,

The distribution won’t be available for download until Spark supports
Hadoop 3.3.x. At the moment, Spark cannot use Hadoop 3.3.0 because of
various issues. Our best bet is to wait until Hadoop 3.3.1 comes out.
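
If you really can't wait, one option (a rough sketch only — I haven't
verified it against 3.3.0, given the issues above) is to build your own
distribution and override the Hadoop version, roughly:

  ./dev/make-distribution.sh --name hadoop-3.3 --tgz \
    -Phadoop-3.2 -Pyarn -Phive -Phive-thriftserver -Dhadoop.version=3.3.0

A pre-built hadoop3.3 tarball on the download page will have to wait for
official support.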

Best,
Chao

On Tue, Jan 19, 2021 at 8:00 PM Gabriel Magno 
wrote:

> Correction: spark-3.0.1-bin-hadoop3.3.tgz
>
> --
> Gabriel Magno
>
>
> On Wed, Jan 20, 2021 at 00:34, Gabriel Magno <
> gabrielmag...@gmail.com> wrote:
>
>> Are there any plans to provide a distribution release of Spark with
>> Hadoop 3.3 (spark-3.0.1-bin-hadoop3.2.tgz) directly on the Spark Download
>> page (https://spark.apache.org/downloads.html)?
>>
>> --
>> Gabriel Magno
>>
>


Re: Spark with Hadoop 3.3 distribution release in Download page

2021-01-19 Thread Gabriel Magno
Correction: spark-3.0.1-bin-hadoop3.3.tgz

--
Gabriel Magno


On Wed, Jan 20, 2021 at 00:34, Gabriel Magno
wrote:

> Are there any plans to provide a distribution release of Spark with
> Hadoop 3.3 (spark-3.0.1-bin-hadoop3.2.tgz) directly on the Spark Download
> page (https://spark.apache.org/downloads.html)?
>
> --
> Gabriel Magno
>


Spark with Hadoop 3.3 distribution release in Download page

2021-01-19 Thread Gabriel Magno
Are there any plans to provide a distribution release of Spark with
Hadoop 3.3 (spark-3.0.1-bin-hadoop3.2.tgz) directly on the Spark Download
page (https://spark.apache.org/downloads.html)?

--
Gabriel Magno


Application Timeout

2021-01-19 Thread Brett Spark
Hello!
When using Spark Standalone with Spark 2.4.4 / 3.0.0, we are seeing our
standalone Spark "applications" time out and show as "Finished" after
around an hour.

Here is a screenshot from the Spark master before it's marked as finished.
[image: image.png]
Here is a screenshot from the Spark master after it's marked as finished.
(After over an hour of idle time).
[image: image.png]
Here are the logs from the Spark Master / Worker:

spark-master-2d733568b2a7e82de7b2b09b6daa17e9-7cd4cfcddb-f84q7 master
2021-01-19 21:55:47,282 INFO master.Master: 172.32.3.66:34570 got
disassociated, removing it.
spark-master-2d733568b2a7e82de7b2b09b6daa17e9-7cd4cfcddb-f84q7 master
2021-01-19 21:55:52,095 INFO master.Master: 172.32.115.115:36556 got
disassociated, removing it.
spark-master-2d733568b2a7e82de7b2b09b6daa17e9-7cd4cfcddb-f84q7 master
2021-01-19 21:55:52,095 INFO master.Master: 172.32.115.115:37305 got
disassociated, removing it.
spark-master-2d733568b2a7e82de7b2b09b6daa17e9-7cd4cfcddb-f84q7 master
2021-01-19 21:55:52,096 INFO master.Master: Removing app
app-20210119204911-
spark-worker-2d733568b2a7e82de7b2b09b6daa17e9-7bbb75f9b6-8mv2b worker
2021-01-19 21:55:52,112 INFO shuffle.ExternalShuffleBlockResolver:
Application app-20210119204911- removed, cleanupLocalDirs = true

Is there a setting that causes an application to time out after an hour of
the Spark application or Spark worker being idle?

I would like to keep our Spark applications alive as long as possible.

I haven't been able to find a setting in the Spark configuration
documentation that corresponds to this, so I'm wondering if this is
something that's hard-coded.
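
In case it helps, here's a small snippet (just a sketch, assuming `spark`
is the active SparkSession) that I can run in the driver to dump every
timeout-related property the application was actually launched with, to
rule out a configured value I've missed:

  // List every Spark property whose key mentions "timeout".
  spark.sparkContext.getConf.getAll
    .filter { case (k, _) => k.toLowerCase.contains("timeout") }
    .sorted
    .foreach { case (k, v) => println(s"$k=$v") }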

Please let me know,
Thank you!


Persisting Customized Transformer

2021-01-19 Thread Artemis User
We are trying to create a customized transformer for an ML pipeline and
also want to persist the trained pipeline and retrieve it for
production.  To enable persistence, we will have to implement read/write
functions.  However, this is not feasible in Scala since the read/write
methods are private members of the MLModel class.  This problem is
described in the JIRA ticket https://issues.apache.org/jira/browse/SPARK-17048


The ticket suggested a workaround, but only in Java. I was wondering if
anyone has tried it in Scala?  The Scala workaround suggested in the
ticket didn't work for us.  Does anyone know whether this issue has been
resolved in 3.0.1 or the upcoming 3.1?  Any suggestion is highly
appreciated.
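
To make the question concrete, below is a minimal sketch of the kind of
custom transformer we're after (the class and package names are just
placeholders, not our real code). Whether the DefaultParamsWritable /
DefaultParamsReadable mix-ins are even reachable from user code seems to
depend on the Spark version — which is what the ticket is about — and
nesting the class under an org.apache.spark.ml sub-package is the usual
way people get at the package-private pieces:

  // Placeholder package: nesting under org.apache.spark.ml is the visibility workaround.
  package org.apache.spark.ml.custom

  import org.apache.spark.ml.Transformer
  import org.apache.spark.ml.param.ParamMap
  import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
  import org.apache.spark.sql.{DataFrame, Dataset}
  import org.apache.spark.sql.types.StructType

  // Do-nothing transformer used only to exercise pipeline save/load.
  class NoOpTransformer(override val uid: String)
      extends Transformer with DefaultParamsWritable {

    def this() = this(Identifiable.randomUID("noOpTransformer"))

    override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()

    override def transformSchema(schema: StructType): StructType = schema

    override def copy(extra: ParamMap): NoOpTransformer = defaultCopy(extra)
  }

  // Companion object supplies NoOpTransformer.load(path) when reading the pipeline back.
  object NoOpTransformer extends DefaultParamsReadable[NoOpTransformer]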


-- ND





Re: subscribe user@spark.apache.org

2021-01-19 Thread Sean Owen
You have to sign up by sending an email - see
http://spark.apache.org/community.html for what to send where.

On Tue, Jan 19, 2021 at 12:25 PM Peter Podlovics <
peter.d.podlov...@gmail.com> wrote:

> Hello,
>
> I would like to subscribe to the above mailing list. I already tried
> subscribing through the webpage, but I still haven't received the email.
>
> Thanks,
> Peter
>


subscribe user@spark.apache.org

2021-01-19 Thread Peter Podlovics
Hello,

I would like to subscribe to the above mailing list. I already tried
subscribing through the webpage, but I still haven't received the email.

Thanks,
Peter


Re: Data source v2 streaming sinks does not support Update mode

2021-01-19 Thread Eric Beabes
Will do, thanks!

On Tue, Jan 19, 2021 at 1:39 PM Gabor Somogyi 
wrote:

> Thanks for double checking the version. Please report back with 3.1
> version whether it works or not.
>
> G
>
>
> On Tue, 19 Jan 2021, 07:41 Eric Beabes,  wrote:
>
>> Confirmed. The cluster Admin said his team installed the latest version
>> from Cloudera, which comes with Spark 3.0.0-preview2. They are going to try
>> to upgrade it to the community edition of Spark 3.1.0.
>>
>> Thanks Jungtaek for the tip. Greatly appreciate it.
>>
>> On Tue, Jan 19, 2021 at 8:45 AM Eric Beabes 
>> wrote:
>>
>>> >> "Could you please make sure you're not using "3.0.0-preview".
>>>
>>> This could be the reason. I will check with our Hadoop cluster
>>> administrator. It's quite possible that they installed the "Preview" mode.
>>> Yes, the code works in the Local dev environment.
>>>
>>>
>>> On Tue, Jan 19, 2021 at 5:29 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 I see no issue from running this code in local dev. (changed the scope
 of Spark artifacts to "compile" of course)

 Could you please make sure you're not using "3.0.0-preview"? In
 3.0.0-preview update mode was restricted (as the error message says) and it
 was reverted before releasing Spark 3.0.0. Clearing Spark artifacts in your
 .m2 cache may work.

 On Tue, Jan 19, 2021 at 8:50 AM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> Please also include some test data. I quickly looked through the
> code and the code may require a specific format of the record.
>
> On Tue, Jan 19, 2021 at 12:10 AM German Schiavon <
> gschiavonsp...@gmail.com> wrote:
>
>> Hi,
>>
>> This is the jira
>>  and
>> regarding the repo, I believe you can just commit it to your personal repo
>> and that should be it.
>>
>> Regards
>>
>> On Mon, 18 Jan 2021 at 15:46, Eric Beabes 
>> wrote:
>>
>>> Sorry. Can you please tell me where to create the JIRA? Also is
>>> there any specific Github repository I need to commit code into - OR - 
>>> just
>>> in our own? Please let me know. Thanks.
>>>
>>> On Mon, Jan 18, 2021 at 7:07 PM Gabor Somogyi <
>>> gabor.g.somo...@gmail.com> wrote:
>>>
 Thank you. As we've asked, could you please create a JIRA and
 commit the code to GitHub?
 It would speed things up a lot.

 G


 On Mon, Jan 18, 2021 at 2:14 PM Eric Beabes <
 mailinglist...@gmail.com> wrote:

> Here's a very simple reproducer app. I've attached 3 files:
> SparkTest.scala, QueryListener.scala & pom.xml. Copying contents in 
> the
> email as well:
>
> package com.myorg
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> import org.apache.kafka.clients.producer.ProducerConfig
> import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
> import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}
>
> import scala.util.{Failure, Success, Try}
>
> object Spark3Test {
>
>   val isLocal = false
>
>   implicit val stringEncoder: Encoder[String] = Encoders.STRING
>   implicit val myStateEncoder: Encoder[MyState] = 
> Encoders.kryo[MyState]
>
>   val START_DATE_INDEX = 21
>   val END_DATE_INDEX = 40
>
>   def main(args: Array[String]) {
>
> val spark: SparkSession = initializeSparkSession("Spark 3.0 
> Upgrade", isLocal)
> spark.sparkContext.setLogLevel("WARN")
>
> readKafkaStream(spark)
>   .groupByKey(row => {
> row.substring(START_DATE_INDEX, END_DATE_INDEX)
>   })
>   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
> updateAcrossEvents
>   )
>   .filter(row => !row.inProgress)
>   .map(row => "key: " + row.dateTime + " " + "count: " + 
> row.count)
>   .writeStream
>   .format("kafka")
>   .option(
> s"kafka.${ProducerConfig.BOOTSTRAP_SERVERS_CONFIG}",
> "10.29.42.141:9092"
> //"localhost:9092"
>   )
>   .option("topic", "spark3test")
>   .option("checkpointLocation", "/tmp/checkpoint_5")
>   .outputMode("update")
>   .start()
> manageStreamingQueries(spark)
>   }
>
>   def readKafkaStream(sparkSession: SparkSession): Dataset[String] = {
>
> val stream = sparkSession.readStream
>   

Re: Data source v2 streaming sinks does not support Update mode

2021-01-19 Thread Gabor Somogyi
Thanks for double checking the version. Please report back with 3.1 version
whether it works or not.

G


On Tue, 19 Jan 2021, 07:41 Eric Beabes,  wrote:

> Confirmed. The cluster Admin said his team installed the latest version
> from Cloudera, which comes with Spark 3.0.0-preview2. They are going to try
> to upgrade it to the community edition of Spark 3.1.0.
>
> Thanks Jungtaek for the tip. Greatly appreciate it.
>
> On Tue, Jan 19, 2021 at 8:45 AM Eric Beabes 
> wrote:
>
>> >> "Could you please make sure you're not using "3.0.0-preview".
>>
>> This could be the reason. I will check with our Hadoop cluster
>> administrator. It's quite possible that they installed the "Preview" mode.
>> Yes, the code works in the Local dev environment.
>>
>>
>> On Tue, Jan 19, 2021 at 5:29 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> I see no issue from running this code in local dev. (changed the scope
>>> of Spark artifacts to "compile" of course)
>>>
>>> Could you please make sure you're not using "3.0.0-preview"? In
>>> 3.0.0-preview update mode was restricted (as the error message says) and it
>>> was reverted before releasing Spark 3.0.0. Clearing Spark artifacts in your
>>> .m2 cache may work.
>>>
>>> On Tue, Jan 19, 2021 at 8:50 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Please also include some test data. I quickly looked through the
 code and the code may require a specific format of the record.

 On Tue, Jan 19, 2021 at 12:10 AM German Schiavon <
 gschiavonsp...@gmail.com> wrote:

> Hi,
>
> This is the jira
>  and regarding
> the repo, I believe you can just commit it to your personal repo and that
> should be it.
>
> Regards
>
> On Mon, 18 Jan 2021 at 15:46, Eric Beabes 
> wrote:
>
>> Sorry. Can you please tell me where to create the JIRA? Also is there
>> any specific Github repository I need to commit code into - OR - just in
>> our own? Please let me know. Thanks.
>>
>> On Mon, Jan 18, 2021 at 7:07 PM Gabor Somogyi <
>> gabor.g.somo...@gmail.com> wrote:
>>
>>> Thank you. As we've asked, could you please create a JIRA and commit
>>> the code to GitHub?
>>> It would speed things up a lot.
>>>
>>> G
>>>
>>>
>>> On Mon, Jan 18, 2021 at 2:14 PM Eric Beabes <
>>> mailinglist...@gmail.com> wrote:
>>>
 Here's a very simple reproducer app. I've attached 3 files:
 SparkTest.scala, QueryListener.scala & pom.xml. Copying contents in the
 email as well:

 package com.myorg

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, Path}
 import org.apache.hadoop.security.UserGroupInformation
 import org.apache.kafka.clients.producer.ProducerConfig
 import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
 import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

 import scala.util.{Failure, Success, Try}

 object Spark3Test {

   val isLocal = false

   implicit val stringEncoder: Encoder[String] = Encoders.STRING
   implicit val myStateEncoder: Encoder[MyState] = 
 Encoders.kryo[MyState]

   val START_DATE_INDEX = 21
   val END_DATE_INDEX = 40

   def main(args: Array[String]) {

 val spark: SparkSession = initializeSparkSession("Spark 3.0 
 Upgrade", isLocal)
 spark.sparkContext.setLogLevel("WARN")

 readKafkaStream(spark)
   .groupByKey(row => {
 row.substring(START_DATE_INDEX, END_DATE_INDEX)
   })
   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
 updateAcrossEvents
   )
   .filter(row => !row.inProgress)
   .map(row => "key: " + row.dateTime + " " + "count: " + row.count)
   .writeStream
   .format("kafka")
   .option(
 s"kafka.${ProducerConfig.BOOTSTRAP_SERVERS_CONFIG}",
 "10.29.42.141:9092"
 //"localhost:9092"
   )
   .option("topic", "spark3test")
   .option("checkpointLocation", "/tmp/checkpoint_5")
   .outputMode("update")
   .start()
 manageStreamingQueries(spark)
   }

   def readKafkaStream(sparkSession: SparkSession): Dataset[String] = {

 val stream = sparkSession.readStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "10.29.42.141:9092")
   .option("subscribe", "inputTopic")
   .option("startingOffsets", "latest")
   .option("failOnDataLoss", "false")