Re: Hostname :BUG

2020-03-04 Thread Zahid Rahman
Please explain why you think that, if you have a reason other than the
following:

If you think so because the header of /etc/hostname says "hosts", that is
because I copied the file header from /etc/hosts to /etc/hostname.




On Wed, 4 Mar 2020, 21:14 Andrew Melo,  wrote:

> Hello Zahid,
>
> On Wed, Mar 4, 2020 at 1:47 PM Zahid Rahman  wrote:
>
>> Hi,
>>
>> I found the problem was that on my Linux operating system /etc/hostname
>> was blank.
>>
>> *STEP 1*
>> I searched Google for the error message and found an answer suggesting I
>> add the following to /etc/hostname:
>>
>> 127.0.0.1  [hostname] localhost.
>>
>
> I believe you've confused /etc/hostname and /etc/hosts --
>
>
>>
>> I did that, but there was still an error; this time the Spark log on
>> standard output was concatenating the text content of /etc/hostname
>> like so: 127.0.0.1[hostname]localhost.
>>
>> *STEP 2*
>> My second attempt was to change /etc/hostname to 127.0.0.1.
>> This time I got a warning about "using loopback" rather than an error.
>>
>> *STEP 3*
>> I wasn't happy with that, so I then changed /etc/hostname to the value
>> shown below, and the warning message disappeared. My guess is that the
>> error is caused by the act of creating the Spark session via the
>> SparkConf() API.
>>
>>  SparkConf sparkConf = new SparkConf()
>>  .setAppName("Simple Application")
>>  .setMaster("local")
>>  .set("spark.executor.memory","2g");
>>
>> $ cat /etc/hostname
>> # hosts This file describes a number of hostname-to-address
>> #   mappings for the TCP/IP subsystem.  It is mostly
>> #   used at boot time, when no name servers are running.
>> #   On small systems, this file can be used instead of a
>> #   "named" name server.
>> # Syntax:
>> #
>> # IP-Address  Full-Qualified-Hostname  Short-Hostname
>> #
>>
>> 192.168.0.42
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> zahid@localhost
>> :~/Downloads/apachespark/Apache-Spark-Example/Java-Code-Geek>
>> mvn exec:java -Dexec.mainClass=com.javacodegeek.examples.SparkExampleRDD
>> -Dexec.args="input.txt"
>> [INFO] Scanning for projects...
>> [WARNING]
>> [WARNING] Some problems were encountered while building the effective
>> model for javacodegeek:examples:jar:1.0-SNAPSHOT
>> [WARNING] 'build.plugins.plugin.version' for
>> org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 12,
>> column 21
>> [WARNING]
>> [WARNING] It is highly recommended to fix these problems because they
>> threaten the stability of your build.
>> [WARNING]
>> [WARNING] For this reason, future Maven versions might no longer support
>> building such malformed projects.
>> [WARNING]
>> [INFO]
>> [INFO] ---< javacodegeek:examples
>> >
>> [INFO] Building examples 1.0-SNAPSHOT
>> [INFO] [ jar
>> ]-
>> [INFO]
>> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
>> WARNING: An illegal reflective access operation has occurred
>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
>> (file:/home/zahid/.m2/repository/org/apache/spark/spark-unsafe_2.12/2.4.5/spark-unsafe_2.12-2.4.5.jar)
>> to method java.nio.Bits.unaligned()
>> WARNING: Please consider reporting this to the maintainers of
>> org.apache.spark.unsafe.Platform
>> WARNING: Use --illegal-access=warn to enable warnings of further illegal
>> reflective access operations
>> WARNING: All illegal access operations will be denied in a future release
>> Using Spark's default log4j profile:
>> org/apache/spark/log4j-defaults.properties
>> 20/02/29 17:20:40 INFO SparkContext: Running Spark version 2.4.5
>> 20/02/29 17:20:40 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 20/02/29 17:20:41 INFO SparkContext: Submitted application: Word Count
>> 20/02/29 17:20:41 INFO SecurityManager: Changing view acls to: zahid
>> 20/02/29 17:20:41 INFO SecurityManager: Changing modify acls to: zahid
>> 20/02/29 17:20:41 INFO SecurityManager: Changing view acls groups to:
>> 20/02/29 17:20:41 INFO SecurityManager: Changing modify acls groups to:
>> 20/02/29 17:20:41 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users  with view permissions: Set(zahid);
>> groups with view permissions: Set(); users  with modify permissions:
>> Set(zahid); groups with modify permissions: Set()
>> 20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
>> random free port. You may check whether configuring an appropriate binding
>> address.
>> 20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
>> random free port. You may check whether configuring an appropriate binding
>> address.
>> 20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
>> random free port. Yo

Re: Can't get Spark to interface with S3A Filesystem with correct credentials

2020-03-04 Thread Hariharan
If you're using Hadoop 2.7 or below, you may also need to use the
following Hadoop settings:

fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A

Hadoop 2.8 and above would have these set by default.
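
For what it's worth, a minimal sketch of one way to apply these settings in
code (the bucket, file, and environment-variable handling are placeholders,
and everything here can equally be passed as --conf flags):

import org.apache.spark.sql.SparkSession

// Sketch only: any spark.hadoop.* option is copied into the Hadoop
// Configuration that the S3A connector reads.
val spark = SparkSession.builder()
  .appName("s3a-example")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A")
  .config("spark.hadoop.fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
  .config("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
  .getOrCreate()

// Hypothetical bucket and object, just to exercise the configuration.
println(spark.sparkContext.textFile("s3a://some-bucket/some-file.txt").count())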

Thanks,
Hariharan

On Thu, Mar 5, 2020 at 2:41 AM Devin Boyer
 wrote:
>
> Hello,
>
> I'm attempting to run Spark within a Docker container with the hope of 
> eventually running Spark on Kubernetes. Nearly all the data we currently 
> process with Spark is stored in S3, so I need to be able to interface with it 
> using the S3A filesystem.
>
> I feel like I've gotten close to getting this working but for some reason 
> cannot get my local Spark installations to correctly interface with S3 yet.
>
> A basic example of what I've tried:
>
> - Build Kubernetes docker images by downloading the
>   spark-2.4.5-bin-hadoop2.7.tgz archive and building the
>   kubernetes/dockerfiles/spark/Dockerfile image.
> - Run an interactive docker container using the above built image.
> - Within that container, run spark-shell. This command passes valid AWS
>   credentials by setting spark.hadoop.fs.s3a.access.key and
>   spark.hadoop.fs.s3a.secret.key using --conf flags, and downloads the
>   hadoop-aws package by specifying the --packages
>   org.apache.hadoop:hadoop-aws:2.7.3 flag.
> - Try to access the simple public file as outlined in the "Integration with
>   Cloud Infrastructures" documentation by running:
>   sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
> - Observe this to fail with a 403 Forbidden exception thrown by S3
>
>
> I've tried a variety of other means of setting credentials (like exporting 
> the standard AWS_ACCESS_KEY_ID environment variable before launching 
> spark-shell), and other means of building a Spark image and including the 
> appropriate libraries (see this Github repo: 
> https://github.com/drboyer/spark-s3a-demo), all with the same results. I've 
> tried also accessing objects within our AWS account, rather than the object 
> from the public landsat-pds bucket, with the same 403 error being thrown.
>
> Can anyone help explain why I can't seem to connect to S3 successfully using 
> Spark, or even explain where I could look for additional clues as to what's 
> misconfigured? I've tried turning up the logging verbosity and didn't see 
> much that was particularly useful, but happy to share additional log output 
> too.
>
> Thanks for any help you can provide!
>
> Best,
> Devin Boyer




Re: Stateful Structured Spark Streaming: Timeout is not getting triggered

2020-03-04 Thread Tathagata Das
Make sure that you are continuously feeding data into the query to trigger
the batches; only then are timeouts processed.
See the timeout behavior details here -
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.GroupState
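
For reference, a minimal, hypothetical sketch of an update function used with
mapGroupsWithState and ProcessingTimeTimeout; the Event, RunningState, and
Output case classes are illustrative only (not from the post below), and the
point is that the hasTimedOut branch only runs when a later micro-batch is
triggered by new data arriving somewhere in the query:

import org.apache.spark.sql.streaming.GroupState

case class Event(id: String, count: Long)
case class RunningState(id: String, total: Long)
case class Output(id: String, total: Long, timedOut: Boolean)

def updateAcrossEvents(
    id: String,
    events: Iterator[Event],
    state: GroupState[RunningState]): Output = {
  if (state.hasTimedOut) {
    // Reached only when a new batch runs after the timeout has expired.
    val last = state.get
    state.remove()
    Output(last.id, last.total, timedOut = true)
  } else {
    val current = state.getOption.getOrElse(RunningState(id, 0L))
    val updated = events.foldLeft(current)((s, e) => s.copy(total = s.total + e.count))
    state.update(updated)
    // Set (or refresh) the timeout once per invocation, after updating state.
    state.setTimeoutDuration("2 minutes")
    Output(updated.id, updated.total, timedOut = false)
  }
}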

On Wed, Mar 4, 2020 at 2:51 PM Something Something 
wrote:

> I've set the timeout duration to "2 minutes" as follows:
>
> def updateAcrossEvents (tuple3: Tuple3[String, String, String], inputs: 
> Iterator[R00tJsonObject],
>   oldState: GroupState[MyState]): OutputRow = {
>
> println(" Inside updateAcrossEvents with : " + tuple3._1 + ", " + 
> tuple3._2 + ", " + tuple3._3)
> var state: MyState = if (oldState.exists) oldState.get else 
> MyState(tuple3._1, tuple3._2, tuple3._3)
>
> if (oldState.hasTimedOut) {
>   println("@ oldState has timed out ")
>   // Logic to Write OutputRow
>   OutputRow("some values here...")
> } else {
>   for (input <- inputs) {
> state = updateWithEvent(state, input)
> oldState.update(state)
> *oldState.setTimeoutDuration("2 minutes")*
>   }
>   OutputRow(null, null, null)
> }
>
>   }
>
> I have also specified ProcessingTimeTimeout in 'mapGroupsWithState' as 
> follows...
>
> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateAcrossEvents)
>
> But 'hasTimedOut' is never true so I don't get any output! What am I doing 
> wrong?
>
>
>
>


Re: SPARK Suitable IDE

2020-03-04 Thread Holden Karau
I work in emacs with ensime. I think really any IDE is ok, so go with the
one you feel most at home in.

On Wed, Mar 4, 2020 at 5:49 PM tianlangstudio
 wrote:

> We use IntelliJ IDEA, whether it's Java, Scala, or Python.
>
> TianlangStudio 
> Some of the biggest lies: I will start tomorrow/Others are better than
> me/I am not good enough/I don't have time/This is the way I am
> 
>
>
> --
> From: Zahid Rahman
> Sent: Tuesday, March 3, 2020, 06:43
> To: user
> Subject: SPARK Suitable IDE
>
> Hi,
>
> Can you recommend a suitable IDE for Apache Spark from the list below, or
> suggest a more suitable one?
>
> Codeanywhere
> goormIDE
> Koding
> SourceLair
> ShiftEdit
> Browxy
> repl.it
> PaizaCloud IDE
> Eclipse Che
> Visual Studio Online
> Gitpod
> Google Cloud Shell
> Codio
> Codepen
> CodeTasty
> Glitch
> JSitor
> ICEcoder
> Codiad
> Dirigible
> Orion
> Codiva.io
> Collide
> Codenvy
> AWS Cloud9
> JSFiddle
> GitLab
> SLAppForge Sigma
> Jupyter
> CoCalc
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: SPARK Suitable IDE

2020-03-04 Thread tianlangstudio
We use IntelliJ IDEA, whether it's Java, Scala, or Python.

 
TianlangStudio
Some of the biggest lies: I will start tomorrow/Others are better than me/I am 
not good enough/I don't have time/This is the way I am
 


--
From: Zahid Rahman
Sent: Tuesday, March 3, 2020, 06:43
To: user
Subject: SPARK Suitable IDE

Hi,

Can you recommend a suitable IDE for Apache Spark from the list below, or
suggest a more suitable one?

Codeanywhere
goormIDE
Koding
SourceLair
ShiftEdit
Browxy
repl.it
PaizaCloud IDE
Eclipse Che
Visual Studio Online
Gitpod
Google Cloud Shell
Codio
Codepen
CodeTasty
Glitch
JSitor
ICEcoder
Codiad
Dirigible
Orion
Codiva.io
Collide
Codenvy
AWS Cloud9
JSFiddle
GitLab
SLAppForge Sigma
Jupyter
CoCalc

Backbutton.co.uk
¯\_(ツ)_/¯ 
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org






Spark DataSet class is not truly private[sql]

2020-03-04 Thread Nirav Patel
I see Spark dataset is defined as:

class Dataset[T] private[sql](

@transient val sparkSession: SparkSession,

@DeveloperApi @InterfaceStability.Unstable @transient val queryExecution:
QueryExecution,

encoder: Encoder[T])


However, it has public constructors that allow Dataset to be extended,
which I don't think was intended by the developers.


  def this(sparkSession: SparkSession, logicalPlan: LogicalPlan, encoder:
Encoder[T]) = {
this(sparkSession, sparkSession.sessionState.executePlan(logicalPlan),
encoder)
  }

  def this(sqlContext: SQLContext, logicalPlan: LogicalPlan, encoder:
Encoder[T]) = {
this(sqlContext.sparkSession, logicalPlan, encoder)
  }
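
Agreed that subclassing does not look intended. As a small sketch of the
usual alternative, Dataset can be enriched with extra methods through an
implicit class instead of being extended; the countAndLog method here is
just an example name:

import org.apache.spark.sql.Dataset

object DatasetExtensions {
  // Adds methods to Dataset[T] without subclassing it.
  implicit class RichDataset[T](val ds: Dataset[T]) {
    def countAndLog(label: String): Long = {
      val n = ds.count()
      println(s"$label has $n rows")
      n
    }
  }
}

// Usage: import DatasetExtensions._ and then someDataset.countAndLog("events")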



Stateful Structured Spark Streaming: Timeout is not getting triggered

2020-03-04 Thread Something Something
I've set the timeout duration to "2 minutes" as follows:

def updateAcrossEvents (tuple3: Tuple3[String, String, String],
inputs: Iterator[R00tJsonObject],
  oldState: GroupState[MyState]): OutputRow = {

println(" Inside updateAcrossEvents with : " + tuple3._1 + ",
" + tuple3._2 + ", " + tuple3._3)
var state: MyState = if (oldState.exists) oldState.get else
MyState(tuple3._1, tuple3._2, tuple3._3)

if (oldState.hasTimedOut) {
  println("@ oldState has timed out ")
  // Logic to Write OutputRow
  OutputRow("some values here...")
} else {
  for (input <- inputs) {
state = updateWithEvent(state, input)
oldState.update(state)
*oldState.setTimeoutDuration("2 minutes")*
  }
  OutputRow(null, null, null)
}

  }

I have also specified ProcessingTimeTimeout in 'mapGroupsWithState' as
follows...

.mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateAcrossEvents)

But 'hasTimedOut' is never true so I don't get any output! What am I
doing wrong?


Re: Stateful Spark Streaming: Required attribute 'value' not found

2020-03-04 Thread Something Something
The problem was fixed by simply adding 'toJSON' before 'writeStream'. Maybe
it will help someone.
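
For anyone who hits the same error later, a sketch of where the call goes,
reusing the names from the quoted snippet below (so it is not runnable on its
own): Dataset.toJSON produces a Dataset[String] whose single column is named
"value", which is exactly the column the Kafka sink requires.

withEventTime
  .as[R00tJsonObject]
  .withWatermark("event_time", "5 minutes")
  .groupByKey(row => (row.value.Id, row.value.time.toString, row.value.cId))
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(updateAcrossEvents)
  .toJSON  // each OutputRow becomes a JSON string in a column named "value"
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "myTopic")
  .option("checkpointLocation", "/Users/username/checkpointLocation")
  .outputMode("update")
  .start()
  .awaitTermination()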

On Tue, Mar 3, 2020 at 6:02 PM Something Something 
wrote:

> In a Stateful Spark Streaming application I am writing the 'OutputRow' in
> the 'updateAcrossEvents' but I keep getting this error (*Required
> attribute 'value' not found*) while it's trying to write to Kafka. I know
> from the documentation that 'value' attribute needs to be set but how do I
> do that in the 'Stateful Structured Streaming'? Where & how do I add this
> 'value' attribute in the following code? *Note: I am using Spark 2.3.1*
>
> withEventTime
>   .as[R00tJsonObject]
>   .withWatermark("event_time", "5 minutes")
>   .groupByKey(row => (row.value.Id, row.value.time.toString, 
> row.value.cId))
>   
> .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(updateAcrossEvents)
>   .writeStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "localhost:9092")
>   .option("topic", "myTopic")
>   .option("checkpointLocation", "/Users/username/checkpointLocation")
>   .outputMode("update")
>   .start()
>   .awaitTermination()
>
>


Spark 2.4.5 - Structured Streaming - Failed Jobs expire from the UI

2020-03-04 Thread puneetloya
Hi,

I have been using Spark 2.4.5 for the past month. When a structured
streaming query fails, it appears in the UI as a failed job. But after a
while these failed jobs expire (disappear) from the UI. Is there a setting
that expires failed jobs? I was using Spark 2.2 before this and never saw
this behavior.

Also, I tried calling the Spark API to return failed jobs, and it returns an empty result.
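
For reference, the Spark UI only keeps a bounded number of jobs and stages in
memory; whether that limit explains the difference from 2.2 is an assumption,
but a sketch of raising the retention limits (the values here are arbitrary)
looks like this:

import org.apache.spark.SparkConf

// Must be set before the SparkContext (and its UI) is created.
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "5000")    // default 1000
  .set("spark.ui.retainedStages", "5000")  // default 1000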

Thanks,
Puneet



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




Re: Can't get Spark to interface with S3A Filesystem with correct credentials

2020-03-04 Thread Steven Stetzler
To successfully read from S3 using s3a, I've had to also set
```
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```
in addition to `spark.hadoop.fs.s3a.access.key` and
`spark.hadoop.fs.s3a.secret.key`. I've also needed to ensure Spark has
access to the AWS SDK jar. I have downloaded `aws-java-sdk-1.7.4.jar` (from
Maven) and paired it with `hadoop-aws-2.7.3.jar` in `$SPARK_HOME/jars`.

These additional configurations don't seem related to credentials and
security (and may not even be needed in my case), but perhaps they will
help you.
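
One other cheap diagnostic (purely a sketch, not a fix): inside spark-shell,
read the resolved Hadoop configuration back to confirm that the
--conf spark.hadoop.* values actually reached the S3A connector.

// Run inside spark-shell.
val hc = spark.sparkContext.hadoopConfiguration
println("fs.s3a.impl       = " + hc.get("fs.s3a.impl"))
println("fs.s3a.access.key = " +
  Option(hc.get("fs.s3a.access.key")).map(_.take(4) + "...").getOrElse("<not set>"))
println("fs.s3a.secret.key set? " + (hc.get("fs.s3a.secret.key") != null))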

Thanks,
Steven

On Wed, Mar 4, 2020 at 1:11 PM Devin Boyer 
wrote:

> Hello,
>
> I'm attempting to run Spark within a Docker container with the hope of
> eventually running Spark on Kubernetes. Nearly all the data we currently
> process with Spark is stored in S3, so I need to be able to interface with
> it using the S3A filesystem.
>
> I feel like I've gotten close to getting this working but for some reason
> cannot get my local Spark installations to correctly interface with S3 yet.
>
> A basic example of what I've tried:
>
>- Build Kubernetes docker images by downloading the
>spark-2.4.5-bin-hadoop2.7.tgz archive and building the
>kubernetes/dockerfiles/spark/Dockerfile image.
>- Run an interactive docker container using the above built image.
>- Within that container, run spark-shell. This command passes valid
>AWS credentials by setting spark.hadoop.fs.s3a.access.key and
>spark.hadoop.fs.s3a.secret.key using --conf flags, and downloads the
>hadoop-aws package by specifying the --packages
>org.apache.hadoop:hadoop-aws:2.7.3 flag.
>    - Try to access the simple public file as outlined in the "Integration
>    with Cloud Infrastructures" documentation by running:
>sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
>- Observe this to fail with a 403 Forbidden exception thrown by S3
>
>
> I've tried a variety of other means of setting credentials (like exporting
> the standard AWS_ACCESS_KEY_ID environment variable before launching
> spark-shell), and other means of building a Spark image and including the
> appropriate libraries (see this Github repo:
> https://github.com/drboyer/spark-s3a-demo), all with the same results.
> I've tried also accessing objects within our AWS account, rather than the
> object from the public landsat-pds bucket, with the same 403 error being
> thrown.
>
> Can anyone help explain why I can't seem to connect to S3 successfully
> using Spark, or even explain where I could look for additional clues as to
> what's misconfigured? I've tried turning up the logging verbosity and
> didn't see much that was particularly useful, but happy to share additional
> log output too.
>
> Thanks for any help you can provide!
>
> Best,
> Devin Boyer
>


Re: Hostname :BUG

2020-03-04 Thread Andrew Melo
Hello Zahid,

On Wed, Mar 4, 2020 at 1:47 PM Zahid Rahman  wrote:

> Hi,
>
> I found the problem was that on my Linux operating system /etc/hostname
> was blank.
>
> *STEP 1*
> I searched Google for the error message and found an answer suggesting I
> add the following to /etc/hostname:
>
> 127.0.0.1  [hostname] localhost.
>

I believe you've confused /etc/hostname and /etc/hosts --
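
For completeness, if the underlying problem is only that the driver cannot
resolve or bind to the machine's hostname, a sketch of side-stepping
/etc/hostname edits entirely is to tell Spark which address to use (127.0.0.1
below is just an example; the SPARK_LOCAL_IP environment variable has a
similar effect):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local")
  .set("spark.driver.bindAddress", "127.0.0.1")  // address the driver binds to
  .set("spark.driver.host", "127.0.0.1")         // address advertised to executors

val spark = SparkSession.builder().config(sparkConf).getOrCreate()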


>
> I did that, but there was still an error; this time the Spark log on
> standard output was concatenating the text content of /etc/hostname
> like so: 127.0.0.1[hostname]localhost.
>
> *STEP 2*
> My second attempt was to change /etc/hostname to 127.0.0.1.
> This time I got a warning about "using loopback" rather than an error.
>
> *STEP 3*
> I wasn't happy with that, so I then changed /etc/hostname to the value
> shown below, and the warning message disappeared. My guess is that the
> error is caused by the act of creating the Spark session via the
> SparkConf() API.
>
>  SparkConf sparkConf = new SparkConf()
>  .setAppName("Simple Application")
>  .setMaster("local")
>  .set("spark.executor.memory","2g");
>
> $ cat /etc/hostname
> # hosts This file describes a number of hostname-to-address
> #   mappings for the TCP/IP subsystem.  It is mostly
> #   used at boot time, when no name servers are running.
> #   On small systems, this file can be used instead of a
> #   "named" name server.
> # Syntax:
> #
> # IP-Address  Full-Qualified-Hostname  Short-Hostname
> #
>
> 192.168.0.42
>
>
>
>
>
>
>
>
>
>
>
>
> zahid@localhost
> :~/Downloads/apachespark/Apache-Spark-Example/Java-Code-Geek>
> mvn exec:java -Dexec.mainClass=com.javacodegeek.examples.SparkExampleRDD
> -Dexec.args="input.txt"
> [INFO] Scanning for projects...
> [WARNING]
> [WARNING] Some problems were encountered while building the effective
> model for javacodegeek:examples:jar:1.0-SNAPSHOT
> [WARNING] 'build.plugins.plugin.version' for
> org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 12,
> column 21
> [WARNING]
> [WARNING] It is highly recommended to fix these problems because they
> threaten the stability of your build.
> [WARNING]
> [WARNING] For this reason, future Maven versions might no longer support
> building such malformed projects.
> [WARNING]
> [INFO]
> [INFO] ---< javacodegeek:examples
> >
> [INFO] Building examples 1.0-SNAPSHOT
> [INFO] [ jar
> ]-
> [INFO]
> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
> (file:/home/zahid/.m2/repository/org/apache/spark/spark-unsafe_2.12/2.4.5/spark-unsafe_2.12-2.4.5.jar)
> to method java.nio.Bits.unaligned()
> WARNING: Please consider reporting this to the maintainers of
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 20/02/29 17:20:40 INFO SparkContext: Running Spark version 2.4.5
> 20/02/29 17:20:40 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 20/02/29 17:20:41 INFO SparkContext: Submitted application: Word Count
> 20/02/29 17:20:41 INFO SecurityManager: Changing view acls to: zahid
> 20/02/29 17:20:41 INFO SecurityManager: Changing modify acls to: zahid
> 20/02/29 17:20:41 INFO SecurityManager: Changing view acls groups to:
> 20/02/29 17:20:41 INFO SecurityManager: Changing modify acls groups to:
> 20/02/29 17:20:41 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(zahid);
> groups with view permissions: Set(); users  with modify permissions:
> Set(zahid); groups with modify permissions: Set()
> 20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
> random free port. You may check whether configuring an appropriate binding
> address.
> 20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
> random free port. You may check whether configuring an appropriate binding
> address.
> 20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
> random free port. You may check whether configuring an appropriate binding
> address.
> 20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
> random free port. You may check whether configuring an appropriate binding
> address.
> 20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
> random free port. You may check whether configuring an appropriate binding
> address.
> 20/02/29 17:

Can't get Spark to interface with S3A Filesystem with correct credentials

2020-03-04 Thread Devin Boyer
Hello,

I'm attempting to run Spark within a Docker container with the hope of
eventually running Spark on Kubernetes. Nearly all the data we currently
process with Spark is stored in S3, so I need to be able to interface with
it using the S3A filesystem.

I feel like I've gotten close to getting this working but for some reason
cannot get my local Spark installations to correctly interface with S3 yet.

A basic example of what I've tried:

   - Build Kubernetes docker images by downloading the
   spark-2.4.5-bin-hadoop2.7.tgz archive and building the
   kubernetes/dockerfiles/spark/Dockerfile image.
   - Run an interactive docker container using the above built image.
   - Within that container, run spark-shell. This command passes valid AWS
   credentials by setting spark.hadoop.fs.s3a.access.key and
   spark.hadoop.fs.s3a.secret.key using --conf flags, and downloads the
   hadoop-aws package by specifying the --packages
   org.apache.hadoop:hadoop-aws:2.7.3 flag.
   - Try to access the simple public file as outlined in the "Integration
   with Cloud Infrastructures" documentation by running:
   sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
   - Observe this to fail with a 403 Forbidden exception thrown by S3


I've tried a variety of other means of setting credentials (like exporting
the standard AWS_ACCESS_KEY_ID environment variable before launching
spark-shell), and other means of building a Spark image and including the
appropriate libraries (see this Github repo:
https://github.com/drboyer/spark-s3a-demo), all with the same results. I've
tried also accessing objects within our AWS account, rather than the object
from the public landsat-pds bucket, with the same 403 error being thrown.

Can anyone help explain why I can't seem to connect to S3 successfully
using Spark, or even explain where I could look for additional clues as to
what's misconfigured? I've tried turning up the logging verbosity and
didn't see much that was particularly useful, but happy to share additional
log output too.

Thanks for any help you can provide!

Best,
Devin Boyer


Hostname :BUG

2020-03-04 Thread Zahid Rahman
Hi,

I found the problem was that on my Linux operating system /etc/hostname
was blank.

*STEP 1*
I searched Google for the error message and found an answer suggesting I
add the following to /etc/hostname:

127.0.0.1  [hostname] localhost.

I did that, but there was still an error; this time the Spark log on
standard output was concatenating the text content of /etc/hostname like
so: 127.0.0.1[hostname]localhost.

*STEP 2*
My second attempt was to change /etc/hostname to 127.0.0.1.
This time I got a warning about "using loopback" rather than an error.

*STEP 3*
I wasn't happy with that, so I then changed /etc/hostname to the value
shown below, and the warning message disappeared. My guess is that the
error is caused by the act of creating the Spark session via the
SparkConf() API.

 SparkConf sparkConf = new SparkConf()
 .setAppName("Simple Application")
 .setMaster("local")
 .set("spark.executor.memory","2g");

$ cat /etc/hostname
# hosts This file describes a number of hostname-to-address
#   mappings for the TCP/IP subsystem.  It is mostly
#   used at boot time, when no name servers are running.
#   On small systems, this file can be used instead of a
#   "named" name server.
# Syntax:
#
# IP-Address  Full-Qualified-Hostname  Short-Hostname
#

192.168.0.42












zahid@localhost:~/Downloads/apachespark/Apache-Spark-Example/Java-Code-Geek>
mvn exec:java -Dexec.mainClass=com.javacodegeek.examples.SparkExampleRDD
-Dexec.args="input.txt"
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model
for javacodegeek:examples:jar:1.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for
org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 12,
column 21
[WARNING]
[WARNING] It is highly recommended to fix these problems because they
threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support
building such malformed projects.
[WARNING]
[INFO]
[INFO] ---< javacodegeek:examples
>
[INFO] Building examples 1.0-SNAPSHOT
[INFO] [ jar
]-
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
(file:/home/zahid/.m2/repository/org/apache/spark/spark-unsafe_2.12/2.4.5/spark-unsafe_2.12-2.4.5.jar)
to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of
org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal
reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
20/02/29 17:20:40 INFO SparkContext: Running Spark version 2.4.5
20/02/29 17:20:40 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
20/02/29 17:20:41 INFO SparkContext: Submitted application: Word Count
20/02/29 17:20:41 INFO SecurityManager: Changing view acls to: zahid
20/02/29 17:20:41 INFO SecurityManager: Changing modify acls to: zahid
20/02/29 17:20:41 INFO SecurityManager: Changing view acls groups to:
20/02/29 17:20:41 INFO SecurityManager: Changing modify acls groups to:
20/02/29 17:20:41 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users  with view permissions: Set(zahid);
groups with view permissions: Set(); users  with modify permissions:
Set(zahid); groups with modify permissions: Set()
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' c

Re: What is the best way to consume parallely from multiple topics in Spark Stream with Kafka

2020-03-04 Thread Gerard Maas
Hi Hrishi,

When using the Direct Kafka stream approach, processing tasks will be
distributed to the cluster.
The level of parallelism is dependent on how many partitions the consumed
topics have.
Why do you think that the processing is not happening in parallel?

I would advise you to get the base scenario working before looking into
advanced features like `concurrentJobs` or a particular scheduler.
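
In case it is useful, a minimal sketch (topic and broker names are made up)
of a single direct stream subscribed to several topics: each Kafka partition
becomes one Spark partition, so all topics are processed in parallel across
the cluster without spark.streaming.concurrentJobs.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("multi-topic-example")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "latest"
)

// One stream, many topics; parallelism = total number of Kafka partitions.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("topicA", "topicB", "topicC"), kafkaParams)
)

stream.foreachRDD { rdd =>
  println(s"Partitions in this batch: ${rdd.getNumPartitions}")
}

ssc.start()
ssc.awaitTermination()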

kind regards, Gerard.

On Wed, Mar 4, 2020 at 7:42 PM Hrishikesh Mishra 
wrote:

> Hi
>
> My Spark Streaming job consumes from multiple Kafka topics. How can I
> process them in parallel? Should I try *spark.streaming.concurrentJobs*? It
> has some adverse effects, as mentioned by its creator. Is it still valid
> with Spark 2.4 and the direct Kafka stream? What about FAIR scheduling
> mode; will it help in this scenario? I am not finding any valid links on
> this.
>
> Regards
> Hrishi
>
>


Re: Schema store for Parquet

2020-03-04 Thread Magnus Nilsson
Apache Atlas is the Apache data catalog; you may want to look into that. It
depends on what your use case is.
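
Whatever catalog ends up holding them, a rough sketch of the Spark primitives
involved (the path below is hypothetical): a StructType round-trips through
JSON, so the schema string can live in Atlas, a Hive metastore, or any other
store and be re-applied when reading the Parquet data back.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

val spark = SparkSession.builder().appName("schema-json-example").getOrCreate()

// Capture the schema of an existing Parquet dataset as JSON text.
val df = spark.read.parquet("/data/example.parquet")  // hypothetical path
val schemaJson: String = df.schema.json               // persist this string anywhere

// Later: rebuild the StructType and enforce it when reading.
val restored = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val df2 = spark.read.schema(restored).parquet("/data/example.parquet")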

On Wed, Mar 4, 2020 at 8:01 PM Ruijing Li  wrote:

> Thanks Lucas and Magnus,
>
> Would there be any open-source solutions other than the Apache Hive
> metastore, if we don’t wish to use Apache Hive and Spark?
>
> Thanks.
>
> On Wed, Mar 4, 2020 at 10:40 AM lucas.g...@gmail.com 
> wrote:
>
>> Or AWS glue catalog if you're in AWS
>>
>> On Wed, 4 Mar 2020 at 10:35, Magnus Nilsson  wrote:
>>
>>> Google hive metastore.
>>>
>>> On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li  wrote:
>>>
 Hi all,

 Has anyone explored efforts to have a centralized storage of schemas of
 different parquet files? I know there is schema management for Avro, but
 couldn’t find solutions for parquet schema management. Thanks!
 --
 Cheers,
 Ruijing Li

>>> --
> Cheers,
> Ruijing Li
>


Re: Schema store for Parquet

2020-03-04 Thread Ruijing Li
Thanks Lucas and Magnus,

Would there be any open-source solutions other than the Apache Hive
metastore, if we don’t wish to use Apache Hive and Spark?

Thanks.

On Wed, Mar 4, 2020 at 10:40 AM lucas.g...@gmail.com 
wrote:

> Or AWS glue catalog if you're in AWS
>
> On Wed, 4 Mar 2020 at 10:35, Magnus Nilsson  wrote:
>
>> Google hive metastore.
>>
>> On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li  wrote:
>>
>>> Hi all,
>>>
>>> Has anyone explored efforts to have a centralized storage of schemas of
>>> different parquet files? I know there is schema management for Avro, but
>>> couldn’t find solutions for parquet schema management. Thanks!
>>> --
>>> Cheers,
>>> Ruijing Li
>>>
>> --
Cheers,
Ruijing Li


What is the best way to consume parallely from multiple topics in Spark Stream with Kafka

2020-03-04 Thread Hrishikesh Mishra
Hi

My Spark Streaming job consumes from multiple Kafka topics. How can I
process them in parallel? Should I try *spark.streaming.concurrentJobs*? It
has some adverse effects, as mentioned by its creator. Is it still valid
with Spark 2.4 and the direct Kafka stream? What about FAIR scheduling mode;
will it help in this scenario? I am not finding any valid links on this.

Regards
Hrishi


Re: Schema store for Parquet

2020-03-04 Thread lucas.g...@gmail.com
Or AWS glue catalog if you're in AWS

On Wed, 4 Mar 2020 at 10:35, Magnus Nilsson  wrote:

> Google hive metastore.
>
> On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li  wrote:
>
>> Hi all,
>>
>> Has anyone explored efforts to have a centralized storage of schemas of
>> different parquet files? I know there is schema management for Avro, but
>> couldn’t find solutions for parquet schema management. Thanks!
>> --
>> Cheers,
>> Ruijing Li
>>
>


Re: Schema store for Parquet

2020-03-04 Thread Magnus Nilsson
Google hive metastore.

On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li  wrote:

> Hi all,
>
> Has anyone explored efforts to have a centralized storage of schemas of
> different parquet files? I know there is schema management for Avro, but
> couldn’t find solutions for parquet schema management. Thanks!
> --
> Cheers,
> Ruijing Li
>


Schema store for Parquet

2020-03-04 Thread Ruijing Li
Hi all,

Has anyone explored efforts to have a centralized storage of schemas of
different parquet files? I know there is schema management for Avro, but
couldn’t find solutions for parquet schema management. Thanks!
-- 
Cheers,
Ruijing Li


Read Hive ACID Managed table in Spark

2020-03-04 Thread Chetan Khatri
Hi Spark Users,
I want to read Hive ACID managed table data (ORC) in Spark. Can someone
help me here?
I've tried https://github.com/qubole/spark-acid, but with no success.

Thanks


Re: Way to get the file name of the output when doing ORC write from dataframe

2020-03-04 Thread Manjunath Shetty H
Or is there any way to provide a unique file name to the ORC write function
itself?

Any suggestions will be helpful.

Regards
Manjunath Shetty

From: Manjunath Shetty H 
Sent: Wednesday, March 4, 2020 2:28 PM
To: user 
Subject: Way to get the file name of the output when doing ORC write from 
dataframe

Hi,

I wanted to know if there is any way to get the output file name that
`Dataframe.orc()` will write to. This is needed to track which file is
written by which job during incremental batch jobs.

Thanks
Manjunath


Re: How to collect Spark dataframe write metrics

2020-03-04 Thread Manjunath Shetty H
Thanks Zohar,

Will try that


-
Manjunath

From: Zohar Stiro 
Sent: Tuesday, March 3, 2020 1:49 PM
To: Manjunath Shetty H 
Cc: user 
Subject: Re: How to collect Spark dataframe write metrics

Hi,

To get DataFrame-level write metrics, you can take a look at the following
trait:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteStatsTracker.scala
and a basic implementation example:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala

and here is an example of how it is being used in FileStreamSink:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L178

- About the good practice: it depends on your use case, but generally
speaking I would not do it, at least not for checking your logic or checking
that Spark is working correctly.
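
That said, if the row-count check is all that is needed, a simpler (if
heavier) sketch that avoids the tracker API entirely, at the cost of reading
the written data back once (path handling is a placeholder):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Count before the write, write, then count what actually landed on HDFS.
def writeWithCountCheck(df: DataFrame, path: String)(implicit spark: SparkSession): Unit = {
  val expected = df.count()
  df.write.mode("overwrite").orc(path)
  val actual = spark.read.orc(path).count()
  require(expected == actual, s"Row count mismatch: wrote $expected, read back $actual")
}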

On Sunday, March 1, 2020 at 14:32, Manjunath Shetty H
<manjunathshe...@live.com> wrote:
Hi all,

Basically my use case is to validate the DataFrame row count before and
after writing to HDFS. Is this even good practice? Or should I rely on Spark
for guaranteed writes?

If it is a good practice to follow, then how do I get the DataFrame-level
write metrics?

Any pointers would be helpful.


Thanks and Regards
Manjunath


Way to get the file name of the output when doing ORC write from dataframe

2020-03-04 Thread Manjunath Shetty H
Hi,

I wanted to know if there is any way to get the output file name that
`Dataframe.orc()` will write to. This is needed to track which file is
written by which job during incremental batch jobs.

Thanks
Manjunath