Re: Why were changes of SPARK-9241 removed?

2020-03-12 Thread Xiao Li
I do not think we intentionally dropped it. Could you open a ticket in
Spark JIRA with your query?

Cheers,

Xiao

On Thu, Mar 12, 2020 at 8:24 PM 马阳阳  wrote:

> Hi,
> I wonder why the changes made in "[SPARK-9241][SQL] Supporting multiple
> DISTINCT columns (2) - Rewriting Rule" are not present in Spark
> (version 2.4) now. This makes execution of count distinct in Spark much
> slower than in Spark 1.6 and Hive (Spark 2.4.4: more than 18 minutes;
> Hive: about 80 seconds; Spark 1.6: about 3 minutes).
>
>
> --
> Sent from Postbox 
>





Why were changes of SPARK-9241 removed?

2020-03-12 Thread 马阳阳

Hi,
I wonder why the changes made in "[SPARK-9241][SQL] Supporting multiple
DISTINCT columns (2) - Rewriting Rule" are not present in Spark
(version 2.4) now. This makes execution of count distinct in Spark much
slower than in Spark 1.6 and Hive (Spark 2.4.4: more than 18 minutes;
Hive: about 80 seconds; Spark 1.6: about 3 minutes).
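
For reference, a minimal sketch (in Scala) of the kind of multi-column
DISTINCT aggregation affected by that rewrite rule; the table and column
names are hypothetical, and it assumes an existing SparkSession named spark:

// Compare the plan and timing of this query across versions.
val agg = spark.sql(
  """SELECT key,
    |       COUNT(DISTINCT user_id)    AS distinct_users,
    |       COUNT(DISTINCT session_id) AS distinct_sessions
    |FROM events
    |GROUP BY key""".stripMargin)

agg.explain()   // inspect how the multiple distinct aggregates are planned
agg.show()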




--
Sent from Postbox 


Re: Hostname :BUG

2020-03-12 Thread Zahid Rahman
Hey Dodgy Bob, Linux & C programmers, conscientious non-objectors,

I have a great idea I want to share with you.
In Linux I am familiar with wc (wc = word count; Linux users don't like
long-winded typing).
wc flags are:
   -c, --bytes   print the byte counts
   -m, --chars   print the character counts
   -l, --lines   print the newline counts


zahid@192:~/Downloads> wc -w /etc/hostname
55 /etc/hostname

The first program I was tasked to write in C was to replicate the Linux
wc utility. I called it wordcount.c, invoked like wordcount -c -l -m or
wordcount -c -l /etc.

Anyway, on this page https://spark.apache.org/examples.html
there are examples of word count in Scala, Python and Java.

I kinda feel left out because I know a little C and a little Linux.
I think it is a great idea for the sake of "familiarity for the client"
(the application developer).
I was thinking of raising a JIRA but I thought I would consult with fellow
developers first. :)

Please be kind.
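
For reference, a minimal sketch of the word-count pattern in Scala, roughly
the distributed analogue of wc -w; the input path is hypothetical and an
existing SparkContext named sc is assumed:

val wordCounts = sc.textFile("input.txt")          // hypothetical input file
  .flatMap(line => line.split("\\s+"))             // split lines into words
  .filter(word => word.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)                              // per-word counts

println(wordCounts.map(_._2).sum())                // total words, like `wc -w`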

Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



On Mon, 9 Mar 2020 at 08:57, Zahid Rahman  wrote:

> Hey Floyd,
>
> I just realised something:
> You need to practice using the adduser command to create users (or, in
> your case, useradd, because that is less painful for you) instead of
> working as root. Trust me, it is good for you.
> Then you will realise that this bit of code, new SparkConf(), is reading
> the IP address from /etc/hostname and not from the /etc/hosts file.
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>
>
> On Wed, 4 Mar 2020 at 21:14, Andrew Melo  wrote:
>
>> Hello Zahid,
>>
>> On Wed, Mar 4, 2020 at 1:47 PM Zahid Rahman  wrote:
>>
>>> Hi,
>>>
>>> I found the problem was that on my Linux operating system the
>>> /etc/hostname file was blank.
>>>
>>> *STEP 1*
>>> I searched Google for the error message and there was an answer
>>> suggesting I should add the following to /etc/hostname:
>>>
>>> 127.0.0.1  [hostname] localhost
>>>
>>
>> I believe you've confused /etc/hostname and /etc/hosts --
>>
>>
>>>
>>> I did that, but there was still an error; this time the Spark log on
>>> standard output was concatenating the text content of /etc/hostname,
>>> like so: 127.0.0.1[hostname]localhost.
>>>
>>> *STEP 2*
>>> My second attempt was to change /etc/hostname to 127.0.0.1.
>>> This time I was getting a warning about "using loopback" rather than an
>>> error.
>>>
>>> *STEP 3*
>>> I wasn't happy with that, so I then changed /etc/hostname to the content
>>> shown below, and the warning message disappeared. My guess is that the
>>> error arises when the Spark session is created, in the SparkConf() API.
>>>
>>>  SparkConf sparkConf = new SparkConf()
>>>  .setAppName("Simple Application")
>>>  .setMaster("local")
>>>  .set("spark.executor.memory","2g");
>>>
>>> $ cat /etc/hostname
>>> # hosts This file describes a number of hostname-to-address
>>> #   mappings for the TCP/IP subsystem.  It is mostly
>>> #   used at boot time, when no name servers are running.
>>> #   On small systems, this file can be used instead of a
>>> #   "named" name server.
>>> # Syntax:
>>> #
>>> # IP-Address  Full-Qualified-Hostname  Short-Hostname
>>> #
>>>
>>> 192.168.0.42
>>>
>>>
>>> zahid@localhost:~/Downloads/apachespark/Apache-Spark-Example/Java-Code-Geek>
>>> mvn exec:java -Dexec.mainClass=com.javacodegeek.examples.SparkExampleRDD -Dexec.args="input.txt"
>>> [INFO] Scanning for projects...
>>> [WARNING]
>>> [WARNING] Some problems were encountered while building the effective
>>> model for javacodegeek:examples:jar:1.0-SNAPSHOT
>>> [WARNING] 'build.plugins.plugin.version' for
>>> org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 12,
>>> column 21
>>> [WARNING]
>>> [WARNING] It is highly recommended to fix these problems because they
>>> threaten the stability of your build.
>>> [WARNING]
>>> [WARNING] For this reason, future Maven versions might no longer support
>>> building such malformed projects.
>>> [WARNING]
>>> [INFO]
>>> [INFO] ----------------------< javacodegeek:examples >----------------------
>>> [INFO] Building examples 1.0-SNAPSHOT
>>> [INFO] --------------------------------[ jar ]--------------------------------
>>> [INFO]
>>> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
>>> WARNING: An illegal reflective access operation has occurred
>>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
>>> (file:/home/zahid/.m2/repository/org/apache/spark/spark-unsafe_2.12/2.4.5/spark-unsafe_2.12-2.4.5.jar)
>>> to method java.nio.Bits.unaligned()
>>> WARNING: Please consider reporting this to the maintainers of
>>> org.apache.spark.unsafe.Platform
>>> WA
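
For anyone hitting the same name-resolution problem as in the SparkConf
example quoted above, here is a minimal sketch (Scala, local mode, values
are placeholders) of pinning the driver address explicitly with the standard
spark.driver.bindAddress / spark.driver.host settings, so Spark does not
depend on what happens to be in /etc/hostname:

import org.apache.spark.sql.SparkSession

// Alternatively, export SPARK_LOCAL_IP=127.0.0.1 before launching.
val spark = SparkSession.builder()
  .appName("Simple Application")
  .master("local[*]")
  .config("spark.driver.bindAddress", "127.0.0.1")  // address the driver binds to
  .config("spark.driver.host", "127.0.0.1")         // address advertised to executors
  .getOrCreate()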

Scala vs PySpark Inconsistency: SQLContext/SparkSession access from DataFrame/DataSet

2020-03-12 Thread Ben Roling
I've noticed that Dataset.sqlContext is public in Scala, but the equivalent
(DataFrame._sc) in PySpark is named as if it should be treated as private.

Is this intentional?  If so, what's the rationale?  If not, then it feels
like a bug and DataFrame should have some form of public access back to the
context/session.  I'm happy to log the bug but thought I would ask here
first.  Thanks!
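
For context, a minimal Scala sketch of the accessors being compared; ds is
an arbitrary existing Dataset:

import org.apache.spark.sql.{Dataset, Row, SparkSession, SQLContext}

// Both accessors are public in the Scala API; no underscore-prefixed
// field is needed to get back to the session or context.
def contextsOf(ds: Dataset[Row]): (SparkSession, SQLContext) =
  (ds.sparkSession, ds.sqlContext)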


[Spark MicroBatchExecution] Error fetching kafka/checkpoint/state/0/0/1.delta does not exist

2020-03-12 Thread Miguel Silvestre
Hi community,

I'm having this error in some Kafka streams:

Caused by: java.io.FileNotFoundException: File
file:/efs/.../kafka/checkpoint/state/0/0/1.delta does not exist

Because of this I have some streams down. How can I fix this?

Thank you.

--
Miguel Silvestre
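
For context, the state/0/0/1.delta file in that error is written by the
state store under the query's checkpointLocation, so that directory has to
remain intact (and on durable storage visible to all executors) across
restarts. A minimal sketch of where the path comes from, with a hypothetical
broker, topic, and checkpoint path, assuming an existing SparkSession named
spark and the spark-sql-kafka-0-10 package on the classpath:

import spark.implicits._

val counts = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // hypothetical broker
  .option("subscribe", "events")                     // hypothetical topic
  .load()
  .groupBy($"key")   // a stateful operator is what creates the state store files
  .count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/efs/checkpoints/events")  // state/<op>/<partition>/<version>.delta lives under here
  .start()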


Exception during writing a spark Dataframe to Redshift

2020-03-12 Thread Sandeep Patra
This is where the exception occurs:

myAppDes.coalesce(1)
  .write
  .format("com.databricks.spark.redshift")
  .option("url", redshiftURL)
  .option("dbtable", redshiftTableName)
  .option("forward_spark_s3_credentials", "true")
  .option("tempdir", "s3a://zest-hevo-datalake/temp/data")
  .mode(SaveMode.Append)
  .save()

I have attached the stack trace with the email.

My build.sbt looks like:
version := "1.0"

scalaVersion := "2.11.8"

val sparkVersion = "2.3.0"
val hadoopVersion = "3.1.2"

resolvers ++= Seq(
  "apache-snapshots" at "http://repository.apache.org/snapshots/",
  "redshift" at "https://s3.amazonaws.com/redshift-maven-repository/release",
  "redshift" at "http://redshift-maven-repository.s3-website-us-east-1.amazonaws.com/release",
  "jitpack" at "https://jitpack.io"
)

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion,
  "org.apache.hadoop" % "hadoop-aws" % hadoopVersion,
  "org.apache.hadoop" % "hadoop-common" % hadoopVersion,
  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % hadoopVersion,
  "com.amazon.redshift" % "redshift-jdbc42-no-awssdk" % "1.2.15.1025",
  "com.databricks" %% "spark-redshift" % "3.0.0-preview1",
  "com.amazon.redshift" % "redshift-jdbc42" % "1.2.1.1001"
)

dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.8"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.8"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.8"

I have checked that I can read data from the same Redshift instance. I have
also created the table in Redshift. (I also tried it without creating the
table first, but got the same error.)


[Attachment: exception (binary data)]

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org