Spark 3.1.1 compile error when building against Hadoop 2.6.0-cdh5.13.1

2021-03-17 Thread jiahong li
Hi everyone,
When I compile Spark against Hadoop version 2.6.0-cdh5.13.1, the compile
command is
./dev/make-distribution.sh --name 2.6.0-cdh5.13.1 --pip --tgz -Phive
-Phive-thriftserver -Pyarn -Dhadoop.version=2.6.0-cdh5.13.1
and I get errors like this:
[INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @
spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiler bridge file:
.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.10__52.0-1.3.1_20191012T045515.jar
[INFO] compiler plugin:
BasicArtifact(com.github.ghik,silencer-plugin_2.12.10,1.6.0,null)
[INFO] Compiling 560 Scala sources and 99 Java sources to
spark/core/target/scala-2.12/classes ...
[ERROR] [Error]
spark/core/src/main/scala/org/apache/spark/ui/HttpSecurityFilter.scala:107:
type mismatch;
 found   : K where type K
 required: String
[ERROR] [Error]
spark/core/src/main/scala/org/apache/spark/ui/HttpSecurityFilter.scala:107:
value map is not a member of V
[ERROR] [Error]
spark/core/src/main/scala/org/apache/spark/ui/HttpSecurityFilter.scala:107:
missing argument list for method stripXSS in class XssSafeRequest
Unapplied methods are only converted to functions when a function type is
expected.
You can make this conversion explicit by writing `stripXSS _` or
`stripXSS(_)` instead of `stripXSS`.
[ERROR] [Error]
spark/core/src/main/scala/org/apache/spark/ui/PagedTable.scala:307: value
startsWith is not a member of K
[ERROR] [Error]
spark/core/src/main/scala/org/apache/spark/util/Utils.scala:580: value
toLowerCase is not a member of object org.apache.hadoop.util.StringUtils
[ERROR] 5 errors found
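
A minimal sketch of one possible reading of the first four errors (an
assumption on my part, not the Spark source): if an old, pre-generics
servlet API on the CDH 5 classpath makes getParameterMap() return a raw
java.util.Map, then on the Scala side the key/value types are unknown
existentials with no String methods. The last error just shows that
org.apache.hadoop.util.StringUtils in Hadoop 2.6 has no toLowerCase helper
for the compiler to find.

    // Illustrative only -- not the Spark source. With a raw (unparameterized)
    // java.util.Map, Scala infers existential key/value types, so String
    // methods are not available on them, matching the K/V errors above.
    import scala.collection.JavaConverters._

    object RawMapSketch {
      // Simulates what an old servlet API hands back for getParameterMap().
      val raw: java.util.Map[_, _] =
        java.util.Collections.singletonMap("user", Array("alice"))

      def main(args: Array[String]): Unit = {
        raw.asScala.foreach { case (name, values) =>
          // name.startsWith("u")  // does not compile: startsWith is not a member of the key type
          // values.map(identity)  // does not compile: map is not a member of the value type
          println(s"$name -> $values")  // only methods on Any are available here
        }
      }
    }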

How can I compile Spark against Hadoop version 2.6.0-cdh5.13.1? Is there an
existing JIRA for this?

Dereck Li
Apache Spark Contributor
Continuing Learner
@Hangzhou, China


Is Spark rdd.toDF() thread-safe?

2021-03-17 Thread yujhe.li
Hi, 

I have an application that runs on a Spark 2.4.4 cluster; it transforms two
RDDs to DataFrames with `rdd.toDF()` and then writes them out to files.

To make better use of executor resources, the application runs the two jobs
in separate threads. The code looks roughly like this:
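
A sketch of the pattern (the class names, fields, and output paths below are
made up for illustration; only the shape of the code matters): two RDDs are
converted with toDF() from separate threads and written out concurrently.

    import org.apache.spark.sql.SparkSession
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, Future}

    // Illustrative case classes; the real job uses its own types.
    case class Click(userId: Long, url: String)
    case class Order(orderId: Long, amount: Double)

    object TwoThreadedWrites {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("toDF-in-threads").getOrCreate()
        import spark.implicits._

        val clicks = spark.sparkContext.parallelize(Seq(Click(1L, "/a"), Click(2L, "/b")))
        val orders = spark.sparkContext.parallelize(Seq(Order(1L, 9.5), Order(2L, 3.0)))

        // Each thread converts its RDD to a DataFrame and writes it out.
        val jobs = Seq(
          Future { clicks.toDF().write.mode("overwrite").parquet("/tmp/clicks") },
          Future { orders.toDF().write.mode("overwrite").parquet("/tmp/orders") }
        )
        jobs.foreach(Await.result(_, Duration.Inf))
        spark.stop()
      }
    }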

And I found that `toDF()` is not thread-safe: the application sometimes fails
with `java.lang.UnsupportedOperationException`. You can reproduce it with a
snippet like the one below (it fails roughly 1% of the time, and is more
likely to fail when the case class has a large number of fields):
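
A sketch of such a reproduction (the wide case class and the thread count are
made up; the point is only that many threads hit toDF() for the same case
class at once):

    import org.apache.spark.sql.SparkSession
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, Future}

    // A case class with many fields makes the race easier to hit.
    case class Wide(f1: Int, f2: Int, f3: Int, f4: Int, f5: Int,
                    f6: Int, f7: Int, f8: Int, f9: Int, f10: Int)

    object ToDFRace {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("toDF-race").getOrCreate()
        import spark.implicits._
        val rdd = spark.sparkContext.parallelize(Seq(Wide(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)))

        // Many concurrent toDF() calls; occasionally one fails while deriving the schema.
        val attempts = (1 to 50).map { _ =>
          Future { rdd.toDF().count() }
        }
        attempts.foreach(Await.result(_, Duration.Inf))
        spark.stop()
      }
    }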

You may get a similar exception message: `Schema for type A is not
supported`



From the error message, it is caused by `ScalaReflection.schemaFor()`.
I have looked at the code; it seems that Spark uses Scala reflection to get
the data type, and as far as I know there are concurrency issues in Scala
reflection.

See SPARK-26555 and the Stack Overflow question
"thread-safety-in-scala-reflection-with-type-matching".

Should we fix it? I cannot find any documentation about thread-safety when
creating DataFrames.

I have worked around this by taking a lock when transforming an RDD to a
DataFrame.
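
A sketch of that kind of workaround (a single JVM-wide lock around the
conversion; the helper object and method names are made up):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import scala.reflect.runtime.universe.TypeTag

    object SafeToDF {
      // One lock per JVM: only one thread derives a schema at a time.
      private val lock = new Object

      def toDFSafely[T <: Product : TypeTag](spark: SparkSession, rdd: RDD[T]): DataFrame =
        lock.synchronized {
          import spark.implicits._
          rdd.toDF()
        }
    }

Threads then call SafeToDF.toDFSafely(spark, rdd) instead of rdd.toDF()
directly; everything after the conversion can still run concurrently.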




Re: Spark Structured Streaming and Kafka message schema evolution

2021-03-17 Thread Mich Talebzadeh
Thanks Jungtaek.

I have reasons for this, so I will bring it up in another thread.

Cheers,



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 15 Mar 2021 at 21:38, Jungtaek Lim wrote:

> If I understand correctly, SQL semantics are strict about column schema.
> Reading via the Kafka data source doesn't require you to specify a schema,
> as it provides the key and value as binary, but once you deserialize them,
> unless you keep the type primitive (e.g. String), you'll need to specify
> the schema, as from_json requires you to.
>
> This wouldn't change even if you leverage a Schema Registry - you'll need
> to provide a schema that is compatible with all the schemas the records are
> associated with. I guess that's guaranteed if you use the latest version of
> the schema and you've only changed the schema in backward-compatible ways.
> I admit I haven't dealt with SR in SSS, but if you integrate the schema
> into the query plan, the running query is unlikely to pick up the latest
> schema; still, that shouldn't matter, as your query should only use the
> part of the schema you've integrated, and the latest schema is backward
> compatible with the integrated schema.
>
> Hope this helps.
>
> Thanks
> Jungtaek Lim (HeartSaVioR)
>
> On Mon, Mar 15, 2021 at 9:25 PM Mich Talebzadeh wrote:
>
>> This is just a query.
>>
>> In general, Kafka Connect requires a means to register that schema so that
>> producers and consumers understand it. It also allows schema evolution,
>> i.e. changes to the metadata that describes the structure of the data sent
>> via the topic.
>>
>> When we stream a Kafka topic into Spark Structured Streaming (SSS), the
>> assumption is that by the time Spark processes the data, its structure can
>> be established. With foreachBatch, we create a dataframe on top of
>> incoming batches of JSON messages, and that dataframe can be interrogated.
>> However, the processing may fail if another column is added to the topic
>> and the consumer (in this case SSS) is not aware of it. How can this
>> change of schema be verified?
>>
>> Thanks
>>
>
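
For reference, a minimal sketch of the pattern discussed above: reading the
Kafka value as binary, deserializing it with from_json against an explicit
schema, and selecting only the columns the query was written against. The
broker, topic, paths, and field names here are illustrative, not from the
original application.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{LongType, StringType, StructType}

    object KafkaJsonStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-json").getOrCreate()

        // The schema the query is written against. Columns added to the topic
        // later in a backward-compatible way are simply never selected here.
        val schema = new StructType()
          .add("id", LongType)
          .add("event", StringType)

        val parsed = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(from_json(col("value").cast("string"), schema).as("data"))
          .select("data.*")

        parsed.writeStream
          .format("parquet")
          .option("path", "/tmp/events")
          .option("checkpointLocation", "/tmp/events_chk")
          .start()
          .awaitTermination()
      }
    }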


Re: Submitting insert query from beeline failing on executor server with java 11

2021-03-17 Thread kaki mahesh raja
Hi Jungtaek Lim,

Thanks for the response. So we have no option but to wait until Hadoop
officially supports Java 11.


Thanks and regards,
kaki mahesh raja


