pyspark + from_json(col("col_name"), schema) returns all null

2017-12-09 Thread salemi
Hi All,

I am using PySpark to consume messages from Kafka, and when I call
.select(from_json(col("col_name"), schema)) the returned values are all null.

I looked at the JSON messages and they are valid strings.

Any ideas?
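
A common cause is a schema that does not exactly match the JSON: from_json returns
null for the whole record whenever parsing against the supplied schema fails. A
minimal PySpark sketch for checking this in isolation (field names are hypothetical,
not from the original message):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("from_json_check").getOrCreate()

# Every field name and type must match the JSON exactly; otherwise
# from_json silently yields null for the record.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])

# Test against a literal JSON string before wiring in Kafka.
df = spark.createDataFrame([('{"id": 1, "name": "test"}',)], ["col_name"])
df.select(from_json(col("col_name"), schema).alias("parsed")).show(truncate=False)

# With the Kafka source, cast the binary value column to string first:
# kafka_df.select(from_json(col("value").cast("string"), schema))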






Re: Weight column values not used in Binary Logistic Regression Summary

2017-12-09 Thread Sea aj
Hello everyone,

I have a DataFrame with two columns: ids and features.

Each cell in the features column is an array of Vectors.dense values, for example:

[(DenseVector([0.5692]),), (DenseVector([0.5086]),)]

I need to train a new model for every single row of my DataFrame. How can I do that?






On Sat, Nov 18, 2017 at 9:53 AM, Stephen Boesch  wrote:

> In BinaryLogisticRegressionSummary there are @Since("1.5.0") tags on a
> number of comments identical to the following:
>
> * @note This ignores instance weights (setting all to 1.0) from 
> `LogisticRegression.weightCol`.
> * This will change in later Spark versions.
>
>
> Are there any plans to address this? Our team is using instance weights
> with sklearn LogisticRegression - and this limitation will complicate a
> potential migration.
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L1543
>
>
>
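
For reference, a minimal PySpark sketch of where weightCol is set; per the @note
quoted above, the weights are used when fitting the model but the training summary
treats them all as 1.0 (the toy data below is hypothetical):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("weighted_lr").getOrCreate()

# Hypothetical toy data: (label, features, instance weight).
train = spark.createDataFrame([
    (0.0, Vectors.dense([0.1, 1.2]), 1.0),
    (1.0, Vectors.dense([2.3, 0.4]), 5.0),  # counts 5x during fitting
    (1.0, Vectors.dense([1.9, 0.7]), 1.0),
], ["label", "features", "weight"])

# weightCol is honored by the optimizer when fitting the model ...
lr = LogisticRegression(featuresCol="features", labelCol="label", weightCol="weight")
model = lr.fit(train)

# ... but the training summary ignores the weights, per the @note.
print(model.summary.areaUnderROC)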


[CFP] DataWorks Summit Europe 2018 - Call for abstracts

2017-12-09 Thread Yanbo Liang
The DataWorks Summit Europe will be in Berlin, Germany, on April 16-19, 2018. This
is a great place to talk about the work you are doing in Apache Spark, or how you
are using Spark for SQL/streaming processing, machine learning, and data science.
Information on submitting an abstract is at https://dataworkssummit.com/berlin-2018/.

Tracks:
Data Warehousing and Operational Data Stores
Artificial Intelligence and Data Science
Big Compute and Storage
Cloud and Operations
Governance and Security
Cyber Security
IoT and Streaming
Enterprise Adoption

Deadline: December 15th, 2017

Re: Save hive table from spark in hive 2.1.1

2017-12-09 Thread रविशंकर नायर
There is an option in hive-site.xml to skip metastore schema validation. Try
setting the property below to false and retry. The Hive schematool can also help.

<property>
  <name>hive.metastore.schema.verification</name>
  <value>true</value>
</property>

Best,
Ravion


On Dec 9, 2017 5:56 PM, "konu"  wrote:

I am using Spark 2.2 with Scala, Hive 2.1.1, and Zeppelin on Ubuntu 16.04. I copied
hive-site.xml to spark/conf/ and mysql-connector-java.jar from hive/libs to
spark/jars.

I want to save a DataFrame as a Hive table:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
df.registerTempTable("myTempTable")
hc.sql("create table store_sales as select * from myTempTable")

The table store_sales is created, but afterwards the Hive CLI fails to start with
"Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient"
(full hive-site.xml and stack trace in the original post below).
Save hive table from spark in hive 2.1.1

2017-12-09 Thread konu
I am using Spark 2.2 with Scala, Hive 2.1.1, and Zeppelin on Ubuntu 16.04.
In addition, I copied hive-site.xml to spark/conf/ and mysql-connector-java.jar
from hive/libs to spark/jars.

I want to save a DataFrame as a Hive table, and I am doing this with:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
df.registerTempTable("myTempTable")
hc.sql("create table store_sales as select * from myTempTable")
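
For comparison, a minimal sketch of the same write going through SparkSession, which
supersedes HiveContext in Spark 2.x (shown in PySpark; the DataFrame here is a
placeholder):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("save_hive_table")
         .enableHiveSupport()   # reads hive-site.xml from spark/conf
         .getOrCreate())

df = spark.range(10)            # placeholder for the real DataFrame
df.createOrReplaceTempView("myTempTable")
spark.sql("create table store_sales as select * from myTempTable")

# or, without SQL:
# df.write.saveAsTable("store_sales")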

Afterwards, in my notebook, I run this:

%hive
show tables;

I can see that my new Hive table store_sales was created, but I can no longer run
the hive CLI after this.

This is my hive-site.xml



<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?useSSL=false</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password for connecting to mysql server</description>
  </property>
  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
    <description>set hive on spark</description>
  </property>
</configuration>



When I run hive:

root@alex-bi:/usr/local/hive/bin# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/usr/local/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type
[org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in
jar:file:/usr/local/hive/lib/hive-common-2.1.1.jar!/hive-log4j2.properties
Async: true
Exception in thread "main" java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException:
Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:591)
        at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:531)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:705)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException:
Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
        at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:226)
        at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:366)
        at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:310)
        at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:290)
        at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:266)
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:558)
        ... 9 more
Caused by: java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1654)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:80)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:130)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:101)
        at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3367)
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3406)
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3386)
        at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3640)
        at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:236)
        at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:221)
        ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native

Spark + AI Summit CfP Open

2017-12-09 Thread Jules Damji

Fellow Sparkers!

The CfP for the renamed and expanded summit is now open. If you have an idea, and an
implementation of that idea, that you want to share with the Apache Spark and AI
community, please submit now.

https://databricks.com/blog/2017/12/06/spark-summit-is-becoming-the-spark-ai-summit.html

https://databricks.com/sparkaisummit/north-america

Thank you.

Cheers,

Jules 
Spark Community Evangelist & Program Co-Chair

—
Sent from my iPhone
Pardon the dumb thumb typos :)

Structured Streaming + Kafka 0.10. connectors + valueDecoder and messageHandler with python

2017-12-09 Thread salemi
Hi All,

We are currently using direct streams to get data from a Kafka topic, as follows:

KafkaUtils.createDirectStream(ssc=self.streaming_context,
                              topics=topics,
                              kafkaParams=kafka_params,
                              valueDecoder=message_decoder,
                              messageHandler=message_handler)

We would like to switch to the Structured Streaming approach, such as:

self.spark_session \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafkaServers) \
    .option("subscribe", self.topic_id) \
    .option("auto.offset.reset", self.msgoffset) \
    .load()

I was wondering how I can apply the existing message_decoder and
message_handler functions to the message stream?


Thank you,

Ali
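
One possible pattern, sketched here with hypothetical stand-ins (decode_value and
handle_message) for the existing functions: in Structured Streaming the Kafka value
arrives as raw bytes in the value column, so the work previously done by
valueDecoder and messageHandler can be expressed as UDFs or column expressions
applied to that column.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("kafka_structured").getOrCreate()

# Hypothetical replacements for the existing valueDecoder / messageHandler.
def decode_value(raw_bytes):
    return raw_bytes.decode("utf-8") if raw_bytes is not None else None

def handle_message(decoded):
    # whatever per-message transformation messageHandler performed
    return decoded.upper() if decoded is not None else None

decode_udf = udf(decode_value, StringType())
handle_udf = udf(handle_message, StringType())

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "host1:9092")  # placeholder
          .option("subscribe", "my_topic")                   # placeholder
          .option("startingOffsets", "latest")  # replaces auto.offset.reset
          .load()
          # value is binary: decode it, then apply the handler
          .select(handle_udf(decode_udf(col("value"))).alias("message"),
                  col("topic"), col("partition"), col("offset")))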







Re: JDBC to hive batch use case in spark

2017-12-09 Thread ayan guha
Please enable Hive support on the Spark session if you are using Spark 2. If you
are on Spark 1, use HiveContext instead of SQLContext.

On Sun, 10 Dec 2017 at 12:20 am, 张万新  wrote:

> If you don't mind, I think it will help if you post your code
>
> Hokam Singh Chauhan wrote on Sat, Dec 9, 2017 at 8:02 PM:
>
>> Hi,
>> I have a use case in which I want to read data from a JDBC source (Oracle)
>> table and write it to a Hive table on a periodic basis. I tried this using the
>> SQLContext to read from Oracle and the HiveContext to write the data into Hive.
>> The read part works fine, but when I run the save call on the Hive context, it
>> throws an exception saying the table or view does not exist, even though the
>> table is pre-created in Hive.
>>
>> Please help if anyone has tried such a scenario.
>>
>> Thanks
>>
> --
Best Regards,
Ayan Guha
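
A minimal PySpark sketch of the suggested flow on Spark 2.x; the JDBC URL, table
names, and credentials below are placeholders:

from pyspark.sql import SparkSession

# enableHiveSupport() is what lets saveAsTable / spark.sql see the Hive metastore.
spark = (SparkSession.builder
         .appName("oracle_to_hive")
         .enableHiveSupport()
         .getOrCreate())

src = (spark.read.format("jdbc")
       .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  # placeholder
       .option("dbtable", "SCHEMA.SOURCE_TABLE")                  # placeholder
       .option("user", "user")
       .option("password", "password")
       .option("driver", "oracle.jdbc.driver.OracleDriver")
       .load())

# Append into the pre-created Hive table, or use saveAsTable to let Spark create it.
src.write.mode("append").insertInto("target_hive_table")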


Re: JDBC to hive batch use case in spark

2017-12-09 Thread 张万新
If you don't mind, I think it will help if you post your code

Hokam Singh Chauhan wrote on Sat, Dec 9, 2017 at 8:02 PM:

> Hi,
> I have a use case in which I want to read data from a JDBC source (Oracle)
> table and write it to a Hive table on a periodic basis. I tried this using the
> SQLContext to read from Oracle and the HiveContext to write the data into Hive.
> The read part works fine, but when I run the save call on the Hive context, it
> throws an exception saying the table or view does not exist, even though the
> table is pre-created in Hive.
>
> Please help if anyone has tried such a scenario.
>
> Thanks
>


JDBC to hive batch use case in spark

2017-12-09 Thread Hokam Singh Chauhan
Hi,
I have a use case in which I want to read data from a JDBC source (Oracle)
table and write it to a Hive table on a periodic basis. I tried this using the
SQLContext to read from Oracle and the HiveContext to write the data into Hive.
The read part works fine, but when I run the save call on the Hive context, it
throws an exception saying the table or view does not exist, even though the
table is pre-created in Hive.

Please help if anyone has tried such a scenario.

Thanks


Re: ML Transformer: create feature that uses multiple columns

2017-12-09 Thread Filipp Zhinkin
Hi,

You can combine multiple columns using org.apache.spark.sql.functions.struct and
invoke a UDF on the resulting column. In that case your UDF has to accept a Row as
its argument.

See VectorAssembler's sources for an example:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L109

Regards,
Filipp.

On Sat, Dec 9, 2017 at 2:41 PM, davideanastasia wrote:
> Hi,
> I am trying to write a custom ml.Transformer. It's a very simple row-by-row
> transformation, but it takes into account multiple columns of the DataFrame
> (and sometimes, interactions between columns).
>
> I was wondering what the best way to achieve this is. I have used a UDF in
> the Transformer before, but that only allows me to use one column (am I
> right?). How can I use multiple columns?
>
> Thanks,
> D.
>
>
>
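
A minimal PySpark sketch of the struct-plus-UDF approach described above; the
column names and the combining logic are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, struct, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("multi_column_feature").getOrCreate()

df = spark.createDataFrame([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], ["a", "b", "c"])

# When the input column is a struct, the UDF receives a single Row,
# so it can combine several columns (including interactions).
def combine(row):
    return row["a"] * row["b"] + row["c"]

combine_udf = udf(combine, DoubleType())

features = df.withColumn("feature", combine_udf(struct(col("a"), col("b"), col("c"))))
features.show()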




ML Transformer: create feature that uses multiple columns

2017-12-09 Thread davideanastasia
Hi,
I am trying to write a custom ml.Transformer. It's a very simple row-by-row
transformation, but it takes in account multiple columns of the DataFrame
(and sometimes, interaction between columns).

I was wondering what the best way to achieve this is. I have used a udf in
the Transformer before, but that only allows me to use one column (am I
right?). How can I use multiple columns?

Thanks,
D.






Re: RDD[internalRow] -> DataSet

2017-12-09 Thread Jacek Laskowski
Hi Satyajit,

That's exactly what Dataset.rdd does -->
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala?utf8=%E2%9C%93#L2916-L2921

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Fri, Dec 8, 2017 at 5:25 AM, satyajit vegesna  wrote:

> Hi All,
>
> Is there a way to convert RDD[InternalRow] to a Dataset from outside the
> spark.sql package?
>
> Regards,
> Satyajit.
>