Re: [Spark Kafka Structured Streaming] Adding partition and topic to the kafka dynamically

2020-08-27 Thread Amit Joshi
Any pointers will be appreciated.



[Spark Kafka Structured Streaming] Adding partition and topic to the kafka dynamically

2020-08-27 Thread Amit Joshi
Hi All,

I am trying to understand the effect of adding topics, and of adding
partitions to an existing topic, in Kafka when the topics are being consumed
by Spark Structured Streaming applications.

Do we have to restart the Spark Structured Streaming application to read
from a newly added topic?
Do we have to restart the Spark Structured Streaming application to read
from a newly added partition of a topic?

Kafka consumers have a metadata refresh property that works without
restarting.

Thanks in advance.

Regards
Amit Joshi
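For reference, a hedged sketch of the relevant knob. As far as I know, Spark's Kafka source picks up newly added partitions of already-subscribed topics automatically on metadata refresh, while picking up an entirely new topic without a restart generally needs the subscribePattern option (a regex over topic names). The broker address below is an illustrative assumption; in Structured Streaming these consumer properties are passed through with a "kafka." prefix, e.g. .option("kafka.metadata.max.age.ms", "60000").

```java
import java.util.Properties;

public class KafkaMetadataRefresh {
    // Consumer-side properties; in Spark Structured Streaming these would be
    // set as options with a "kafka." prefix on the readStream builder.
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
        // Force a metadata refresh at most every 60 s, so newly added
        // partitions are discovered (the Kafka default is 300000 ms).
        props.setProperty("metadata.max.age.ms", "60000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps().getProperty("metadata.max.age.ms"));
    }
}
```

With a fixed subscribe list, new partitions of the listed topics should appear after a refresh; a brand-new topic is only picked up if it matches a subscribePattern (or after a restart with the topic added to the list).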


Re: Connecting to Oracle Autonomous Data warehouse (ADW) from Spark via JDBC

2020-08-27 Thread kuassi . mensah

Mich,

That's right, referring to you guys.

Cheers, Kuassi

On 8/27/20 9:27 AM, Mich Talebzadeh wrote:

Thanks Kuassi,

I presume you mean Spark DEV team by "they are using ... "

cheers,

Mich



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw




*Disclaimer:* Use it at your own risk. Any and all responsibility for
any loss, damage or destruction of data or any other property which
may arise from relying on this email's technical content is explicitly
disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.





Re: Connecting to Oracle Autonomous Data warehouse (ADW) from Spark via JDBC

2020-08-27 Thread Mich Talebzadeh
Thanks Kuassi,

I presume you mean Spark DEV team by "they are using ... "

cheers,

Mich








Re: Connecting to Oracle Autonomous Data warehouse (ADW) from Spark via JDBC

2020-08-27 Thread kuassi . mensah

According to our dev team:

From the error it is evident that they are using a JDBC jar which does
not support setting tns_admin in the URL.
They might have some old jar on the class-path which is being used instead
of the 18.3 jar.
You can ask them to use either the full URL, or the tns alias format URL
with the tns_admin path set as either a connection property or a system property.

Regards, Kuassi
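To make that advice concrete, here is a minimal sketch (mine, not the dev team's code) of the two suggested forms. The alias mydb_high and the wallet path are taken from the thread; the class and method names are invented for illustration, and nothing here actually opens a connection.

```java
public class AdwUrlSketch {
    // Form 1: tns alias URL, with the wallet directory supplied via the
    // oracle.net.tns_admin system property instead of inside the URL.
    // This form should also work with older driver jars.
    public static String tnsAliasUrl() {
        System.setProperty("oracle.net.tns_admin", "/home/hduser/dba/bin/ADW/DBAccess");
        return "jdbc:oracle:thin:@mydb_high";
    }

    // Form 2: TNS_ADMIN embedded in the URL; this requires a driver jar that
    // supports it (18.3+), which is exactly what the error suggests was missing.
    public static String urlWithTnsAdmin() {
        return "jdbc:oracle:thin:@mydb_high?TNS_ADMIN=/home/hduser/dba/bin/ADW/DBAccess";
    }
}
```

If an old ojdbc jar shadows the 18.3 one on the class-path, form 1 is the safer route, since the URL itself stays in the classic alias shape.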

On 8/26/20 2:11 PM, Mich Talebzadeh wrote:
And this is a test using the Oracle-supplied Java program
DataSourceSample.java, with a slight amendment for login/password
and table. It connects OK.


hduser@rhes76: /home/hduser/dba/bin/ADW/src> javac -classpath 
./ojdbc8.jar:. DataSourceSample.java
hduser@rhes76: /home/hduser/dba/bin/ADW/src> java -classpath 
./ojdbc8.jar:. DataSourceSample

AArray = [B@57d5872c
AArray = [B@667a738
AArray = [B@2145433b
Driver Name: Oracle JDBC driver
Driver Version: 18.3.0.0.0
Default Row Prefetch Value is: 20
Database Username is: SCRATCHPAD

DATETAKEN WEIGHT
-
2017-09-07 07:22:09 74.7
2017-09-08 07:26:18 74.8
2017-09-09 07:15:53 75
2017-09-10 07:53:30 75.9
2017-09-11 07:21:49 75.8
2017-09-12 07:31:27 75.6
2017-09-26 07:11:26 75.4
2017-09-27 07:22:48 75.6
2017-09-28 07:15:52 75.4
2017-09-29 07:30:40 74.9


Regards,






On Wed, 26 Aug 2020 at 21:58, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:


Hi Kuassi,

This is the error. It is only a test running in local mode.

scala> val driverName = "oracle.jdbc.OracleDriver"
driverName: String = oracle.jdbc.OracleDriver

scala> var url =
"jdbc:oracle:thin:@mydb_high?TNS_ADMIN=/home/hduser/dba/bin/ADW/DBAccess"
url: String =
jdbc:oracle:thin:@mydb_high?TNS_ADMIN=/home/hduser/dba/bin/ADW/DBAccess
scala> var _username = "scratchpad"
_username: String = scratchpad
scala> var _password = "xx" -- no special characters
_password: String = xxx
scala> var _dbschema = "SCRATCHPAD"
_dbschema: String = SCRATCHPAD
scala> var _dbtable = "LL_18201960"
_dbtable: String = LL_18201960
scala> var e:SQLException = null
e: java.sql.SQLException = null
scala> var connection:Connection = null
connection: java.sql.Connection = null
scala> var metadata:DatabaseMetaData = null
metadata: java.sql.DatabaseMetaData = null
scala> val prop = new java.util.Properties
prop: java.util.Properties = {}
scala> prop.setProperty("user", _username)
res1: Object = null
scala> prop.setProperty("password",_password)
res2: Object = null
scala> // Check Oracle is accessible
scala> try {
     |   connection = DriverManager.getConnection(url, _username, _password)
     | } catch {
     |   case e: SQLException => e.printStackTrace
     |   connection.close()
     | }
java.sql.SQLRecoverableException: IO Error: Invalid connection
string format, a valid format is: "host:port:sid"
        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:489)
        at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:553)
        at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:254)
        at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)
        at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:528)
        at java.sql.DriverManager.getConnection(DriverManager.java:664)

Is this related to Oracle or Spark? Do I need to set up another
connection parameter etc?

Cheers





On Wed, 26 Aug 2020 at 21:09, <kuassi.men...@oracle.com> wrote:

Mich,

All looks fine.
Perhaps some special chars in username or password?


It is recommended not to use characters such as '@' or '.' in
your password.

Best, Kuassi
On 8/26/20 12:52 PM, Mich Talebzadeh wrote:

Thanks Kuassi.

This is the version of the jar file that works OK with the JDBC
connection via Java to

Some sort of chaos monkey for spark jobs, do we have it?

2020-08-27 Thread Ivan Petrov
Hi, I'm feeling pain while trying to insert 2-3 million records into
Mongo using plain Spark RDDs. There were so many hidden problems.

I would like to avoid this in the future, and I'm looking for a way to kill
individual Spark tasks at a specific stage and verify the expected behaviour
of my Spark job.

Ideal setup:
1. write the Spark job
2. run the Spark job on YARN
3. run a tool that kills a certain % or number of tasks at a specific stage
4. verify the results

Real-world scenario:
The Mongo Spark driver makes the very optimistic assumption that inserts
never fail. I've enabled ordered=false for the driver to ignore
duplicate-record insertions.
It kind of worked, until I met speculative execution.
- A task failed once because of duplicates. That's expected; another task
had uploaded the same data.
- Then Spark killed the same task twice during speculative execution.
- The whole job failed, since there were 3 failures for that task
and spark.task.maxFailures=4.

I didn't get three kills in the dev cluster during 100+ runs, but got it in
production :) The production cluster is a bit noisy.
Such a chaos monkey would help me tune my job configuration for production
using the dev cluster.
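I'm not aware of an off-the-shelf chaos monkey for individual Spark tasks, but a poor man's version can be wired into the job itself: a guard called at the top of a mapPartitions on the stage under test, which throws when it decides to kill. The helper below is a sketch with invented names, not an existing tool; in a real Spark job you would feed it TaskContext.get().attemptNumber() and a kill fraction, and throw a RuntimeException when it returns true.

```java
import java.util.Random;

public class ChaosInjector {
    // Decide whether to inject a failure for one task attempt. Killing only
    // first attempts (attemptNumber == 0) keeps the chaos bounded: retried
    // attempts always pass, so only the interaction with other failures
    // (e.g. speculative kills) can exhaust spark.task.maxFailures.
    public static boolean shouldKill(int attemptNumber, double failFraction, Random rng) {
        return attemptNumber == 0 && rng.nextDouble() < failFraction;
    }
}
```

The kill fraction can then be raised in the dev cluster until the dev runs reproduce the failure rate that production's noise delivers for free.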


Export subset of Oracle database

2020-08-27 Thread pduflot
Dear Spark users,

 

I am trying to figure out whether Spark is a good tool for my use case.

I'm trying to ETL a subset of a customers/orders database from Oracle to
JSON, roughly 3-5% of the overall customers table.

 

I tried to use the Spark JDBC data source, but it ends up fetching the
entire customers and orders tables into one executor. I read about the
partitionColumn, lowerBound and upperBound options. Could they be used
somehow to distribute the load across a set of executors, while also
filtering out at the source the customers that are not part of my subset?
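They can, with one caveat: as I understand the JDBC source, lowerBound and upperBound only control how the partitionColumn range is striped across partitions; they do not filter rows. The subset filter itself can be pushed down to Oracle by passing a subquery (with an alias) as dbtable. A sketch of the option map, where the URL, table and column names are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;

public class JdbcSubsetReadOptions {
    // Options for spark.read.format("jdbc"): the WHERE clause runs inside
    // Oracle via the dbtable subquery, and the read is split into
    // `partitions` parallel queries striped on customer_id.
    public static Map<String, String> build(long minId, long maxId, int partitions) {
        Map<String, String> opts = new HashMap<>();
        opts.put("url", "jdbc:oracle:thin:@//dbhost:1521/service"); // assumed
        // Subquery alias is required; only the ~3-5% subset ever leaves Oracle.
        opts.put("dbtable", "(SELECT * FROM customers WHERE segment = 'EXPORT') c");
        opts.put("partitionColumn", "customer_id"); // numeric, ideally indexed
        opts.put("lowerBound", Long.toString(minId));
        opts.put("upperBound", Long.toString(maxId));
        opts.put("numPartitions", Integer.toString(partitions));
        return opts;
    }
}
```

Each executor then runs its own customer_id-range slice of the subquery. Simple filters applied with .where() on the resulting DataFrame are also typically pushed down by Spark, so that is another route to the same effect.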

 

Or would it be better to parallelize the subset of customers to export, and
have a map operation that queries the Oracle database to transform each
customer ID into a JSON object containing the customer and order details?

 

Or is Spark not suitable for this kind of process?

 

Just asking for guidance, so as not to lose too much time going in the
wrong direction. Thanks for your help!

 

Best,

 

Patrick