RE: Schema Evolution Parquet vs Avro

2017-05-29 Thread Venkata ramana gollamudi
Hi Sam,

You can consider checking the CarbonData
format (https://github.com/apache/carbondata). It supports column removal
and datatype change of an existing column. For column rename, you can raise
an issue to request support.

Regards,
Ramana

From: Joel D [games2013@gmail.com]
Sent: Tuesday, May 30, 2017 7:34 AM
To: user@spark.apache.org
Subject: Schema Evolution Parquet vs Avro

Hi,

We are trying to come up with the best storage format for handling schema 
changes in ingested data.

We noticed that both Avro and Parquet allow one to select by column name
instead of by data index/position. However, we are inclined towards Parquet
for better read performance, since it is columnar and we will be selecting a
few columns instead of all. Data will be processed and saved to partitions
on which we will have Hive external tables.

Will Parquet be able to handle the following:
- Renaming of an existing column
- Removal of a column from the middle of the schema
- Datatype change of an existing column (int to bigint should be allowed, right?)

Please advise.

Thanks,
Sam


Re: [Spark Streaming] DAG Output Processing mechanism

2017-05-29 Thread Nipun Arora
Sending out the message again... Hopefully someone can clarify :)

I would like some clarification on the execution model for Spark Streaming.

Broadly, I am trying to understand if output operations in a DAG are only
processed after all intermediate operations are finished for all parts of
the DAG. 

Let me give an example: 

I have a DStream A. I do map operations on this DStream and create two
different DStreams, B and C, such that

A ---> B ---> (some operations) ---> Kafka output 1
  \--> C ---> (some operations) ---> Kafka output 2

I want to understand whether Kafka output 1 and Kafka output 2 will wait for
all operations to finish on both B and C before sending an output, or whether
each will send its output as soon as the ops on its own branch are done.

What kind of synchronization guarantees are there?
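
For concreteness, here is a minimal sketch (untested) of the topology above,
assuming a spark-shell SparkContext named sc, a socket source standing in for
A, and a hypothetical sendToKafka helper in place of a real Kafka producer:

// Minimal sketch of the two-branch topology (illustrative only).
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical stand-in for a real Kafka producer.
def sendToKafka(topic: String, rec: String): Unit = println(s"[$topic] $rec")

val ssc = new StreamingContext(sc, Seconds(10))
val a = ssc.socketTextStream("localhost", 9999)  // DStream A (assumed source)

val b = a.map(_.toUpperCase)                     // branch B
val c = a.map(_.reverse)                         // branch C

// Each output operation below becomes a separate job per batch. By default
// (spark.streaming.concurrentJobs = 1) the JobScheduler submits a batch's
// jobs sequentially, in the order the output operations were defined, so
// output 2 starts only after output 1's job completes for that batch.
b.foreachRDD(rdd => rdd.foreach(rec => sendToKafka("kafka-output-1", rec)))
c.foreachRDD(rdd => rdd.foreach(rec => sendToKafka("kafka-output-2", rec)))

ssc.start()
ssc.awaitTermination()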



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-DAG-Output-Processing-mechanism-tp28713p28715.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Schema Evolution Parquet vs Avro

2017-05-29 Thread Joel D
Hi,

We are trying to come up with the best storage format for handling schema
changes in ingested data.

We noticed that both Avro and Parquet allow one to select by column name
instead of by data index/position. However, we are inclined towards Parquet
for better read performance, since it is columnar and we will be selecting a
few columns instead of all. Data will be processed and saved to partitions
on which we will have Hive external tables.

Will Parquet be able to handle the following (see the sketch after this list):
- Renaming of an existing column
- Removal of a column from the middle of the schema
- Datatype change of an existing column (int to bigint should be allowed,
right?)
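
As a reference point, a minimal sketch of Spark's Parquet schema merging
(assuming a Spark 2.x SparkSession named spark; the path and column names are
hypothetical). Merging unions compatible schemas across files, but a renamed
column surfaces as two distinct columns, and conflicting types such as int vs
bigint make the merge fail:

// Minimal sketch (hypothetical path/columns): union compatible Parquet
// schemas across files written at different times.
val df = spark.read
  .option("mergeSchema", "true")   // schema merging is off by default
  .parquet("/data/ingested/")

df.printSchema()                   // union of all file schemas
df.select("id", "amount").show()   // select by name, independent of position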

Please advise.

Thanks,
Sam


Re: Dynamically working out upperbound in JDBC connection to Oracle DB

2017-05-29 Thread Don Drake
Try passing maxID.toString; I think it wants the number as a string.
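
For clarity, a sketch of the call with that change applied (untested). One
extra assumption here: maxID is a java.math.BigDecimal rendering as "1.00",
and the JDBC reader parses the bound with toLong (see the stack trace in the
quoted thread below), so longValue is used to strip the decimal part:

val s = HiveContext.read.format("jdbc").options(
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
      "partitionColumn" -> "ID",
      "lowerBound" -> "1",
      "upperBound" -> maxID.longValue.toString,  // a String value, not the quoted literal "maxID"
      "numPartitions" -> "4",
      "user" -> _username,
      "password" -> _password)).load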

On Mon, May 29, 2017 at 3:12 PM, Mich Talebzadeh 
wrote:

> thanks Gents but no luck!
>
> scala> val s = HiveContext.read.format("jdbc").options(
>  | Map("url" -> _ORACLEserver,
>  | "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED,
> RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>  | "partitionColumn" -> "ID",
>  | "lowerBound" -> "1",
>  | "upperBound" -> maxID,
>  | "numPartitions" -> "4",
>  |  "user" -> _username,
>  | "password" -> _password)).load
> :34: error: overloaded method value options with alternatives:
>   (options: java.util.Map[String,String])org.apache.spark.sql.DataFrameReader
>   (options: scala.collection.Map[String,String])org.apache.spark.sql.DataFrameReader
>  cannot be applied to (scala.collection.immutable.Map[String,Comparable[_ >: java.math.BigDecimal with String <: Comparable[_ >: java.math.BigDecimal with String <: java.io.Serializable] with java.io.Serializable] with java.io.Serializable])
>        val s = HiveContext.read.format("jdbc").options(
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 May 2017 at 20:12, ayan guha  wrote:
>
>> You are using maxId as a string literal. Try removing the quotes around
>> maxId
>>
>> On Tue, 30 May 2017 at 2:56 am, Jörn Franke  wrote:
>>
>>> I think you need to remove the hyphen around maxid
>>>
>>> On 29. May 2017, at 18:11, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi,
>>>
>>> This JDBC connection works with Oracle table with primary key ID
>>>
>>> val s = HiveContext.read.format("jdbc").options(
>>> Map("url" -> _ORACLEserver,
>>> "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED,
>>> RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>> "partitionColumn" -> "ID",
>>>
>>> *"lowerBound" -> "1","upperBound" -> "1",*
>>> "numPartitions" -> "4",
>>> "user" -> _username,
>>> "password" -> _password)).load
>>>
>>> Note that both lowerbound and upperbound for ID column are fixed.
>>>
>>> However, I tried to work out the upper bound dynamically as follows:
>>>
>>> //
>>> // Get maxID first
>>> //
>>> scala> val maxID = HiveContext.read.format("jdbc").options(Map("url" ->
>>> _ORACLEserver,"dbtable" -> "(SELECT MAX(ID) AS maxID FROM
>>> scratchpad.dummy)",
>>>  | "user" -> _username, "password" -> _password)).load().collect.app
>>> ly(0).getDecimal(0)
>>> maxID: java.math.BigDecimal = 1.00
>>>
>>> and this fails
>>>
>>> scala> val s = HiveContext.read.format("jdbc").options(
>>>  | Map("url" -> _ORACLEserver,
>>>  | "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED,
>>> RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>>  | "partitionColumn" -> "ID",
>>>
>>>
>>> *| "lowerBound" -> "1", | "upperBound" -> "maxID",* |
>>> "numPartitions" -> "4",
>>>  | "user" -> _username,
>>>  | "password" -> _password)).load
>>> java.lang.NumberFormatException: For input string: "maxID"
>>>   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>   at java.lang.Long.parseLong(Long.java:589)
>>>   at java.lang.Long.parseLong(Long.java:631)
>>>   at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>>>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>>>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:42)
>>>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>>>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>>>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>>>   ... 56 elided
>>>
>>>
>>> Any ideas how this can work?
>>>
>>> Thanks
>>>
>>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143


Re: Dynamically working out upperbound in JDBC connection to Oracle DB

2017-05-29 Thread Mich Talebzadeh
thanks Gents but no luck!

scala> val s = HiveContext.read.format("jdbc").options(
 | Map("url" -> _ORACLEserver,
 | "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED,
RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
 | "partitionColumn" -> "ID",
 | "lowerBound" -> "1",
 | "upperBound" -> maxID,
 | "numPartitions" -> "4",
 |  "user" -> _username,
 | "password" -> _password)).load
:34: error: overloaded method value options with alternatives:
  (options: java.util.Map[String,String])org.apache.spark.sql.DataFrameReader
  (options: scala.collection.Map[String,String])org.apache.spark.sql.DataFrameReader
 cannot be applied to (scala.collection.immutable.Map[String,Comparable[_ >: java.math.BigDecimal with String <: Comparable[_ >: java.math.BigDecimal with String <: java.io.Serializable] with java.io.Serializable] with java.io.Serializable])
       val s = HiveContext.read.format("jdbc").options(



Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 May 2017 at 20:12, ayan guha  wrote:

> You are using maxId as a string literal. Try removing the quotes around
> maxId
>
> On Tue, 30 May 2017 at 2:56 am, Jörn Franke  wrote:
>
>> I think you need to remove the hyphen around maxid
>>
>> On 29. May 2017, at 18:11, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> This JDBC connection works with Oracle table with primary key ID
>>
>> val s = HiveContext.read.format("jdbc").options(
>> Map("url" -> _ORACLEserver,
>> "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED,
>> RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>> "partitionColumn" -> "ID",
>>
>> *"lowerBound" -> "1","upperBound" -> "1",*
>> "numPartitions" -> "4",
>> "user" -> _username,
>> "password" -> _password)).load
>>
>> Note that both lowerbound and upperbound for ID column are fixed.
>>
>> However, I tried to work out the upper bound dynamically as follows:
>>
>> //
>> // Get maxID first
>> //
>> scala> val maxID = HiveContext.read.format("jdbc").options(Map("url" ->
>> _ORACLEserver,"dbtable" -> "(SELECT MAX(ID) AS maxID FROM
>> scratchpad.dummy)",
>>  | "user" -> _username, "password" -> _password)).load().collect.
>> apply(0).getDecimal(0)
>> maxID: java.math.BigDecimal = 1.00
>>
>> and this fails
>>
>> scala> val s = HiveContext.read.format("jdbc").options(
>>  | Map("url" -> _ORACLEserver,
>>  | "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED,
>> RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>  | "partitionColumn" -> "ID",
>>
>>
>> *| "lowerBound" -> "1", | "upperBound" -> "maxID",* |
>> "numPartitions" -> "4",
>>  | "user" -> _username,
>>  | "password" -> _password)).load
>> java.lang.NumberFormatException: For input string: "maxID"
>>   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>   at java.lang.Long.parseLong(Long.java:589)
>>   at java.lang.Long.parseLong(Long.java:631)
>>   at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:42)
>>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>>   ... 56 elided
>>
>>
>> Any ideas how this can work?
>>
>> Thanks
>>
>> --
> Best Regards,
> Ayan Guha
>


Re: Dynamically working out upperbound in JDBC connection to Oracle DB

2017-05-29 Thread ayan guha
You are using maxId as a string literal. Try removing the quotes around
maxId

On Tue, 30 May 2017 at 2:56 am, Jörn Franke  wrote:

> I think you need to remove the hyphen around maxid
>
> On 29. May 2017, at 18:11, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> This JDBC connection works with Oracle table with primary key ID
>
> val s = HiveContext.read.format("jdbc").options(
> Map("url" -> _ORACLEserver,
> "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED, RANDOM_STRING,
> SMALL_VC, PADDING FROM scratchpad.dummy)",
> "partitionColumn" -> "ID",
>
> *"lowerBound" -> "1","upperBound" -> "1",*
> "numPartitions" -> "4",
> "user" -> _username,
> "password" -> _password)).load
>
> Note that both lowerbound and upperbound for ID column are fixed.
>
> However, I tried to work out the upper bound dynamically as follows:
>
> //
> // Get maxID first
> //
> scala> val maxID = HiveContext.read.format("jdbc").options(Map("url" ->
> _ORACLEserver,"dbtable" -> "(SELECT MAX(ID) AS maxID FROM
> scratchpad.dummy)",
>  | "user" -> _username, "password" ->
> _password)).load().collect.apply(0).getDecimal(0)
> maxID: java.math.BigDecimal = 1.00
>
> and this fails
>
> scala> val s = HiveContext.read.format("jdbc").options(
>  | Map("url" -> _ORACLEserver,
>  | "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED,
> RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>  | "partitionColumn" -> "ID",
>
>
> *| "lowerBound" -> "1", | "upperBound" -> "maxID",* |
> "numPartitions" -> "4",
>  | "user" -> _username,
>  | "password" -> _password)).load
> java.lang.NumberFormatException: For input string: "maxID"
>   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:589)
>   at java.lang.Long.parseLong(Long.java:631)
>   at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:42)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 56 elided
>
>
> Any ideas how this can work?
>
> Thanks
>
> --
Best Regards,
Ayan Guha


Re: Dynamically working out upperbound in JDBC connection to Oracle DB

2017-05-29 Thread Jörn Franke
I think you need to remove the hyphen around maxid 

> On 29. May 2017, at 18:11, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> This JDBC connection works with Oracle table with primary key ID
> 
> val s = HiveContext.read.format("jdbc").options(
> Map("url" -> _ORACLEserver,
> "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED, RANDOM_STRING, 
> SMALL_VC, PADDING FROM scratchpad.dummy)",
> "partitionColumn" -> "ID",
> "lowerBound" -> "1",
> "upperBound" -> "1",
> "numPartitions" -> "4",
> "user" -> _username,
> "password" -> _password)).load
> 
> Note that both lowerbound and upperbound for ID column are fixed.
> 
> However, I tried to work out the upper bound dynamically as follows:
> 
> //
> // Get maxID first
> //
> scala> val maxID = HiveContext.read.format("jdbc").options(Map("url" -> 
> _ORACLEserver,"dbtable" -> "(SELECT MAX(ID) AS maxID FROM scratchpad.dummy)",
>  | "user" -> _username, "password" -> 
> _password)).load().collect.apply(0).getDecimal(0)
> maxID: java.math.BigDecimal = 1.00
> 
> and this fails
> 
> scala> val s = HiveContext.read.format("jdbc").options(
>  | Map("url" -> _ORACLEserver,
>  | "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED, 
> RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>  | "partitionColumn" -> "ID",
>  | "lowerBound" -> "1",
>  | "upperBound" -> "maxID",
>  | "numPartitions" -> "4",
>  | "user" -> _username,
>  | "password" -> _password)).load
> java.lang.NumberFormatException: For input string: "maxID"
>   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:589)
>   at java.lang.Long.parseLong(Long.java:631)
>   at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:42)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 56 elided
> 
> 
> Any ideas how this can work?
> 
> Thanks
> 


Dynamically working out upperbound in JDBC connection to Oracle DB

2017-05-29 Thread Mich Talebzadeh
Hi,

This JDBC connection works with Oracle table with primary key ID

val s = HiveContext.read.format("jdbc").options(
Map("url" -> _ORACLEserver,
"dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED, RANDOM_STRING,
SMALL_VC, PADDING FROM scratchpad.dummy)",
"partitionColumn" -> "ID",

*"lowerBound" -> "1","upperBound" -> "1",*
"numPartitions" -> "4",
"user" -> _username,
"password" -> _password)).load

Note that both lowerbound and upperbound for ID column are fixed.

However, I tried to work out the upper bound dynamically as follows:

//
// Get maxID first
//
scala> val maxID = HiveContext.read.format("jdbc").options(Map("url" ->
_ORACLEserver,"dbtable" -> "(SELECT MAX(ID) AS maxID FROM
scratchpad.dummy)",
 | "user" -> _username, "password" -> _password)).load().collect.
apply(0).getDecimal(0)
maxID: java.math.BigDecimal = 1.00

and this fails

scala> val s = HiveContext.read.format("jdbc").options(
 | Map("url" -> _ORACLEserver,
 | "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED,
RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
 | "partitionColumn" -> "ID",


*| "lowerBound" -> "1", | "upperBound" -> "maxID",* |
"numPartitions" -> "4",
 | "user" -> _username,
 | "password" -> _password)).load
java.lang.NumberFormatException: For input string: "maxID"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Long.parseLong(Long.java:589)
  at java.lang.Long.parseLong(Long.java:631)
  at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
  at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:42)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 56 elided


Any ideas how this can work?

Thanks


RE: Disable queuing of spark job on Mesos cluster if sufficient resources are not found

2017-05-29 Thread Mevada, Vatsal
Is there any configurable timeout that controls queuing of the driver in Mesos
cluster mode, or will the driver remain in the queue indefinitely until it
finds resources on the cluster?

From: Michael Gummelt [mailto:mgumm...@mesosphere.io]
Sent: Friday, May 26, 2017 11:33 PM
To: Mevada, Vatsal 
Cc: user@spark.apache.org
Subject: Re: Disable queuing of spark job on Mesos cluster if sufficient 
resources are not found

Nope, sorry.

On Fri, May 26, 2017 at 4:38 AM, Mevada, Vatsal 
> wrote:

Hello,

I am using Mesos with cluster deployment mode to submit my jobs.

When sufficient resources are not available on the Mesos cluster, I can see
that my jobs are queuing up in the Mesos dispatcher UI.

Is it possible to tweak some configuration so that my job submission fails
gracefully (instead of queuing up) if sufficient resources are not found on
the Mesos cluster?
Regards,
Vatsal



--
Michael Gummelt
Software Engineer
Mesosphere