Re: Having multiple spark context

2017-01-29 Thread vincent gromakowski
A clustering library is necessary to manage multiple JVMs; Akka Cluster, for
instance.
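
A minimal sketch of why a second context never appears inside one JVM: SparkSession.builder.getOrCreate() (Spark 2.x) hands back the already-running session instead of building a new one, so the local-mode master setting does not take effect. The cluster URL below is a placeholder.

import org.apache.spark.sql.SparkSession

val distributed = SparkSession.builder()
  .appName("app-distributed")
  .master("spark://master:7077")   // placeholder cluster URL
  .getOrCreate()

val local = SparkSession.builder()
  .appName("app-local")
  .master("local[*]")
  .getOrCreate()                   // returns the existing session, not a new local one

// Both handles share the same underlying SparkContext:
println(distributed.sparkContext eq local.sparkContext)   // prints true

A genuinely separate context therefore needs a separate JVM (a second driver process), coordinated by whatever clustering layer you choose.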

On 30 Jan 2017 8:01 AM, "Rohit Verma"  wrote:

> Hi,
>
> If I am right, you need to launch the other context from another JVM. If you
> try to launch another context from the same JVM, it will return the existing
> context.
>
> Rohit
>
> On Jan 30, 2017, at 12:24 PM, Mark Hamstra 
> wrote:
>
> More than one Spark Context in a single Application is not supported.
>
> On Sun, Jan 29, 2017 at 9:08 PM,  wrote:
>
>> Hi,
>>
>>
>>
>> I have a requirement in which my application creates one Spark context in
>> distributed mode and another Spark context in local mode.
>>
>> When I do this, my complete application works on only one SparkContext (the
>> one created in distributed mode); the second Spark context is not created.
>>
>> Can you please help me with how to create two Spark contexts?
>>
>>
>>
>> Regards,
>>
>> Jasbir singh
>>
>>
>
>
>


Re: Having multiple spark context

2017-01-29 Thread Rohit Verma
Hi,

If I am right, you need to launch the other context from another JVM. If you try
to launch another context from the same JVM, it will return the existing
context.

Rohit
On Jan 30, 2017, at 12:24 PM, Mark Hamstra 
> wrote:

More than one Spark Context in a single Application is not supported.

On Sun, Jan 29, 2017 at 9:08 PM, 
> wrote:
Hi,

I have a requirement in which my application creates one Spark context in
distributed mode and another Spark context in local mode.
When I do this, my complete application works on only one SparkContext (the one
created in distributed mode); the second Spark context is not created.

Can you please help me with how to create two Spark contexts?

Regards,
Jasbir singh







Re: Having multiple spark context

2017-01-29 Thread Mark Hamstra
More than one Spark Context in a single Application is not supported.

On Sun, Jan 29, 2017 at 9:08 PM,  wrote:

> Hi,
>
>
>
> I have a requirement in which my application creates one Spark context in
> distributed mode and another Spark context in local mode.
>
> When I do this, my complete application works on only one SparkContext (the
> one created in distributed mode); the second Spark context is not created.
>
> Can you please help me with how to create two Spark contexts?
>
>
>
> Regards,
>
> Jasbir singh
>
>


Failures on JavaSparkContext. - "Futures timed out after [10000 milliseconds]"

2017-01-29 Thread Gili Nachum
Hi,

I sometimes get these random init failures in test and prod. Is there a
scenario that could lead to these errors? For example: not enough cores? Driver
and worker not on the same LAN? etc.

Running Spark 1.5.1. Retrying solves it.

Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[10000 milliseconds]
at
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at akka.remote.Remoting.start(Remoting.scala:179)
at
akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:620)
at
akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:617)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:617)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:634)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:119)
at
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at
org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at
org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
at
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1913)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1904)
at
org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55)
at
org.apache.spark.rpc.akka.AkkaRpcEnvFactory.create(AkkaRpcEnv.scala:253)
at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:53)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:252)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
at
org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
at org.apache.spark.SparkContext.(SparkContext.scala:450)
at
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
at ...
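
Since retrying reportedly works around it, a bounded retry around context construction is one pragmatic mitigation. The sketch below is only a workaround under assumptions (attempt count, backoff and app name are made up); it does not address whatever makes the Akka remoting start-up exceed its timeout, and if a half-initialized context lingers in the JVM, retrying at the process level (re-submitting) may be the safer variant.

import org.apache.spark.{SparkConf, SparkContext}
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

@tailrec
def createContextWithRetry(conf: SparkConf,
                           attemptsLeft: Int,
                           backoffMs: Long = 5000L): SparkContext =
  Try(new SparkContext(conf)) match {
    case Success(sc) => sc
    case Failure(e) if attemptsLeft > 1 =>
      // Log the intermittent init failure and try again after a short pause.
      Console.err.println(s"SparkContext init failed (${e.getMessage}), retrying...")
      Thread.sleep(backoffMs)
      createContextWithRetry(conf, attemptsLeft - 1, backoffMs)
    case Failure(e) => throw e
  }

val sc = createContextWithRetry(new SparkConf().setAppName("init-retry"), attemptsLeft = 3)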


Re: Error Saving Dataframe to Hive with Spark 2.0.0

2017-01-29 Thread Chetan Khatri
Okay, you are saying that 2.0.0 doesn't have that patch fixed? @dev cc.
I don't like changing the service versions every time!

Thanks.

On Mon, Jan 30, 2017 at 1:10 AM, Jacek Laskowski  wrote:

> Hi,
>
> I think you have to upgrade to 2.1.0. There were a few changes w.r.t. the
> ERROR since then.
>
> Jacek
>
>
> On 29 Jan 2017 9:24 a.m., "Chetan Khatri" 
> wrote:
>
> Hello Spark Users,
>
> I am getting an error while saving a Spark DataFrame to a Hive table:
> Hive 1.2.1
> Spark 2.0.0
> Local environment.
> Note: the job executes successfully and does what I want, but the exception
> is still raised.
> *Source Code:*
>
> package com.chetan.poc.hbase
>
> /**
>   * Created by chetan on 24/1/17.
>   */
> import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration}
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> import org.apache.hadoop.hbase.util.Bytes
> import org.apache.hadoop.hbase.KeyValue.Type
> import org.apache.spark.sql.SparkSession
> import scala.collection.JavaConverters._
> import java.util.Date
> import java.text.SimpleDateFormat
>
>
> object IncrementalJob {
> val APP_NAME: String = "SparkHbaseJob"
> var HBASE_DB_HOST: String = null
> var HBASE_TABLE: String = null
> var HBASE_COLUMN_FAMILY: String = null
> var HIVE_DATA_WAREHOUSE: String = null
> var HIVE_TABLE_NAME: String = null
>   def main(args: Array[String]) {
> // Initializing HBASE Configuration variables
> HBASE_DB_HOST="127.0.0.1"
> HBASE_TABLE="university"
> HBASE_COLUMN_FAMILY="emp"
> // Initializing Hive Metastore configuration
> HIVE_DATA_WAREHOUSE = "/usr/local/hive/warehouse"
> // Initializing Hive table name - Target table
> HIVE_TABLE_NAME = "employees"
> // setting spark application
> // val sparkConf = new SparkConf().setAppName(APP_NAME).setMaster("local")
> //initialize the spark context
> //val sparkContext = new SparkContext(sparkConf)
> //val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
> // Enable Hive with Hive warehouse in SparkSession
> val spark = 
> SparkSession.builder().appName(APP_NAME).config("hive.metastore.warehouse.dir",
>  HIVE_DATA_WAREHOUSE).config("spark.sql.warehouse.dir", 
> HIVE_DATA_WAREHOUSE).enableHiveSupport().getOrCreate()
> import spark.implicits._
> import spark.sql
>
> val conf = HBaseConfiguration.create()
> conf.set(TableInputFormat.INPUT_TABLE, HBASE_TABLE)
> conf.set(TableInputFormat.SCAN_COLUMNS, HBASE_COLUMN_FAMILY)
> // Load an RDD of rowkey, result(ImmutableBytesWritable, Result) tuples 
> from the table
> val hBaseRDD = spark.sparkContext.newAPIHadoopRDD(conf, 
> classOf[TableInputFormat],
>   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>   classOf[org.apache.hadoop.hbase.client.Result])
>
> println(hBaseRDD.count())
> //hBaseRDD.foreach(println)
>
> //keyValue is a RDD[java.util.list[hbase.KeyValue]]
> val keyValue = hBaseRDD.map(x => x._2).map(_.list)
>
> //outPut is a RDD[String], in which each line represents a record in HBase
> val outPut = keyValue.flatMap(x =>  x.asScala.map(cell =>
>
>   HBaseResult(
> Bytes.toInt(CellUtil.cloneRow(cell)),
> Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
> cell.getTimestamp,
> new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").format(new 
> Date(cell.getTimestamp.toLong)),
> Bytes.toStringBinary(CellUtil.cloneValue(cell)),
> Type.codeToType(cell.getTypeByte).toString
> )
>   )
> ).toDF()
> // Output dataframe
> outPut.show
>
> // get timestamp
> val datetimestamp_threshold = "2016-08-25 14:27:02:001"
> val datetimestampformat = new SimpleDateFormat("yyyy-MM-dd 
> HH:mm:ss:SSS").parse(datetimestamp_threshold).getTime()
>
> // Resultset filteration based on timestamp
> val filtered_output_timestamp = outPut.filter($"colDatetime" >= 
> datetimestampformat)
> // Resultset filteration based on rowkey
> val filtered_output_row = 
> outPut.filter($"colDatetime".between(1668493360,1668493365))
>
>
> // Saving Dataframe to Hive Table Successfully.
> 
> filtered_output_timestamp.write.mode("append").saveAsTable(HIVE_TABLE_NAME)
>   }
>   case class HBaseResult(rowkey: Int, colFamily: String, colQualifier: 
> String, colDatetime: Long, colDatetimeStr: String, colValue: String, colType: 
> String)
> }
>
>
> Error:
>
> 17/01/29 13:51:53 INFO metastore.HiveMetaStore: 0: create_database: 
> Database(name:default, description:default database, 
> locationUri:hdfs://localhost:9000/usr/local/hive/warehouse, parameters:{})
> 17/01/29 13:51:53 INFO HiveMetaStore.audit: ugi=hduser
> ip=unknown-ip-addr  cmd=create_database: Database(name:default, 
> description:default database, 
> locationUri:hdfs://localhost:9000/usr/local/hive/warehouse, parameters:{})   

No Reducer scenarios

2017-01-29 Thread रविशंकर नायर
Dear all,


1) When we don't set the reducer class in the driver program, IdentityReducer
is invoked.

2) When we call setNumReduceTasks(0), no reducer, not even IdentityReducer, is
invoked.

Now, in the second scenario, we observed that the output is in part-m-xx format
(instead of part-r-xx format), which shows the map output. But we know that the
output of a map is always written to the intermediate local file system. So
which class is responsible for taking these intermediate map outputs from the
local file system and writing them to HDFS? Does this particular class perform
this write operation only when setNumReduceTasks is set to zero?

Best, Ravion
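
For reference, a map-only driver sketch in Scala against the Hadoop MapReduce API (the mapper, formats and paths are placeholders): with zero reduce tasks there is no shuffle phase at all, and each map task's RecordWriter (obtained from the job's OutputFormat) writes the part-m-* files directly to the output path on HDFS, rather than leaving shuffle spill files on local disk.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}

// Identity mapper: emits each input record unchanged.
class PassThroughMapper extends Mapper[LongWritable, Text, LongWritable, Text] {
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, LongWritable, Text]#Context): Unit =
    ctx.write(key, value)
}

object MapOnlyJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "map-only")
    job.setJarByClass(classOf[PassThroughMapper])
    job.setMapperClass(classOf[PassThroughMapper])
    job.setNumReduceTasks(0)                 // no reducer at all, not even IdentityReducer
    job.setOutputKeyClass(classOf[LongWritable])
    job.setOutputValueClass(classOf[Text])
    job.setInputFormatClass(classOf[TextInputFormat])
    job.setOutputFormatClass(classOf[TextOutputFormat[LongWritable, Text]])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    // With zero reduces, map output goes straight through TextOutputFormat
    // to part-m-* files under this output path.
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}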


Having multiple spark context

2017-01-29 Thread jasbir.sing
Hi,

I have a requirement in which my application creates one Spark context in
distributed mode and another Spark context in local mode.
When I do this, my complete application works on only one SparkContext (the one
created in distributed mode); the second Spark context is not created.

Can you please help me with how to create two Spark contexts?

Regards,
Jasbir singh





Re: question on spark streaming based on event time

2017-01-29 Thread kant kodali
Hi,

I am sorry, I made a really bad typo. What I meant in my email was actually
Structured Streaming, so I wish I could do s/Spark Streaming/Structured
Streaming/g. Thanks for the pointers; it looks like what I was looking for is
actually watermarking, since my question is all about what I should do if my
data is 24 hours late, or in general if my data is late for longer periods of
time. Watermarking in Spark looks like a very new concept, so let me do that
reading!

Thanks,
kant
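
To make the watermarking reading concrete, a hedged Structured Streaming sketch (Spark 2.1+): declare how late data may arrive with withWatermark and aggregate on the event-time column. The Kafka source, topic and column names are placeholders, not anything from this thread.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder.appName("late-data-sketch").getOrCreate()

// Placeholder source: a Kafka topic whose records carry an event-time column.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp AS eventTime")

val counts = events
  .withWatermark("eventTime", "24 hours")            // accept records up to 24 hours late
  .groupBy(window(col("eventTime"), "5 minutes"))    // aggregate by event-time window
  .count()

// In append mode a window is emitted once the watermark passes it, i.e. once it
// can no longer receive data within the 24-hour lateness bound.
counts.writeStream
  .outputMode("append")
  .format("console")
  .start()
  .awaitTermination()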

On Sun, Jan 29, 2017 at 6:38 PM, Tathagata Das 
wrote:

> Spark Streaming (DStreams) wasn't designed with event time in mind.
> Instead, we have designed Structured Streaming to naturally deal with event
> time. You should check that out. Here are the pointers.
>
> - Programming guide - http://spark.apache.org/docs/latest/structured-
> streaming-programming-guide.html
> - Blog posts
>1. https://databricks.com/blog/2016/07/28/continuous-
> applications-evolving-streaming-in-apache-spark-2-0.html
>2. https://databricks.com/blog/2016/07/28/structured-
> streaming-in-apache-spark.html
> - Talk - https://spark-summit.org/2016/events/a-deep-dive-into-
> structured-streaming/
>
> On Sat, Jan 28, 2017 at 7:05 PM, kant kodali  wrote:
>
>> Hi All,
>>
>> I read through the documentation on Spark Streaming based on event time and
>> how Spark handles lags w.r.t. processing time and so on, but what if the lag
>> between the event time and the processing time is too long? In other words,
>> what should I do if I am receiving yesterday's data (the timestamp on the
>> message shows yesterday's date and time but the processing time is today's
>> time)? And say I also have a dashboard I want to update in real time (as in,
>> whenever I get the data) which shows the past 5 days' worth of data and just
>> keeps updating.
>>
>> Thanks,
>> kant
>>
>>
>


Streaming jobs getting longer

2017-01-29 Thread Saulo Ricci
Hi,

I have two almost identical Spark pipeline applications, but I found a
significant difference in their performance.

Basically, the first application consumes the stream from Kafka, slices it into
batches of one minute, calculates a score for each record using the already
loaded machine learning model, and outputs the end results with scores to a
database.

But the second application ends up stopping after 7 hours of continuous
running, and I observed that each batch job took longer to complete than the
earlier batch jobs. Besides consuming the same streaming data as the first
application, this second application has a couple of additional steps related
to record aggregation.

Since this record aggregation is the only difference between the two
applications, I'd like to ask whether it can explain why my second application
takes gradually longer to complete its streaming batch jobs.

I'd appreciate any help, clue or tip to understand what is going on with this
second application.

Thank you,

Saulo
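
Hard to say without the code, but growing state or lineage in the extra aggregation step is a common culprit for batches that get progressively slower. If the aggregation is a keyed aggregation over a time window, the incremental (inverse-function) form of reduceByKeyAndWindow together with checkpointing keeps per-batch work bounded; below is a hedged DStream sketch with a placeholder source, parsing and durations.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("windowed-scores")
val ssc = new StreamingContext(conf, Seconds(60))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // required for the inverse form

// Stand-in for the Kafka source: lines of "key,score".
val scores = ssc.socketTextStream("localhost", 9999)
  .map { line => val Array(k, v) = line.split(","); (k, v.toDouble) }

val windowedTotals = scores.reduceByKeyAndWindow(
  (a: Double, b: Double) => a + b,   // add batches entering the window
  (a: Double, b: Double) => a - b,   // subtract batches leaving the window
  Minutes(10), Seconds(60))

windowedTotals.print()
ssc.start()
ssc.awaitTermination()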


Re: question on spark streaming based on event time

2017-01-29 Thread Tathagata Das
Spark Streaming (DStreams) wasn't designed with event time in mind.
Instead, we have designed Structured Streaming to naturally deal with event
time. You should check that out. Here are the pointers.

- Programming guide -
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
- Blog posts
   1.
https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
   2.
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
- Talk -
https://spark-summit.org/2016/events/a-deep-dive-into-structured-streaming/

On Sat, Jan 28, 2017 at 7:05 PM, kant kodali  wrote:

> Hi All,
>
> I read through the documentation on Spark Streaming based on event time and
> how Spark handles lags w.r.t. processing time and so on, but what if the lag
> between the event time and the processing time is too long? In other words,
> what should I do if I am receiving yesterday's data (the timestamp on the
> message shows yesterday's date and time but the processing time is today's
> time)? And say I also have a dashboard I want to update in real time (as in,
> whenever I get the data) which shows the past 5 days' worth of data and just
> keeps updating.
>
> Thanks,
> kant
>
>


Re: Error Saving Dataframe to Hive with Spark 2.0.0

2017-01-29 Thread Jacek Laskowski
Hi,

I think you have to upgrade to 2.1.0. There were a few changes w.r.t. the ERROR
since then.

Jacek


On 29 Jan 2017 9:24 a.m., "Chetan Khatri" 
wrote:

Hello Spark Users,

I am getting an error while saving a Spark DataFrame to a Hive table:
Hive 1.2.1
Spark 2.0.0
Local environment.
Note: the job executes successfully and does what I want, but the exception is
still raised.
*Source Code:*

package com.chetan.poc.hbase

/**
  * Created by chetan on 24/1/17.
  */
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.KeyValue.Type
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._
import java.util.Date
import java.text.SimpleDateFormat


object IncrementalJob {
val APP_NAME: String = "SparkHbaseJob"
var HBASE_DB_HOST: String = null
var HBASE_TABLE: String = null
var HBASE_COLUMN_FAMILY: String = null
var HIVE_DATA_WAREHOUSE: String = null
var HIVE_TABLE_NAME: String = null
  def main(args: Array[String]) {
// Initializing HBASE Configuration variables
HBASE_DB_HOST="127.0.0.1"
HBASE_TABLE="university"
HBASE_COLUMN_FAMILY="emp"
// Initializing Hive Metastore configuration
HIVE_DATA_WAREHOUSE = "/usr/local/hive/warehouse"
// Initializing Hive table name - Target table
HIVE_TABLE_NAME = "employees"
// setting spark application
// val sparkConf = new SparkConf().setAppName(APP_NAME).setMaster("local")
//initialize the spark context
//val sparkContext = new SparkContext(sparkConf)
//val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
// Enable Hive with Hive warehouse in SparkSession
val spark =
SparkSession.builder().appName(APP_NAME).config("hive.metastore.warehouse.dir",
HIVE_DATA_WAREHOUSE).config("spark.sql.warehouse.dir",
HIVE_DATA_WAREHOUSE).enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, HBASE_TABLE)
conf.set(TableInputFormat.SCAN_COLUMNS, HBASE_COLUMN_FAMILY)
// Load an RDD of rowkey, result(ImmutableBytesWritable, Result)
tuples from the table
val hBaseRDD = spark.sparkContext.newAPIHadoopRDD(conf,
classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

println(hBaseRDD.count())
//hBaseRDD.foreach(println)

//keyValue is a RDD[java.util.list[hbase.KeyValue]]
val keyValue = hBaseRDD.map(x => x._2).map(_.list)

//outPut is a RDD[String], in which each line represents a record in HBase
val outPut = keyValue.flatMap(x =>  x.asScala.map(cell =>

  HBaseResult(
Bytes.toInt(CellUtil.cloneRow(cell)),
Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
cell.getTimestamp,
new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").format(new
Date(cell.getTimestamp.toLong)),
Bytes.toStringBinary(CellUtil.cloneValue(cell)),
Type.codeToType(cell.getTypeByte).toString
)
  )
).toDF()
// Output dataframe
outPut.show

// get timestamp
val datetimestamp_threshold = "2016-08-25 14:27:02:001"
val datetimestampformat = new SimpleDateFormat("yyyy-MM-dd
HH:mm:ss:SSS").parse(datetimestamp_threshold).getTime()

// Resultset filteration based on timestamp
val filtered_output_timestamp = outPut.filter($"colDatetime" >=
datetimestampformat)
// Resultset filteration based on rowkey
val filtered_output_row =
outPut.filter($"colDatetime".between(1668493360,1668493365))


// Saving Dataframe to Hive Table Successfully.
filtered_output_timestamp.write.mode("append").saveAsTable(HIVE_TABLE_NAME)
  }
  case class HBaseResult(rowkey: Int, colFamily: String, colQualifier:
String, colDatetime: Long, colDatetimeStr: String, colValue: String,
colType: String)
}


Error:

17/01/29 13:51:53 INFO metastore.HiveMetaStore: 0: create_database:
Database(name:default, description:default database,
locationUri:hdfs://localhost:9000/usr/local/hive/warehouse,
parameters:{})
17/01/29 13:51:53 INFO HiveMetaStore.audit:
ugi=hduser  ip=unknown-ip-addr  cmd=create_database:
Database(name:default, description:default database,
locationUri:hdfs://localhost:9000/usr/local/hive/warehouse,
parameters:{})
17/01/29 13:51:53 ERROR metastore.RetryingHMSHandler:
AlreadyExistsException(message:Database default already exists)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 

Re: Examples in graphx

2017-01-29 Thread Felix Cheung
Which graph DB are you thinking about?
Here's one for neo4j

https://neo4j.com/blog/neo4j-3-0-apache-spark-connector/


From: Deepak Sharma 
Sent: Sunday, January 29, 2017 4:28:19 AM
To: spark users
Subject: Examples in graphx

Hi There,
Are there any examples of using GraphX along with any graph DB?
I am looking to persist the graph in a graph-based DB and then read it back in
Spark and process it using GraphX.

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Aggregator mutate b1 in place in merge

2017-01-29 Thread Koert Kuipers
thanks, that's helpful

On Sun, Jan 29, 2017 at 12:54 PM, Anton Okolnychyi <
anton.okolnyc...@gmail.com> wrote:

> Hi,
>
> I recently extended the Spark SQL programming guide to cover user-defined
> aggregations, where I modified the existing variables and returned them in
> reduce and merge. This approach worked and it was approved by people who
> know the context.
>
> Hope that helps.
>
> 2017-01-29 17:17 GMT+01:00 Koert Kuipers :
>
>> anyone?
>> if not i will follow the trail and try to deduce it myself
>>
>> On Mon, Jan 23, 2017 at 2:31 PM, Koert Kuipers  wrote:
>>
>>> looking at the docs for org.apache.spark.sql.expressions.Aggregator it
>>> says for reduce method: "For performance, the function may modify `b` and
>>> return it instead of constructing new object for b.".
>>>
>>> it makes no such comment for the merge method.
>>>
>>> this is surprising to me because i know that for
>>> PairRDDFunctions.aggregateByKey mutation is allowed in both seqOp and
>>> combOp (which are the equivalents of reduce and merge in Aggregator).
>>>
>>> is it safe to mutate b1 and return it in Aggregator.merge?
>>>
>>>
>>
>


RE: forcing dataframe groupby partitioning

2017-01-29 Thread Mendelson, Assaf
Could you explain why this would work?
Assaf.

From: Haviv, Daniel [mailto:dha...@amazon.com]
Sent: Sunday, January 29, 2017 7:09 PM
To: Mendelson, Assaf
Cc: user@spark.apache.org
Subject: Re: forcing dataframe groupby partitioning

If there's no built-in local groupBy, you could do something like this:
df.groupby(C1,C2).agg(...).flatmap(x=>x.groupBy(C1)).agg

Thank you.
Daniel

On 29 Jan 2017, at 18:33, Mendelson, Assaf 
> wrote:
Hi,

Consider the following example:

df.groupby(C1,C2).agg(some agg).groupby(C1).agg(some more agg)

The default way Spark behaves is to shuffle according to the combination of C1
and C2 and then shuffle again by C1 only.
This behavior makes sense when one uses C2 to salt C1 for skewed data; however,
when this is part of the logic, the cost of doing two shuffles instead of one is
not insignificant (even worse when multiple iterations are done).

Is there a way to force the first groupby to shuffle the data so that the 
second groupby would occur within the partition?

Thanks,
Assaf.


Re: Aggregator mutate b1 in place in merge

2017-01-29 Thread Anton Okolnychyi
Hi,

I recently extended the Spark SQL programming guide to cover user-defined
aggregations, where I modified the existing variables and returned them in
reduce and merge. This approach worked and it was approved by people who
know the context.

Hope that helps.
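
For reference, a hedged sketch of that pattern: a typed Aggregator (Spark 2.x) whose reduce and merge both mutate the buffer and return it. The Stats buffer and the average being computed are illustrative, not taken from the guide.

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Stats(var count: Long, var sum: Double)

object AvgAggregator extends Aggregator[Double, Stats, Double] {
  def zero: Stats = Stats(0L, 0.0)
  // Mutate the buffer in place and return it, as the guide does for reduce.
  def reduce(b: Stats, a: Double): Stats = { b.count += 1; b.sum += a; b }
  // Same approach for merge: mutate b1 and return it.
  def merge(b1: Stats, b2: Stats): Stats = { b1.count += b2.count; b1.sum += b2.sum; b1 }
  def finish(r: Stats): Double = if (r.count == 0) 0.0 else r.sum / r.count
  def bufferEncoder: Encoder[Stats] = Encoders.product[Stats]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage on a Dataset[Double]: ds.select(AvgAggregator.toColumn)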

2017-01-29 17:17 GMT+01:00 Koert Kuipers :

> anyone?
> if not i will follow the trail and try to deduce it myself
>
> On Mon, Jan 23, 2017 at 2:31 PM, Koert Kuipers  wrote:
>
>> looking at the docs for org.apache.spark.sql.expressions.Aggregator it
>> says for reduce method: "For performance, the function may modify `b` and
>> return it instead of constructing new object for b.".
>>
>> it makes no such comment for the merge method.
>>
>> this is surprising to me because i know that for
>> PairRDDFunctions.aggregateByKey mutation is allowed in both seqOp and
>> combOp (which are the equivalents of reduce and merge in Aggregator).
>>
>> is it safe to mutate b1 and return it in Aggregator.merge?
>>
>>
>


Re: forcing dataframe groupby partitioning

2017-01-29 Thread Haviv, Daniel
If there's no built-in local groupBy, you could do something like this:
df.groupby(C1,C2).agg(...).flatmap(x=>x.groupBy(C1)).agg

Thank you.
Daniel

On 29 Jan 2017, at 18:33, Mendelson, Assaf 
> wrote:

Hi,

Consider the following example:

df.groupby(C1,C2).agg(some agg).groupby(C1).agg(some more agg)

The default way Spark behaves is to shuffle according to the combination of C1
and C2 and then shuffle again by C1 only.
This behavior makes sense when one uses C2 to salt C1 for skewed data; however,
when this is part of the logic, the cost of doing two shuffles instead of one is
not insignificant (even worse when multiple iterations are done).

Is there a way to force the first groupby to shuffle the data so that the 
second groupby would occur within the partition?

Thanks,
Assaf.


forcing dataframe groupby partitioning

2017-01-29 Thread Mendelson, Assaf
Hi,

Consider the following example:

df.groupby(C1,C2).agg(some agg).groupby(C1).agg(some more agg)

The default way Spark behaves is to shuffle according to the combination of C1
and C2 and then shuffle again by C1 only.
This behavior makes sense when one uses C2 to salt C1 for skewed data; however,
when this is part of the logic, the cost of doing two shuffles instead of one is
not insignificant (even worse when multiple iterations are done).

Is there a way to force the first groupby to shuffle the data so that the 
second groupby would occur within the partition?

Thanks,
Assaf.
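
One thing worth trying (hedged, and version dependent): repartition by C1 alone before the first groupBy. Rows that share C1 then already satisfy the clustering needed for (C1, C2), so the planner may be able to drop both subsequent exchanges; checking explain() on your data is the only real way to confirm. Column and table names below are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder.appName("two-level-agg").getOrCreate()
val df = spark.table("events")                       // placeholder input

val byC1 = df.repartition(col("C1"))                 // single shuffle, keyed by C1 only

val firstAgg  = byC1.groupBy(col("C1"), col("C2")).agg(sum("v").as("v1"))
val secondAgg = firstAgg.groupBy(col("C1")).agg(sum("v1").as("v2"))

secondAgg.explain()   // count the Exchange nodes to see what the planner actually did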


Re: Aggregator mutate b1 in place in merge

2017-01-29 Thread Koert Kuipers
anyone?
if not i will follow the trail and try to deduce it myself

On Mon, Jan 23, 2017 at 2:31 PM, Koert Kuipers  wrote:

> looking at the docs for org.apache.spark.sql.expressions.Aggregator it
> says for reduce method: "For performance, the function may modify `b` and
> return it instead of constructing new object for b.".
>
> it makes no such comment for the merge method.
>
> this is surprising to me because i know that for 
> PairRDDFunctions.aggregateByKey
> mutation is allowed in both seqOp and combOp (which are the equivalents of
> reduce and merge in Aggregator).
>
> is it safe to mutate b1 and return it in Aggregator.merge?
>
>


Re: Hbase and Spark

2017-01-29 Thread Sudev A C
Hi Masf,

Do try the official HBase-Spark module:
https://hbase.apache.org/book.html#spark

I think you will have to build the jar from source and run your Spark
program with --packages.
https://spark-packages.org/package/hortonworks-spark/shc says it's not yet
published to Spark Packages or a Maven repo.


Thanks
Sudev
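
If you do build shc from source, one hedged way to make sbt see it afterwards is to install the built artifact into the local Maven repository and point a resolver at it; the coordinates below are simply the ones from the original build.sbt and may not match what the build actually publishes. Alternatively, passing the jar explicitly with spark-submit --jars sidesteps dependency resolution entirely.

// build.sbt fragment (sketch): pick up a locally installed shc build.
resolvers += Resolver.mavenLocal

val hbaseConnectorVersion = "1.0.0-1.6-s_2.10"
libraryDependencies += "com.hortonworks" % "shc" % hbaseConnectorVersion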


On Sun, 29 Jan 2017 at 5:53 PM, Masf  wrote:

I'm trying to build an application where it is necessary to do bulkGet and
bulkLoad operations on HBase.

I think that I could use this component:
https://github.com/hortonworks-spark/shc
Is it a good option?

But I can't import it into my project; sbt cannot resolve the HBase
connector.
This is my build.sbt:

version := "1.0"
scalaVersion := "2.10.6"

mainClass in assembly := Some("com.location.userTransaction")

assemblyOption in assembly ~= { _.copy(includeScala = false) }

resolvers += "Typesafe Repo" at "http://repo.typesafe.com/typesafe/releases/
"

val sparkVersion = "1.6.0"
val jettyVersion = "8.1.14.v20131031"
val hbaseConnectorVersion = "1.0.0-1.6-s_2.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
)
libraryDependencies += "com.hortonworks" % "shc" % hbaseConnectorVersion
libraryDependencies += "org.eclipse.jetty" % "jetty-client" % jettyVersion


-- 


Regards.
Miguel Ángel


Re: spark architecture question -- Please Read

2017-01-29 Thread Mich Talebzadeh
Sorry, a correction:


   1. Put the csv files into HDFS /apps//data/staging/
   2. Multiple csv files for the same table can co-exist
   3. like val df1 = spark.read.option("header", false).csv(location)
   4. Once the csv files are read into a df, you can do loads of things. The
   csv files have to reside in HDFS


HTH
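
A slightly fuller sketch of those steps, with a placeholder application name, staging directory and table name standing in for the truncated path above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("csv-staging").getOrCreate()

// All csv files for one table sit under one HDFS staging directory.
val location = "hdfs:///apps/myapp/data/staging/MY_TABLE"
val df1 = spark.read
  .option("header", "false")
  .option("inferSchema", "true")     // or supply an explicit schema
  .csv(location)

df1.createOrReplaceTempView("my_table_staging")
spark.sql("SELECT COUNT(*) FROM my_table_staging").show()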





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com





On 29 January 2017 at 15:16, Mich Talebzadeh 
wrote:

> you can use Spark directly on csv file.
>
>
>1. Put the csv files into HDFS /apps//data/staging/<
>TABLE_NAME>
>2. Multiple csv files for the same table can co-exist
>3. like df1 = spark.read.option("header", false).csv(location)
>4.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 29 January 2017 at 14:37, Alex  wrote:
>
>> But for persistance after intermediate processing can i use spark cluster
>> itself or i have to use hadoop cluster?!
>>
>> On Jan 29, 2017 7:36 PM, "Deepak Sharma"  wrote:
>>
>> The better way is to read the data directly into spark using spark sql
>> read jdbc .
>> Apply the udf's locally .
>> Then save the data frame back to Oracle using dataframe's write jdbc.
>>
>> Thanks
>> Deepak
>>
>> On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:
>>
>>> One alternative could be the oracle Hadoop loader and other Oracle
>>> products, but you have to invest some money and probably buy their Hadoop
>>> Appliance, which you have to evaluate if it make sense (can get expensive
>>> with large clusters etc).
>>>
>>> Another alternative would be to get rid of Oracle alltogether and use
>>> other databases.
>>>
>>> However, can you elaborate a little bit on your use case and the
>>> business logic as well as SLA requires. Otherwise all recommendations are
>>> right because the requirements you presented are very generic.
>>>
>>> About get rid of Hadoop - this depends! You will need some resource
>>> manager (yarn, mesos, kubernetes etc) and most likely also a distributed
>>> file system. Spark supports through the Hadoop apis a wide range of file
>>> systems, but does not need HDFS for persistence. You can have local
>>> filesystem (ie any file system mounted to a node, so also distributed ones,
>>> such as zfs), cloud file systems (s3, azure blob etc).
>>>
>>>
>>>
>>> On 29 Jan 2017, at 11:18, Alex  wrote:
>>>
>>> Hi All,
>>>
>>> Thanks for your response .. Please find below flow diagram
>>>
>>> Please help me out simplifying this architecture using Spark
>>>
>>> 1) Can i skip step 1 to step 4 and directly store it in spark
>>> if I am storing it in spark where actually it is getting stored
>>> Do i need to retain HAdoop to store data
>>> or can i directly store it in spark and remove hadoop also?
>>>
>>> I want to remove informatica for preprocessing and directly load the
>>> files data coming from server to Hadoop/Spark
>>>
>>> So My Question is Can i directly load files data to spark ? Then where
>>> exactly the data will get stored.. Do I need to have Spark installed on Top
>>> of HDFS?
>>>
>>> 2) if I am retaining below architecture Can I store back output from
>>> spark directly to oracle from step 5 to step 7
>>>
>>> and will spark way of storing it back to oracle will be better than
>>> using sqoop performance wise
>>> 3)Can I use SPark scala UDF to process data from hive and retain entire
>>> architecture
>>>
>>> which among the above would be optimal
>>>
>>> [image: Inline image 1]
>>>
>>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
>>> wrote:
>>>
 I strongly agree with Jorn and Russell. There are different solutions
 for data movement depending upon your needs frequency, bi-directional
 drivers. workflow, handling duplicate records. This is a space is known as
 " Change Data Capture - CDC" for short. If you need more information, I

Re: spark architecture question -- Please Read

2017-01-29 Thread Mich Talebzadeh
You can use Spark directly on the csv files.


   1. Put the csv files into HDFS /apps//data/staging/
   2. Multiple csv files for the same table can co-exist
   3. like df1 = spark.read.option("header", false).csv(location)
   4.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com





On 29 January 2017 at 14:37, Alex  wrote:

> But for persistance after intermediate processing can i use spark cluster
> itself or i have to use hadoop cluster?!
>
> On Jan 29, 2017 7:36 PM, "Deepak Sharma"  wrote:
>
> The better way is to read the data directly into spark using spark sql
> read jdbc .
> Apply the udf's locally .
> Then save the data frame back to Oracle using dataframe's write jdbc.
>
> Thanks
> Deepak
>
> On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:
>
>> One alternative could be the oracle Hadoop loader and other Oracle
>> products, but you have to invest some money and probably buy their Hadoop
>> Appliance, which you have to evaluate if it make sense (can get expensive
>> with large clusters etc).
>>
>> Another alternative would be to get rid of Oracle alltogether and use
>> other databases.
>>
>> However, can you elaborate a little bit on your use case and the business
>> logic as well as SLA requires. Otherwise all recommendations are right
>> because the requirements you presented are very generic.
>>
>> About get rid of Hadoop - this depends! You will need some resource
>> manager (yarn, mesos, kubernetes etc) and most likely also a distributed
>> file system. Spark supports through the Hadoop apis a wide range of file
>> systems, but does not need HDFS for persistence. You can have local
>> filesystem (ie any file system mounted to a node, so also distributed ones,
>> such as zfs), cloud file systems (s3, azure blob etc).
>>
>>
>>
>> On 29 Jan 2017, at 11:18, Alex  wrote:
>>
>> Hi All,
>>
>> Thanks for your response .. Please find below flow diagram
>>
>> Please help me out simplifying this architecture using Spark
>>
>> 1) Can i skip step 1 to step 4 and directly store it in spark
>> if I am storing it in spark where actually it is getting stored
>> Do i need to retain HAdoop to store data
>> or can i directly store it in spark and remove hadoop also?
>>
>> I want to remove informatica for preprocessing and directly load the
>> files data coming from server to Hadoop/Spark
>>
>> So My Question is Can i directly load files data to spark ? Then where
>> exactly the data will get stored.. Do I need to have Spark installed on Top
>> of HDFS?
>>
>> 2) if I am retaining below architecture Can I store back output from
>> spark directly to oracle from step 5 to step 7
>>
>> and will spark way of storing it back to oracle will be better than using
>> sqoop performance wise
>> 3)Can I use SPark scala UDF to process data from hive and retain entire
>> architecture
>>
>> which among the above would be optimal
>>
>> [image: Inline image 1]
>>
>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
>> wrote:
>>
>>> I strongly agree with Jorn and Russell. There are different solutions
>>> for data movement depending upon your needs frequency, bi-directional
>>> drivers. workflow, handling duplicate records. This is a space is known as
>>> " Change Data Capture - CDC" for short. If you need more information, I
>>> would be happy to chat with you.  I built some products in this space that
>>> extensively used connection pooling over ODBC/JDBC.
>>>
>>> Happy to chat if you need more information.
>>>
>>> -Sachin Naik
>>>
>>> >>Hard to tell. Can you give more insights >>on what you try to achieve
>>> and what the data is about?
>>> >>For example, depending on your use case sqoop can make sense or not.
>>> Sent from my iPhone
>>>
>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer 
>>> wrote:
>>>
>>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
>>> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
>>> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
>>> way back out (see the same link) and write directly to Oracle. I'll leave
>>> the performance questions for someone else.
>>>
>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu 
>>> wrote:
>>>

 On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu 
 wrote:

 Hi Team,

 RIght now our existing flow is


Re: spark architecture question -- Please Read

2017-01-29 Thread Alex
But for persistence after intermediate processing, can I use the Spark cluster
itself, or do I have to use a Hadoop cluster?!

On Jan 29, 2017 7:36 PM, "Deepak Sharma"  wrote:

The better way is to read the data directly into spark using spark sql read
jdbc .
Apply the udf's locally .
Then save the data frame back to Oracle using dataframe's write jdbc.

Thanks
Deepak

On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:

> One alternative could be the oracle Hadoop loader and other Oracle
> products, but you have to invest some money and probably buy their Hadoop
> Appliance, which you have to evaluate if it make sense (can get expensive
> with large clusters etc).
>
> Another alternative would be to get rid of Oracle alltogether and use
> other databases.
>
> However, can you elaborate a little bit on your use case and the business
> logic as well as SLA requires. Otherwise all recommendations are right
> because the requirements you presented are very generic.
>
> About get rid of Hadoop - this depends! You will need some resource
> manager (yarn, mesos, kubernetes etc) and most likely also a distributed
> file system. Spark supports through the Hadoop apis a wide range of file
> systems, but does not need HDFS for persistence. You can have local
> filesystem (ie any file system mounted to a node, so also distributed ones,
> such as zfs), cloud file systems (s3, azure blob etc).
>
>
>
> On 29 Jan 2017, at 11:18, Alex  wrote:
>
> Hi All,
>
> Thanks for your response .. Please find below flow diagram
>
> Please help me out simplifying this architecture using Spark
>
> 1) Can i skip step 1 to step 4 and directly store it in spark
> if I am storing it in spark where actually it is getting stored
> Do i need to retain HAdoop to store data
> or can i directly store it in spark and remove hadoop also?
>
> I want to remove informatica for preprocessing and directly load the files
> data coming from server to Hadoop/Spark
>
> So My Question is Can i directly load files data to spark ? Then where
> exactly the data will get stored.. Do I need to have Spark installed on Top
> of HDFS?
>
> 2) if I am retaining below architecture Can I store back output from spark
> directly to oracle from step 5 to step 7
>
> and will spark way of storing it back to oracle will be better than using
> sqoop performance wise
> 3)Can I use SPark scala UDF to process data from hive and retain entire
> architecture
>
> which among the above would be optimal
>
> [image: Inline image 1]
>
> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
> wrote:
>
>> I strongly agree with Jorn and Russell. There are different solutions for
>> data movement depending upon your needs frequency, bi-directional drivers.
>> workflow, handling duplicate records. This is a space is known as " Change
>> Data Capture - CDC" for short. If you need more information, I would be
>> happy to chat with you.  I built some products in this space that
>> extensively used connection pooling over ODBC/JDBC.
>>
>> Happy to chat if you need more information.
>>
>> -Sachin Naik
>>
>> >>Hard to tell. Can you give more insights >>on what you try to achieve
>> and what the data is about?
>> >>For example, depending on your use case sqoop can make sense or not.
>> Sent from my iPhone
>>
>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer 
>> wrote:
>>
>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
>> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
>> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
>> way back out (see the same link) and write directly to Oracle. I'll leave
>> the performance questions for someone else.
>>
>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu 
>> wrote:
>>
>>>
>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu 
>>> wrote:
>>>
>>> Hi Team,
>>>
>>> RIght now our existing flow is
>>>
>>> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive
>>> Context)-->Destination Hive table -->sqoop export to Oracle
>>>
>>> Half of the Hive UDFS required is developed in Java UDF..
>>>
>>> SO Now I want to know if I run the native scala UDF's than runninng hive
>>> java udfs in spark-sql will there be any performance difference
>>>
>>>
>>> Can we skip the Sqoop Import and export part and
>>>
>>> Instead directly load data from oracle to spark and code Scala UDF's for
>>> transformations and export output data back to oracle?
>>>
>>> RIght now the architecture we are using is
>>>
>>> oracle-->Sqoop (Import)-->Hive Tables--> Hive Queries --> Spark-SQL-->
>>> Hive --> Oracle
>>> what would be optimal architecture to process data from oracle using
>>> spark ?? can i anyway better this process ?
>>>
>>>
>>>
>>>
>>> Regards,
>>> Sirisha
>>>
>>>
>>>
>


Re: spark architecture question -- Please Read

2017-01-29 Thread Jörn Franke
I meant a distributed file system such as Ceph, Gluster, etc.

> On 29 Jan 2017, at 14:45, Jörn Franke  wrote:
> 
> One alternative could be the oracle Hadoop loader and other Oracle products, 
> but you have to invest some money and probably buy their Hadoop Appliance, 
> which you have to evaluate if it make sense (can get expensive with large 
> clusters etc).
> 
> Another alternative would be to get rid of Oracle alltogether and use other 
> databases.
> 
> However, can you elaborate a little bit on your use case and the business 
> logic as well as SLA requires. Otherwise all recommendations are right 
> because the requirements you presented are very generic.
> 
> About get rid of Hadoop - this depends! You will need some resource manager 
> (yarn, mesos, kubernetes etc) and most likely also a distributed file system. 
> Spark supports through the Hadoop apis a wide range of file systems, but does 
> not need HDFS for persistence. You can have local filesystem (ie any file 
> system mounted to a node, so also distributed ones, such as zfs), cloud file 
> systems (s3, azure blob etc).
> 
> 
> 
>> On 29 Jan 2017, at 11:18, Alex  wrote:
>> 
>> Hi All,
>> 
>> Thanks for your response .. Please find below flow diagram
>> 
>> Please help me out simplifying this architecture using Spark
>> 
>> 1) Can i skip step 1 to step 4 and directly store it in spark
>> if I am storing it in spark where actually it is getting stored
>> Do i need to retain HAdoop to store data
>> or can i directly store it in spark and remove hadoop also?
>> 
>> I want to remove informatica for preprocessing and directly load the files 
>> data coming from server to Hadoop/Spark
>> 
>> So My Question is Can i directly load files data to spark ? Then where 
>> exactly the data will get stored.. Do I need to have Spark installed on Top 
>> of HDFS?
>> 
>> 2) if I am retaining below architecture Can I store back output from spark 
>> directly to oracle from step 5 to step 7 
>> 
>> and will spark way of storing it back to oracle will be better than using 
>> sqoop performance wise
>> 3)Can I use SPark scala UDF to process data from hive and retain entire 
>> architecture 
>> 
>> which among the above would be optimal
>> 
>> 
>> 
>>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik  
>>> wrote:
>>> I strongly agree with Jorn and Russell. There are different solutions for 
>>> data movement depending upon your needs frequency, bi-directional drivers. 
>>> workflow, handling duplicate records. This is a space is known as " Change 
>>> Data Capture - CDC" for short. If you need more information, I would be 
>>> happy to chat with you.  I built some products in this space that 
>>> extensively used connection pooling over ODBC/JDBC. 
>>> 
>>> Happy to chat if you need more information. 
>>> 
>>> -Sachin Naik
>>> 
>>> >>Hard to tell. Can you give more insights >>on what you try to achieve and 
>>> >>what the data is about?
>>> >>For example, depending on your use case sqoop can make sense or not.
>>> Sent from my iPhone
>>> 
 On Jan 27, 2017, at 11:22 PM, Russell Spitzer  
 wrote:
 
 You can treat Oracle as a JDBC source 
 (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
  and skip Sqoop, HiveTables and go straight to Queries. Then you can skip 
 hive on the way back out (see the same link) and write directly to Oracle. 
 I'll leave the performance questions for someone else. 
 
> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu  
> wrote:
> 
> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu  
> wrote:
> Hi Team,
> 
> RIght now our existing flow is
> 
> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive 
> Context)-->Destination Hive table -->sqoop export to Oracle
> 
> Half of the Hive UDFS required is developed in Java UDF..
> 
> SO Now I want to know if I run the native scala UDF's than runninng hive 
> java udfs in spark-sql will there be any performance difference
> 
> 
> Can we skip the Sqoop Import and export part and 
> 
> Instead directly load data from oracle to spark and code Scala UDF's for 
> transformations and export output data back to oracle?
> 
> RIght now the architecture we are using is
> 
> oracle-->Sqoop (Import)-->Hive Tables--> Hive Queries --> Spark-SQL--> 
> Hive --> Oracle 
> what would be optimal architecture to process data from oracle using 
> spark ?? can i anyway better this process ?
> 
> 
> 
> 
> Regards,
> Sirisha 
> 
>> 


Re: spark architecture question -- Please Read

2017-01-29 Thread Deepak Sharma
The better way is to read the data directly into Spark using the Spark SQL JDBC
read.
Apply the UDFs locally.
Then save the DataFrame back to Oracle using the DataFrame's JDBC write.

Thanks
Deepak
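
A hedged sketch of that flow: read over JDBC, transform with a locally applied UDF, write back over JDBC. Connection details, table names and the UDF are placeholders rather than the original poster's values, and the Oracle JDBC driver jar has to be on the classpath.

import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder.appName("oracle-direct").getOrCreate()

val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"          // placeholder connection URL
val props = new Properties()
props.setProperty("user", "app_user")
props.setProperty("password", "app_password")
props.setProperty("driver", "oracle.jdbc.OracleDriver")

val src = spark.read.jdbc(url, "SOURCE_TABLE", props)     // 1. read straight from Oracle

val upperName = udf((s: String) => if (s == null) null else s.toUpperCase)  // stand-in for the real UDFs
val transformed = src.withColumn("NAME", upperName(col("NAME")))            // 2. apply UDFs in Spark

transformed.write                                          // 3. write the result back to Oracle
  .mode(SaveMode.Append)
  .jdbc(url, "TARGET_TABLE", props)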

On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:

> One alternative could be the oracle Hadoop loader and other Oracle
> products, but you have to invest some money and probably buy their Hadoop
> Appliance, which you have to evaluate if it make sense (can get expensive
> with large clusters etc).
>
> Another alternative would be to get rid of Oracle alltogether and use
> other databases.
>
> However, can you elaborate a little bit on your use case and the business
> logic as well as SLA requires. Otherwise all recommendations are right
> because the requirements you presented are very generic.
>
> About get rid of Hadoop - this depends! You will need some resource
> manager (yarn, mesos, kubernetes etc) and most likely also a distributed
> file system. Spark supports through the Hadoop apis a wide range of file
> systems, but does not need HDFS for persistence. You can have local
> filesystem (ie any file system mounted to a node, so also distributed ones,
> such as zfs), cloud file systems (s3, azure blob etc).
>
>
>
> On 29 Jan 2017, at 11:18, Alex  wrote:
>
> Hi All,
>
> Thanks for your response .. Please find below flow diagram
>
> Please help me out simplifying this architecture using Spark
>
> 1) Can i skip step 1 to step 4 and directly store it in spark
> if I am storing it in spark where actually it is getting stored
> Do i need to retain HAdoop to store data
> or can i directly store it in spark and remove hadoop also?
>
> I want to remove informatica for preprocessing and directly load the files
> data coming from server to Hadoop/Spark
>
> So My Question is Can i directly load files data to spark ? Then where
> exactly the data will get stored.. Do I need to have Spark installed on Top
> of HDFS?
>
> 2) if I am retaining below architecture Can I store back output from spark
> directly to oracle from step 5 to step 7
>
> and will spark way of storing it back to oracle will be better than using
> sqoop performance wise
> 3)Can I use SPark scala UDF to process data from hive and retain entire
> architecture
>
> which among the above would be optimal
>
> [image: Inline image 1]
>
> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
> wrote:
>
>> I strongly agree with Jorn and Russell. There are different solutions for
>> data movement depending upon your needs frequency, bi-directional drivers.
>> workflow, handling duplicate records. This is a space is known as " Change
>> Data Capture - CDC" for short. If you need more information, I would be
>> happy to chat with you.  I built some products in this space that
>> extensively used connection pooling over ODBC/JDBC.
>>
>> Happy to chat if you need more information.
>>
>> -Sachin Naik
>>
>> >>Hard to tell. Can you give more insights >>on what you try to achieve
>> and what the data is about?
>> >>For example, depending on your use case sqoop can make sense or not.
>> Sent from my iPhone
>>
>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer 
>> wrote:
>>
>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
>> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
>> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
>> way back out (see the same link) and write directly to Oracle. I'll leave
>> the performance questions for someone else.
>>
>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu 
>> wrote:
>>
>>>
>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu 
>>> wrote:
>>>
>>> Hi Team,
>>>
>>> RIght now our existing flow is
>>>
>>> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive
>>> Context)-->Destination Hive table -->sqoop export to Oracle
>>>
>>> Half of the Hive UDFS required is developed in Java UDF..
>>>
>>> SO Now I want to know if I run the native scala UDF's than runninng hive
>>> java udfs in spark-sql will there be any performance difference
>>>
>>>
>>> Can we skip the Sqoop Import and export part and
>>>
>>> Instead directly load data from oracle to spark and code Scala UDF's for
>>> transformations and export output data back to oracle?
>>>
>>> RIght now the architecture we are using is
>>>
>>> oracle-->Sqoop (Import)-->Hive Tables--> Hive Queries --> Spark-SQL-->
>>> Hive --> Oracle
>>> what would be optimal architecture to process data from oracle using
>>> spark ?? can i anyway better this process ?
>>>
>>>
>>>
>>>
>>> Regards,
>>> Sirisha
>>>
>>>
>>>
>


Invitation to speak about GraphX at London Opensource Graph Technologies Meetup

2017-01-29 Thread haikal
Hi everyone,

We're starting a new meetup in London: Opensource Graph Technologies. Our
goal is to increase the awareness of opensource graph technologies and their
applications to the London developer community. In the past week, 119 people
have signed up to the group, and we're hoping to continue growing the community
with the series of talks that we'll be holding.

The first meetup we're planning to host is during the week of the 6th of
March, in Central London. We would like to include GraphX as one of the
technologies being introduced to the London developer community.

Is anyone interested in giving a 20-minute talk, introducing GraphX (any
general or specific application of it) to the community?

Below is the link to the meetup group and page that we've set up. We'll be
updating the details and speaker list in the coming days, but I would very much
love to reserve at least one talk for GraphX!

https://www.meetup.com/graphs/events/237191885/




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/nvitation-to-speak-about-GraphX-at-London-Opensource-Graph-Technologies-Meetup-tp28342.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: spark architecture question -- Please Read

2017-01-29 Thread Jörn Franke
One alternative could be the oracle Hadoop loader and other Oracle products, 
but you have to invest some money and probably buy their Hadoop Appliance, 
which you have to evaluate if it make sense (can get expensive with large 
clusters etc).

Another alternative would be to get rid of Oracle alltogether and use other 
databases.

However, can you elaborate a little bit on your use case and the business logic,
as well as the SLA requirements? Otherwise all recommendations are equally valid,
because the requirements you presented are very generic.

About getting rid of Hadoop - this depends! You will need some resource manager 
(yarn, mesos, kubernetes etc) and most likely also a distributed file system. 
Spark supports through the Hadoop apis a wide range of file systems, but does 
not need HDFS for persistence. You can have local filesystem (ie any file 
system mounted to a node, so also distributed ones, such as zfs), cloud file 
systems (s3, azure blob etc).



> On 29 Jan 2017, at 11:18, Alex  wrote:
> 
> Hi All,
> 
> Thanks for your response .. Please find below flow diagram
> 
> Please help me out simplifying this architecture using Spark
> 
> 1) Can i skip step 1 to step 4 and directly store it in spark
> if I am storing it in spark where actually it is getting stored
> Do i need to retain HAdoop to store data
> or can i directly store it in spark and remove hadoop also?
> 
> I want to remove informatica for preprocessing and directly load the files 
> data coming from server to Hadoop/Spark
> 
> So My Question is Can i directly load files data to spark ? Then where 
> exactly the data will get stored.. Do I need to have Spark installed on Top 
> of HDFS?
> 
> 2) if I am retaining below architecture Can I store back output from spark 
> directly to oracle from step 5 to step 7 
> 
> and will spark way of storing it back to oracle will be better than using 
> sqoop performance wise
> 3)Can I use SPark scala UDF to process data from hive and retain entire 
> architecture 
> 
> which among the above would be optimal
> 
> 
> 
>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik  
>> wrote:
>> I strongly agree with Jorn and Russell. There are different solutions for 
>> data movement depending upon your needs frequency, bi-directional drivers. 
>> workflow, handling duplicate records. This is a space is known as " Change 
>> Data Capture - CDC" for short. If you need more information, I would be 
>> happy to chat with you.  I built some products in this space that 
>> extensively used connection pooling over ODBC/JDBC. 
>> 
>> Happy to chat if you need more information. 
>> 
>> -Sachin Naik
>> 
>> >>Hard to tell. Can you give more insights >>on what you try to achieve and 
>> >>what the data is about?
>> >>For example, depending on your use case sqoop can make sense or not.
>> Sent from my iPhone
>> 
>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer  
>>> wrote:
>>> 
>>> You can treat Oracle as a JDBC source 
>>> (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
>>>  and skip Sqoop, HiveTables and go straight to Queries. Then you can skip 
>>> hive on the way back out (see the same link) and write directly to Oracle. 
>>> I'll leave the performance questions for someone else. 
>>> 
 On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu  
 wrote:
 
 On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu  
 wrote:
 Hi Team,
 
 RIght now our existing flow is
 
 Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive 
 Context)-->Destination Hive table -->sqoop export to Oracle
 
 Half of the Hive UDFS required is developed in Java UDF..
 
 SO Now I want to know if I run the native scala UDF's than runninng hive 
 java udfs in spark-sql will there be any performance difference
 
 
 Can we skip the Sqoop Import and export part and 
 
 Instead directly load data from oracle to spark and code Scala UDF's for 
 transformations and export output data back to oracle?
 
 RIght now the architecture we are using is
 
 oracle-->Sqoop (Import)-->Hive Tables--> Hive Queries --> Spark-SQL--> 
 Hive --> Oracle 
 what would be optimal architecture to process data from oracle using spark 
 ?? can i anyway better this process ?
 
 
 
 
 Regards,
 Sirisha 
 
> 


Examples in graphx

2017-01-29 Thread Deepak Sharma
Hi There,
Are there any examples of using GraphX along with any graph DB?
I am looking to persist the graph in a graph-based DB and then read it back
into Spark and process it using GraphX.

-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Hbase and Spark

2017-01-29 Thread Masf
I'm trying to build an application where it is necessary to do bulkGet and
bulkLoad operations on HBase.

I think that I could use this component
https://github.com/hortonworks-spark/shc
Is it a good option?

But I can't import it in my project: sbt cannot resolve the HBase
connector.
This is my build.sbt:

version := "1.0"
scalaVersion := "2.10.6"

mainClass in assembly := Some("com.location.userTransaction")

assemblyOption in assembly ~= { _.copy(includeScala = false) }

resolvers += "Typesafe Repo" at "http://repo.typesafe.com/typesafe/releases/
"

val sparkVersion = "1.6.0"
val jettyVersion = "8.1.14.v20131031"
val hbaseConnectorVersion = "1.0.0-1.6-s_2.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
)
libraryDependencies += "com.hortonworks" % "shc" % hbaseConnectorVersion
libraryDependencies += "org.eclipse.jetty" % "jetty-client" % jettyVersion

-- 


Regards.
Miguel Ángel


Re: spark architecture question -- Pleas Read

2017-01-29 Thread Mich Talebzadeh
This is classic; there is nothing special about it.


   1. Your source is Oracle schema tables
   2. You can use an Oracle JDBC connection with DIRECT CONNECT and parallel
   processing to read your data from the Oracle table into Spark using JDBC (a
   minimal read/write sketch follows this list). Ensure that you are getting
   data from the Oracle DB at a time when the DB is not busy and the network
   between Spark and Oracle is reasonable. You will be creating multiple
   connections to your Oracle database from Spark
   3. Create a DF from RDD and ingest your data into Hive staging tables.
   This should be pretty fast. If you are using a recent version of Spark >
   1.5 you can see this in Spark GUI
   4. Once data is ingested into Hive table (frequency Discrete, Recurring
   or Cumulative), then you have your source data in Hive
   5. Do your work in Hive staging tables and then your enriched data will
   go into Hive enriched tables (different from your staging tables). You can
   use Spark to enrich (transform) your data on Hive staging tables
   6. Then use Spark to send that data into the Oracle table (also shown in
   the sketch below). Again, bear in mind that the application has to handle
   consistency from Big Data into the RDBMS, for example what you are going
   to do with failed transactions in Oracle
   7. From my experience you also need some staging tables in Oracle to
   handle inserts from Hive via Spark into the Oracle table
   8. Finally run a job in PL/SQL to load Oracle target tables from Oracle
   staging tables
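
Something along these lines covers steps 2, 3 and 6 (the JDBC URL, table names,
credentials and partition column are placeholders for illustration, and you
need the Oracle JDBC driver on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("oracle-to-hive-to-oracle")
  .enableHiveSupport()
  .getOrCreate()

// Step 2: parallel JDBC read - Spark opens one Oracle connection per partition
val staging = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//oracle-host:1521/MYDB")
  .option("dbtable", "SCOTT.SALES")
  .option("user", "scott")
  .option("password", "tiger")
  .option("partitionColumn", "SALE_ID")   // a numeric column to split on
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")           // 8 concurrent connections to Oracle
  .load()

// Step 3: land the data in a Hive staging table
staging.write.mode("overwrite").saveAsTable("staging.sales_stg")

// Step 5: enrichment with Spark SQL over the staging tables would go here

// Step 6: push the enriched result into an Oracle staging table
spark.table("enriched.sales")
  .write
  .mode("append")
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//oracle-host:1521/MYDB")
  .option("dbtable", "SCOTT.SALES_STG")
  .option("user", "scott")
  .option("password", "tiger")
  .save()

numPartitions and the bounds decide how many concurrent sessions hit Oracle, so
agree those figures with your DBA.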

Notes:

Oracle column types are not 100% compatible with Spark. For example, Spark
does not recognize the CHAR column type and it has to be converted into
VARCHAR or STRING.
Hive does not have the concept of Oracle's "WITH" clause inline table, so a
script that works in Oracle may not work in Hive. Windowing functions
should be fine.
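
If a CHAR column does cause trouble, one workaround (column and table names
are made up) is to push the cast down into the JDBC query itself:

// Hypothetical example, reusing the SparkSession from the sketch above: cast
// the Oracle CHAR column to VARCHAR2 inside the pushdown query so that Spark
// only ever sees a string type.
val castQuery =
  "(SELECT CAST(STATUS AS VARCHAR2(10)) AS STATUS, SALE_ID FROM SCOTT.SALES) t"

val sales = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//oracle-host:1521/MYDB")
  .option("dbtable", castQuery)
  .option("user", "scott")
  .option("password", "tiger")
  .load()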

I tend to do all this via shell script that gives control at each layer and
creates alarms.

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 January 2017 at 10:18, Alex  wrote:

> Hi All,
>
> Thanks for your response .. Please find below flow diagram
>
> Please help me out simplifying this architecture using Spark
>
> 1) Can i skip step 1 to step 4 and directly store it in spark
> if I am storing it in spark where actually it is getting stored
> Do i need to retain HAdoop to store data
> or can i directly store it in spark and remove hadoop also?
>
> I want to remove informatica for preprocessing and directly load the files
> data coming from server to Hadoop/Spark
>
> So My Question is Can i directly load files data to spark ? Then where
> exactly the data will get stored.. Do I need to have Spark installed on Top
> of HDFS?
>
> 2) if I am retaining below architecture Can I store back output from spark
> directly to oracle from step 5 to step 7
>
> and will spark way of storing it back to oracle will be better than using
> sqoop performance wise
> 3)Can I use SPark scala UDF to process data from hive and retain entire
> architecture
>
> which among the above would be optimal
>
> [image: Inline image 1]
>
> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
> wrote:
>
>> I strongly agree with Jorn and Russell. There are different solutions for
>> data movement depending upon your needs frequency, bi-directional drivers.
>> workflow, handling duplicate records. This is a space is known as " Change
>> Data Capture - CDC" for short. If you need more information, I would be
>> happy to chat with you.  I built some products in this space that
>> extensively used connection pooling over ODBC/JDBC.
>>
>> Happy to chat if you need more information.
>>
>> -Sachin Naik
>>
>> >>Hard to tell. Can you give more insights >>on what you try to achieve
>> and what the data is about?
>> >>For example, depending on your use case sqoop can make sense or not.
>> Sent from my iPhone
>>
>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer 
>> wrote:
>>
>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
>> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
>> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
>> way back out (see the same link) and write directly to Oracle. I'll leave
>> the performance questions for someone else.
>>
>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu 
>> wrote:
>>
>>>
>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu 
>>> wrote:
>>>
>>> Hi 

Re: spark architecture question -- Pleas Read

2017-01-29 Thread Alex
Hi All,

Thanks for your response .. Please find below flow diagram

Please help me out simplifying this architecture using Spark

1) Can I skip steps 1 to 4 and store the data directly in Spark?
If I store it in Spark, where does it actually get stored?
Do I need to retain Hadoop to store the data,
or can I store it directly in Spark and remove Hadoop as well?

I want to remove Informatica for preprocessing and directly load the file
data coming from the server into Hadoop/Spark.

So my question is: can I load the file data directly into Spark? If so, where
exactly will the data get stored? Do I need to have Spark installed on top
of HDFS?

2) If I am retaining the architecture below, can I store the output from Spark
directly back to Oracle (from step 5 to step 7),

and will Spark's way of storing it back to Oracle be better than using
Sqoop performance-wise?
3) Can I use Spark Scala UDFs to process data from Hive and retain the entire
architecture? (A rough example of what I mean is below the diagram.)

Which among the above would be optimal?

[image: Inline image 1]
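
To make question 3 concrete, this is roughly what I mean by a native Scala UDF
versus the existing Hive Java UDF (the column name, logic and Hive UDF class
are made up; it assumes an existing SparkSession `spark` with Hive support):

import org.apache.spark.sql.functions.{col, udf}

// A "native" Scala UDF registered through the DataFrame API:
val normalise = udf((s: String) => if (s == null) "" else s.trim.toUpperCase)
val enriched = spark.table("staging_table").withColumn("region", normalise(col("region")))

// versus reusing the existing Hive Java UDF from spark-sql:
spark.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.udf.MyHiveUDF'")
spark.sql("SELECT my_udf(region) FROM staging_table")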

On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
wrote:

> I strongly agree with Jorn and Russell. There are different solutions for
> data movement depending upon your needs frequency, bi-directional drivers.
> workflow, handling duplicate records. This is a space is known as " Change
> Data Capture - CDC" for short. If you need more information, I would be
> happy to chat with you.  I built some products in this space that
> extensively used connection pooling over ODBC/JDBC.
>
> Happy to chat if you need more information.
>
> -Sachin Naik
>
> >>Hard to tell. Can you give more insights >>on what you try to achieve
> and what the data is about?
> >>For example, depending on your use case sqoop can make sense or not.
> Sent from my iPhone
>
> On Jan 27, 2017, at 11:22 PM, Russell Spitzer 
> wrote:
>
> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
> way back out (see the same link) and write directly to Oracle. I'll leave
> the performance questions for someone else.
>
> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu 
> wrote:
>
>>
>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu 
>> wrote:
>>
>> Hi Team,
>>
>> RIght now our existing flow is
>>
>> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive
>> Context)-->Destination Hive table -->sqoop export to Oracle
>>
>> Half of the Hive UDFS required is developed in Java UDF..
>>
>> SO Now I want to know if I run the native scala UDF's than runninng hive
>> java udfs in spark-sql will there be any performance difference
>>
>>
>> Can we skip the Sqoop Import and export part and
>>
>> Instead directly load data from oracle to spark and code Scala UDF's for
>> transformations and export output data back to oracle?
>>
>> RIght now the architecture we are using is
>>
>> oracle-->Sqoop (Import)-->Hive Tables--> Hive Queries --> Spark-SQL-->
>> Hive --> Oracle
>> what would be optimal architecture to process data from oracle using
>> spark ?? can i anyway better this process ?
>>
>>
>>
>>
>> Regards,
>> Sirisha
>>
>>
>>


Error Saving Dataframe to Hive with Spark 2.0.0

2017-01-29 Thread Chetan Khatri
Hello Spark Users,

I am getting error while saving Spark Dataframe to Hive Table:
Hive 1.2.1
Spark 2.0.0
Local environment.
Note: the job executes successfully and does what I want, but the exception
below is still raised.
*Source Code:*

package com.chetan.poc.hbase

/**
  * Created by chetan on 24/1/17.
  */
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.KeyValue.Type
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._
import java.util.Date
import java.text.SimpleDateFormat


object IncrementalJob {
val APP_NAME: String = "SparkHbaseJob"
var HBASE_DB_HOST: String = null
var HBASE_TABLE: String = null
var HBASE_COLUMN_FAMILY: String = null
var HIVE_DATA_WAREHOUSE: String = null
var HIVE_TABLE_NAME: String = null
  def main(args: Array[String]) {
// Initializing HBASE Configuration variables
HBASE_DB_HOST="127.0.0.1"
HBASE_TABLE="university"
HBASE_COLUMN_FAMILY="emp"
// Initializing Hive Metastore configuration
HIVE_DATA_WAREHOUSE = "/usr/local/hive/warehouse"
// Initializing Hive table name - Target table
HIVE_TABLE_NAME = "employees"
// setting spark application
// val sparkConf = new SparkConf().setAppName(APP_NAME).setMaster("local")
//initialize the spark context
//val sparkContext = new SparkContext(sparkConf)
//val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
// Enable Hive with Hive warehouse in SparkSession
val spark = SparkSession.builder()
  .appName(APP_NAME)
  .config("hive.metastore.warehouse.dir", HIVE_DATA_WAREHOUSE)
  .config("spark.sql.warehouse.dir", HIVE_DATA_WAREHOUSE)
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._
import spark.sql

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, HBASE_TABLE)
conf.set(TableInputFormat.SCAN_COLUMNS, HBASE_COLUMN_FAMILY)
// Load an RDD of (rowkey, result) i.e. (ImmutableBytesWritable, Result)
// tuples from the table
val hBaseRDD = spark.sparkContext.newAPIHadoopRDD(conf,
classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

println(hBaseRDD.count())
//hBaseRDD.foreach(println)

// keyValue is an RDD[java.util.List[hbase.KeyValue]]
val keyValue = hBaseRDD.map(x => x._2).map(_.list)

// outPut is a DataFrame of HBaseResult rows, one per HBase cell
val outPut = keyValue.flatMap(x =>  x.asScala.map(cell =>

  HBaseResult(
Bytes.toInt(CellUtil.cloneRow(cell)),
Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
cell.getTimestamp,
new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").format(new
Date(cell.getTimestamp.toLong)),
Bytes.toStringBinary(CellUtil.cloneValue(cell)),
Type.codeToType(cell.getTypeByte).toString
)
  )
).toDF()
// Output dataframe
outPut.show

// get timestamp
val datetimestamp_threshold = "2016-08-25 14:27:02:001"
val datetimestampformat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").parse(datetimestamp_threshold).getTime()

// Result set filtering based on timestamp
val filtered_output_timestamp = outPut.filter($"colDatetime" >=
datetimestampformat)
// Result set filtering based on rowkey
val filtered_output_row =
outPut.filter($"rowkey".between(1668493360, 1668493365))


// Saving Dataframe to Hive Table Successfully.
filtered_output_timestamp.write.mode("append").saveAsTable(HIVE_TABLE_NAME)
  }
  case class HBaseResult(rowkey: Int, colFamily: String, colQualifier:
String, colDatetime: Long, colDatetimeStr: String, colValue: String,
colType: String)
}


Error:

17/01/29 13:51:53 INFO metastore.HiveMetaStore: 0: create_database:
Database(name:default, description:default database,
locationUri:hdfs://localhost:9000/usr/local/hive/warehouse,
parameters:{})
17/01/29 13:51:53 INFO HiveMetaStore.audit:
ugi=hduser  ip=unknown-ip-addr  cmd=create_database:
Database(name:default, description:default database,
locationUri:hdfs://localhost:9000/usr/local/hive/warehouse,
parameters:{})
17/01/29 13:51:53 ERROR metastore.RetryingHMSHandler:
AlreadyExistsException(message:Database default already exists)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)