Re: fetching and joining data from two different clusters

2017-06-15 Thread Mich Talebzadeh
at this also? Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own

fetching and joining data from two different clusters

2017-06-15 Thread Mich Talebzadeh
on two different HDFS clusters? thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer

Re: Scala, Python or Java for Spark programming

2017-06-08 Thread Mich Talebzadeh
and compactness. I can write Spark streaming code in Scala pretty fast, or import a massive RDBMS table into Hive and a table of my design equally fast using Scala. I don't know, maybe I just cannot be bothered writing 100 lines of Java for a simple query from a table :) Dr Mich Talebzadeh LinkedIn * https

Scala, Python or Java for Spark programming

2017-06-07 Thread Mich Talebzadeh
. Hence I was wondering how much truth is there in this statement. Given that Spark uses Scala as its core development language, what is the general view on the use of Scala, Python or Java? Thanks, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Re: Edge Node in Spark

2017-06-07 Thread Mich Talebzadeh
process will be running on edge node. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disc

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread Mich Talebzadeh
at our future needs. From my experience of these tools, you cannot simply roll it back without incurring considerable work and considerable cost. And after all will the cost justify the whole of this setup? How about performance and other bottlenecks? Thanks Dr Mich Talebzadeh LinkedIn

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread Mich Talebzadeh
Hi John, Thanks. Did you end up in production, or in other words, besides the PoC, did you use it in anger? The intention is to build Isilon on top of the whole HDFS cluster! If we go that way we also need to adopt it for DR. Cheers Dr Mich Talebzadeh LinkedIn * https

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread Mich Talebzadeh
proof of such tools. So I was wondering if anyone else has tried such solution. Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread Mich Talebzadeh
As a matter of interest what is the best way of creating virtualised clusters all pointing to the same physical data? thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread Mich Talebzadeh
can build virtual clusters on the same data. One cluster for reads/writes and another for reads? That is what has been suggested! regards Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread Mich Talebzadeh
anyone is using this product in anger. At the end of the day it's not HDFS. It is OneFS with a HCFS API. However that may suit our needs. But would need to PoC it and test it thoroughly! Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

An Architecture question on the use of virtualised clusters

2017-05-31 Thread Mich Talebzadeh
data to be in one place regardless of artefacts used against it such as Spark? Thanks, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

Re: Dynamically working out upperbound in JDBC connection to Oracle DB

2017-05-29 Thread Mich Talebzadeh
tring,Comparable[_ >: java.math.BigDecimal with String <: Comparable[_ >: java.math.BigDecimal with String <: java.io.Serializable] with java.io.Serializable] with java.io.Serializable]) val s = HiveContext.read.format("jdbc").options( Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com

Dynamically working out upperbound in JDBC connection to Oracle DB

2017-05-29 Thread Mich Talebzadeh
Hi, This JDBC connection works with Oracle table with primary key ID val s = HiveContext.read.format("jdbc").options( Map("url" -> _ORACLEserver, "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)", "partitionColumn" -> "ID",

Re: Upgrade the scala code using the most updated Spark version

2017-03-27 Thread Mich Talebzadeh
% "2.6.2" libraryDependencies += "org.apache.phoenix" % "phoenix-spark" % "4.6.0-HBase-1.0" libraryDependencies += "org.apache.hbase" % "hbase" % "1.2.3" libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.

Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread Mich Talebzadeh
Thanks all. How about Kafka HA which is important. Is it best to use application specific Kafka delivery or Kafka MirrorMaker? Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread Mich Talebzadeh
for Kafka for use with Spark Streaming? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disc

instrumenting Spark hit ratios

2017-02-25 Thread Mich Talebzadeh
not using UNIX tools such as Nagios etc, are there tools that can be deployed for the Spark cluster itself? I guess top/htop can be used but those are available anyway. Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <ht

Re: spark architecture question -- Pleas Read

2017-02-05 Thread Mich Talebzadeh
agreed. The best option is to ingest to ingesting tables in Oracle. Many people ingest into main Oracle table which is wrong design in my opinion. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/prof

Spark streaming: Could not initialize class kafka.consumer.FetchRequestAndResponseStatsRegistry$

2017-02-04 Thread Mich Talebzadeh
211) at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484) ... 74 elided Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>

Re: spark architecture question -- Pleas Read

2017-02-04 Thread Mich Talebzadeh
Ingesting from Hive tables back into Oracle. What mechanisms are in place to ensure that data ends up consistently into Oracle table and Spark is notified when Oracle has issues with data ingested (say rollback)? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Re: Tableau BI on Spark SQL

2017-01-30 Thread Mich Talebzadeh
Thanks Jorn, So Tableau uses its own in-memory representation as I guessed. Now the question is how is performance accessing data in Oracle tables? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/p

Tableau BI on Spark SQL

2017-01-30 Thread Mich Talebzadeh
htm>containing star schema and create and ingest the same tables and data into Hive tables. Then run Tableau against these tables and do the performance comparison. Given that Oracle is widely used with Tableau this test makes sense? Thanks. Dr Mich Talebzadeh LinkedIn * https://www.linked

Re: Having multiple spark context

2017-01-30 Thread Mich Talebzadeh
in general in a single JVM which is basically running in Local mode, you have only one Spark Context. However, you can stop the current Spark Context by sc.stop() HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <ht

Re: spark architecture question -- Pleas Read

2017-01-29 Thread Mich Talebzadeh
files have to reside in HDFS HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it

Re: spark architecture question -- Pleas Read

2017-01-29 Thread Mich Talebzadeh
you can use Spark directly on csv file. 1. Put the csv files into HDFS /apps//data/staging/ 2. Multiple csv files for the same table can co-exist 3. like df1 = spark.read.option("header", false).csv(location) 4. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.c

Re: spark architecture question -- Pleas Read

2017-01-29 Thread Mich Talebzadeh
end to do all this via shell script that gives control at each layer and creates alarms. HTH 1. 2. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP

issue with running Spark streaming with spark-shell

2017-01-28 Thread Mich Talebzadeh
r file or something? thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at y

Re: Fw: Yarn resource management for Spark with IBM Platform Symphony

2017-01-19 Thread Mich Talebzadeh
Thanks Kuan for insight. Much appreciated. Mich Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpre

Yarn resource management for Spark with IBM Platform Symphony

2017-01-19 Thread Mich Talebzadeh
n your Spark jobs including HA failover using Platform Symphony ha. Has anyone had any experience of using Yarn with IBM Platform Symphony at all including Proof of Concept? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

IBM Fluid query versus Spark

2017-01-04 Thread Mich Talebzadeh
Hi, Has anyone had any experience of using IBM Fluid query and comparing it with Spark with its MPP and in-memory capabilities? Thanks, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

spark sql in Cloudera package

2017-01-04 Thread Mich Talebzadeh
Sounds like Cloudera do not supply the shell for spark-sql but only spark-shell is that correct? I appreciate that one can use spark-shell. however, sounds like spark-sql is excluded in favour of Impala? cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Re: From Hive to Spark, what is the default database/table

2016-12-31 Thread Mich Talebzadeh
ight201601; show tables; HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it

Reading specific column family and columns in Hbase table through spark

2016-12-29 Thread Mich Talebzadeh
lue()).toString, Bytes.toString( iter.next().getValue()).toString, Bytes.toString(iter.next().getValue()) )} The above reads the column family columns sequentially. How can I force it to read specific columns only? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.

Re: Location for the additional jar files in Spark

2016-12-27 Thread Mich Talebzadeh
tasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149) at org.apache.spark.sql.DataFrameReader.load

Re: Location for the additional jar files in Spark

2016-12-27 Thread Mich Talebzadeh
ationProvider.scala:53) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) ... 56 elided Dr Mich Ta

Re: Location for the additional jar files in Spark

2016-12-27 Thread Mich Talebzadeh
Ok just to be clear do you mean ADD_JARS="~/jars/ojdbc6.jar" spark-shell or spark-shell --jars $ADD_JARS Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/pr
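One shell detail is worth checking in the first form above: a tilde inside double quotes is not expanded, so ADD_JARS="~/jars/ojdbc6.jar" holds a literal ~ rather than the home directory. A minimal sketch, assuming the jar lives under the user's home directory (the variable names are illustrative and not from the thread):

```shell
#!/bin/sh
# A tilde inside double quotes is NOT expanded by the shell,
# so this variable contains the two characters "~/" literally.
quoted_tilde="~/jars/ojdbc6.jar"

# $HOME is expanded inside double quotes, giving a usable path.
with_home="$HOME/jars/ojdbc6.jar"

printf 'quoted tilde: %s\n' "$quoted_tilde"
printf 'with $HOME:   %s\n' "$with_home"

# With Spark installed, the expanded path could then be passed as:
#   spark-shell --jars "$with_home"
```

Whether ADD_JARS is honoured at all depends on the Spark version in use; the --jars option of spark-shell is the documented route.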

Re: Location for the additional jar files in Spark

2016-12-27 Thread Mich Talebzadeh
at the shell file "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@" hm Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https:/

Location for the additional jar files in Spark

2016-12-27 Thread Mich Talebzadeh
When one runs in Local mode (one JVM) on an edge host (the host user accesses the cluster), it is possible to put additional jar file say accessing Oracle RDBMS tables in $SPARK_CLASSPATH. This works export SPARK_CLASSPATH=~/user_jars/ojdbc6.jar Normally a group of users can have read access to

Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-22 Thread Mich Talebzadeh
ojdbc6.jar. FYI the Oracle database version accessed is 11g, R2 Also it is a challenge in a multi-talented cluster to maintain multiple versions of jars for the same database type through $SPARK_HOME/conf/ spark-defaults.conf! HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com

Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-21 Thread Mich Talebzadeh
thanks Ayan, do you mean "driver" -> "oracle.jdbc.OracleDriver" we added that one but did not work! Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.li

Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-21 Thread Mich Talebzadeh
dbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) Any ideas?

Financial fraud detection using streaming RDBMS data into Spark & Hbase

2016-12-15 Thread Mich Talebzadeh
oughts? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk.

Re: spark on yarn can't load kafka dependency jar

2016-12-15 Thread Mich Talebzadeh
try this it should work and yes they are comma separated spark-streaming-kafka_2.10-1.5.1.jar Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV
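A hedged sketch of building the comma-separated list that --jars expects, assuming all the jars sit in one directory. The temporary directory and another-dependency.jar are hypothetical stand-ins created only for illustration; the Kafka jar name is the one from the message above.

```shell
#!/bin/sh
# Hypothetical directory holding the dependency jars (empty placeholder files here).
workdir=$(mktemp -d)
touch "$workdir/spark-streaming-kafka_2.10-1.5.1.jar" "$workdir/another-dependency.jar"

# Join every jar path with commas; --jars takes a comma-separated list, not spaces or colons.
jars=""
for f in "$workdir"/*.jar; do
  jars="${jars:+$jars,}$f"
done

printf '%s\n' "$jars"

# With Spark installed, the list could then be passed as:
#   spark-submit --jars "$jars"   (followed by the usual class and application jar arguments)
```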

Re: Cached Tables SQL Performance Worse than Uncached

2016-12-15 Thread Mich Talebzadeh
How many tables are involved in the SQL join and how do you cache them? If you do unpersist on the DF(s) and run the same SQL query (in the same session), what do you see with explain? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Re: Cached Tables SQL Performance Worse than Uncached

2016-12-15 Thread Mich Talebzadeh
How many tables are involved in the SQL join and how do you cache them? If you do unpersist on the DF and run the same Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: Design patterns for Spark implementation

2016-12-10 Thread Mich Talebzadeh
/views in Spark can be used or Spark functional programming with Scala. Also the performance of JDBC matters. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: Design patterns for Spark implementation

2016-12-08 Thread Mich Talebzadeh
you JDBC connection to RDBMS table and you will need to have a primary key on the table. I am going to test it to see how performant it is to offer Spark as a fast query engine for RDBMS. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Re: Livy with Spark

2016-12-06 Thread Mich Talebzadeh
are assigned to users. Is that done by YARN? 2. What will happen if more than one Livy is running on the same cluster, all controlled by the same YARN? How are resources allocated? cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Re: Livy with Spark

2016-12-05 Thread Mich Talebzadeh
Thanks Richard for the link. Also its interaction with Zeppelin is great. I believe it is a very early stage for now Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Livy with Spark

2016-12-05 Thread Mich Talebzadeh
Hi, Has there been any experience using Livy with Spark to share multiple Spark contexts? thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV

Re: Access multiple cluster

2016-12-04 Thread Mich Talebzadeh
The only way I can think of would be accessing Hive tables through their respective thrift servers running on different clusters, but I am not sure you can do it within Spark. Basically two different JDBC connections. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Parquet timestamp storage in Hive and possible use case of spark instead of impala

2016-12-03 Thread Mich Talebzadeh
on Hive. it will always get the same values as stored by Hive Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpre

Re: Kafka 0.10 & Spark Streaming 2.0.2

2016-12-02 Thread Mich Talebzadeh
in this POC of yours are you running this app with spark in Local mode by any chance? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

Re: Flume integration

2016-11-21 Thread Mich Talebzadeh
Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk. Any a

Re: Flume integration

2016-11-20 Thread Mich Talebzadeh
Thanks Ian. Was your source of Flume IBM/MQ by any chance? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

Re: Flume integration

2016-11-20 Thread Mich Talebzadeh
Hi Ian, Has this been resolved? How about data to Flume and then Kafka and Kafka streaming into Spark? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Using Flume as Input Stream to Spark

2016-11-20 Thread Mich Talebzadeh
was wondering if this is a tried and tested one as opposed to an experimental one? For example this Spark doc <http://spark.apache.org/docs/latest/streaming-flume-integration.html> talks about Flume integration. Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/v

Successful streaming with ibm/ mq to flume then to kafka and finally spark streaming

2016-11-18 Thread Mich Talebzadeh
hi, can someone share their experience of feeding data from ibm/mq messages into flume, then from flume to kafka and using spark streaming on it? any issues and things to be aware of? thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

analysing ibm mq messages using spark streaming

2016-11-17 Thread Mich Talebzadeh
hi, I guess the only way to do this is to read ibm mq messages into flume, ingest it into hdfs and read it from there. alternatively use flume to ingest data into hbase and then use spark on hbase. I don't think there is an api like spark streaming with kafka for ibm mq? thanks Dr Mich

Re: Handling windows characters with Spark CSV on Linux

2016-11-17 Thread Mich Talebzadeh
Thanks Ayan. That only works for extra characters like ^ characters etc. Unfortunately it does not cure specific character sets. cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Handling windows characters with Spark CSV on Linux

2016-11-17 Thread Mich Talebzadeh
this as well? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your ow

Running stress tests on spark cluster to avoid wild-goose chase later

2016-11-15 Thread Mich Talebzadeh
through some tests cycles. We have some ideas but appreciate some other feedbacks. The current version is CHDS 5.2. Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: Possible DR solution

2016-11-12 Thread Mich Talebzadeh
*up-to-date* the data in the replicate site is going to be. Bottom line, how good is it to deploy such a tool given the cost? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: Possible DR solution

2016-11-12 Thread Mich Talebzadeh
data from London to Singapore. It can become a nightmare. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpre

Re: Possible DR solution

2016-11-12 Thread Mich Talebzadeh
e cluster or more likely between data centers). > > As you mentioned, Hbase & Co may require a special consideration for the > case that data is in-memory and not yet persisted. > > On Sat, Nov 12, 2016 at 12:04 PM, Mich Talebzadeh < > mich.talebza...@gmail.c

Re: Possible DR solution

2016-11-12 Thread Mich Talebzadeh
thanks Vince can you provide more details on this pls Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpre

Re: Possible DR solution

2016-11-11 Thread Mich Talebzadeh
I really don't see why one wants to set up streaming replication unless for situations where similar functionality to transactional databases is required in big data? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <ht

Re: Possible DR solution

2016-11-11 Thread Mich Talebzadeh
of it. streaming replication as opposed to snapshot. sounds familiar. think of it as log shipping in oracle old days versus goldengate etc. hth Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: Possible DR solution

2016-11-11 Thread Mich Talebzadeh
reason being ? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own ris

Re: Possible DR solution

2016-11-11 Thread Mich Talebzadeh
starts at $4,000 per node per year all inclusive. With discount it can be halved but we are talking a node itself so if you have 5 nodes in primary and 5 nodes in DR we are talking about $40K already. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Possible DR solution

2016-11-11 Thread Mich Talebzadeh
e site. The idea is that it is faster than doing it through traditional HDFS copy tools, which are normally batch oriented. It also claims to replicate Hive metadata. I wanted to gauge if anyone has used it or a competitor product. The claim is that they do not have competitors! Thanks D

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-10 Thread Mich Talebzadeh
into target tables in Hive periodically. I will still go for ORC tables. Data will be append only. That is my conclusion, but I am still open to suggestions. Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <ht

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Mich Talebzadeh
schema I believe the above is feasible? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Mich Talebzadeh
Mich Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk. Any a

importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Mich Talebzadeh
generic alternative? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at yo

Re: Spark dataset cache vs tempview

2016-11-06 Thread Mich Talebzadeh
> df.toDF.createOrReplaceTempView("tmp") scala> spark.sql("drop view if exists tmp") Check UI (port 4040) storage page to see what is cached etc. Just try either options to see which one is more optimum. Option 2 may be more optimum. HTH Dr Mich Talebzadeh LinkedIn * https://www.li

Re: Add jar files on classpath when submitting tasks to Spark

2016-11-02 Thread Mich Talebzadeh
r, and any additional classpath specified # *through spark.driver.extraClassPath is not automatically propagated.* Whether this is relevant or not I am not sure HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV

Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Mich Talebzadeh
ntent is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On 1 November 2016 at 14:02, Jan Botorek <jan.boto...@infor.com> wrote: > Yes, exactly. > My (testing) run script is: > > spark-

Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Mich Talebzadeh
Are you submitting your job through spark-submit? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpre

Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Mich Talebzadeh
that directory. The other alternative is to mount the shared directory as an NFS mount across all the nodes so that all the nodes can read from that shared directory HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <ht

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Mich Talebzadeh
moved when the session ends or table is dropped Not sure how Spark handles this. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Mich Talebzadeh
uster in which it was created. The data is stored using Hive's highly-optimized, in-memory columnar format." So on the face of it tempTable is an in-memory table HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Mich Talebzadeh
ark.sql.DataFrame = [] Also your point "But the thing is that I don't explicitly cache the tempTables ..". I believe tempTable is created in-memory and is already cached HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Mich Talebzadeh
this in UI storage page. Alternative is to use persist(StorageLevel.MEMORY_AND_DISK_SER()) with a mix of cached and disk. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Mich Talebzadeh
know. Have you tried it using predicate push-down on the underlying table itself? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

Re: Happy Diwali to those forum members who celebrate this great festival

2016-10-30 Thread Mich Talebzadeh
I can hear and see plenty of firework in this foggy London tonight :) Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

Happy Diwali to those forum members who celebrate this great festival

2016-10-30 Thread Mich Talebzadeh
Enjoy the festive season. Regards, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer

Re: Sharing RDDS across applications and users

2016-10-28 Thread Mich Talebzadeh
Hi, I think tempTable is private to the session that creates it. In Hive temp tables created by "CREATE TEMPORARY TABLE" are all private to the session. Spark is no different. The alternative may be everyone creates tempTable from the same DF? HTH Dr Mich Talebzadeh LinkedI

Re: Sharing RDDS across applications and users

2016-10-28 Thread Mich Talebzadeh
with in-memory storage where app 2 can pick up app1 results from memory or even SSD and do the work. Actually I am surprised why Spark has not incorporated this type of memory as temporary storage. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
so I assume Ignite will not work with Spark version >=2? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.word

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
. For example the same tempTable etc? Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disc

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
Thanks Chanh, Can it share RDDs. Personally I have not used either Alluxio or Ignite. 1. Are there major differences between these two 2. Have you tried Alluxio for sharing Spark RDDs and if so do you have any experience you can kindly share Regards Dr Mich Talebzadeh LinkedIn

Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
with something like Apache Ignite. Has anyone really tried this. Will that work with multiple applications? It looks feasible as RDDs are immutable and so are registered tempTables etc. Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Re: HiveContext is Serialized?

2016-10-26 Thread Mich Talebzadeh
" Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own r

Re: HiveContext is Serialized?

2016-10-26 Thread Mich Talebzadeh
of translating object state into a format that can be stored and retrieved from memory buffer? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV

Re: HiveContext is Serialized?

2016-10-26 Thread Mich Talebzadeh
uted among executors? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it

Getting only results out of Spark Shell

2016-10-25 Thread Mich Talebzadeh
= AverageDailyPrice: double 328.0 327.13 325.63 I can do it in shell but there must be a way of running the commands silently? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: Passing command line arguments to Spark-shell in Spark 2.0.1

2016-10-25 Thread Mich Talebzadeh
Hi, The correct way of doing it for a String argument is using echo ' ' passing the string directly as below spark-shell -i <(echo 'val ticker = "tsco"' ; cat stock.scala) Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOAB
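The quoting point in the message above can be sketched without Spark: the preamble line must reach the Scala REPL with its inner double quotes intact, otherwise Scala sees a bare identifier instead of a String literal. This sketch writes the combined script to a temporary file rather than using bash process substitution, so that it runs under plain sh; the stand-in contents of stock.scala and the variable names are assumptions for illustration only.

```shell
#!/bin/sh
# Value taken from the thread's example; TICKER itself is an illustrative variable.
TICKER="tsco"

# Hypothetical stand-in for stock.scala, just enough to show the ordering.
workdir=$(mktemp -d)
printf '%s\n' 'println(ticker)' > "$workdir/stock.scala"

# Unquoted form: Scala would read `val ticker = tsco`, an undefined identifier.
unquoted="val ticker = $TICKER"

# Quoted form: the inner double quotes survive, so Scala reads a String literal.
quoted="val ticker = \"$TICKER\""

# Preamble first, then the script body, as in the <(echo ... ; cat stock.scala) form.
{ printf '%s\n' "$quoted"; cat "$workdir/stock.scala"; } > "$workdir/combined.scala"
cat "$workdir/combined.scala"

# With Spark installed, the combined file could then be loaded with:
#   spark-shell -i "$workdir/combined.scala"
```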

Passing command line arguments to Spark-shell in Spark 2.0.1

2016-10-25 Thread Mich Talebzadeh
<(echo val ticker = $TICKER ; cat ) as described here <http://stackoverflow.com/questions/29928999/passing-command-line-arguments-to-spark-shell> Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <
