Re: Generate random numbers from Normal Distribution with Specific Mean and Variance

2016-10-24 Thread Mich Talebzadeh
thanks Jorn. I wish we had these libraries somewhere :) …

Generate random numbers from Normal Distribution with Specific Mean and Variance

2016-10-24 Thread Mich Talebzadeh
…www.mathworks.com> Thanks …

Accessing Phoenix table from Spark 2.0, any cure!

2016-10-24 Thread Mich Talebzadeh
…at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:40)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at org.apache.spark.sql.SQLContext.load(SQLCon…

Re: JAVA heap space issue

2016-10-24 Thread Mich Talebzadeh
OK, so you are disabling broadcasting, although it is not obvious how this helps in this case!

Re: JAVA heap space issue

2016-10-24 Thread Mich Talebzadeh
OK, so what is your full launch code now? I mean the equivalent of spark-submit. …

Re: JAVA heap space issue

2016-10-24 Thread Mich Talebzadeh
… \ --executor-memory 2G \ --master spark://IPAddress:7077 \ … HTH
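For reference, a minimal spark-submit launch along the lines of the flags quoted above. This is a sketch: the class name, jar name and master address are illustrative placeholders, not values from the thread.

    spark-submit \
      --master spark://IPAddress:7077 \
      --driver-memory 2G \
      --executor-memory 2G \
      --class MyApp \
      myapp.jar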

Re: JAVA heap space issue

2016-10-24 Thread Mich Talebzadeh
…

Re: LIMIT issue of SparkSQL

2016-10-24 Thread Mich Talebzadeh
…query by LIMIT on each underlying table does not make sense and would not be industry standard AFAIK. HTH

Re: issue accessing Phoenix table from Spark

2016-10-21 Thread Mich Talebzadeh
Still does not work with Spark 2.0.0 on apache-phoenix-4.8.1-HBase-1.2-bin. Thanks

Re: jdbcRDD for data ingestion from RDBMS

2016-10-18 Thread Mich Talebzadeh
…st and put it into HDFS, and then you can access it through Hive external tables etc. A real-time load of data using Spark JDBC makes sense if the RDBMS table itself is pretty small; most dimension tables should satisfy this. This approach is not advisable for FACT tables. HTH

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-18 Thread Mich Talebzadeh
…environment.

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Mich Talebzadeh
…4. There is a Hive managed table with added optimisation/indexing (ORC). There are a number of ways of doing it, as usual. Thanks

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Mich Talebzadeh
I assume that HBase is more of a columnar data store, by virtue of storing column data together; many interpretations of this are floating around. However, it is not columnar in the sense of a column-based (as opposed to row-based) implementation of the relational model.

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Mich Talebzadeh
…have an in-memory database (LLAP) so we can cache Hive tables in memory. That will be faster. Many people underestimate Hive, but I still believe it has a lot to offer besides serious ANSI-compliant SQL. Regards, Mich

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Mich Talebzadeh
Ben, also look at Phoenix (an Apache project), which provides one of the best SQL/JDBC layers on top of HBase: http://phoenix.apache.org/. I am afraid this does not work with Spark 2!

Re: Indexing w spark joins?

2016-10-17 Thread Mich Talebzadeh
…manufacturer, model and color". How about using some analytic and windowing functions here? Spark supports all sorts of analytic functions. HTH

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Mich Talebzadeh
…
|2016-10-16T18:44:57| S18|74.10128|
|2016-10-16T18:44:57| S07|66.13622|
|2016-10-16T18:44:57| S20|60.35727|
+-------------------+----+--------+
only showing top 10 rows

Accessing Hbase tables through Spark, this seems to work

2016-10-16 Thread Mich Talebzadeh
…74.10128|
|2016-10-16T18:44:57| S07|66.13622|
|2016-10-16T18:44:57| S20|60.35727|
+-------------------+----+--------+
only showing top 10 rows
Is this a workable solution? Thanks

Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Mich Talebzadeh
add --jars /spark-streaming-kafka_2.10-1.5.1.jar (you may need to download the jar file, or any newer version) to spark-shell. I also have spark-streaming-kafka-assembly_2.10-1.6.1.jar on the --jars list. HTH
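A hedged example of the resulting launch; the directory /home/hduser/jars is an assumed location (it appears elsewhere in these threads), not part of the original message:

    spark-shell --jars /home/hduser/jars/spark-streaming-kafka_2.10-1.5.1.jar,/home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar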

Design consideration for a trading System

2016-10-10 Thread Mich Talebzadeh
…directly for Spark 2 is not available, and even if we did it, SQL skins for visualisation tools are better. Sorry about this long monologue. I appreciate any feedback.

converting hBaseRDD to DataFrame

2016-10-10 Thread Mich Talebzadeh
…URLClassLoader@7b44e98e Thanks

Re: Spark Streaming Advice

2016-10-10 Thread Mich Talebzadeh
Hi Kevin, what is the streaming interval (batch interval) above? I do analytics on streaming trade data, but after manipulation of individual messages I store the selected ones in HBase. Very fast. HTH

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

2016-10-02 Thread Mich Talebzadeh
Thanks Ben. The thing is, I am using Spark 2 and no stack from CDH! Is this approach to reading/writing HBase specific to Cloudera?

Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

2016-10-01 Thread Mich Talebzadeh
…,/home/hduser/jars/hbase-common-1.2.3.jar,/home/hduser/jars/hbase-protocol-1.2.3.jar,/home/hduser/jars/htrace-core-3.0.4.jar,/home/hduser/jars/hive-hbase-handler-2.1.0.jar' So any ideas will be appreciated. Thanks

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-09-30 Thread Mich Talebzadeh
What will happen if you LIMIT the result set to 100 rows only: SELECT ... FROM ... ORDER BY field LIMIT 100. Will that work? How about running the whole query WITHOUT the ORDER BY? HTH
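A sketch of the two diagnostic runs suggested above, assuming the data is registered as a temp view named t with a sort column named field (both names are illustrative):

    // bounded result set with the sort
    spark.sql("SELECT * FROM t ORDER BY field LIMIT 100").show()
    // the same query without the sort, to isolate the ORDER BY cost
    spark.sql("SELECT * FROM t").show()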

Design considerations for batch and speed layers

2016-09-30 Thread Mich Talebzadeh
I have designed this prototype for a risk business. Here I would like to discuss issues with the batch layer. Apologies about being long-winded. Business objective: reduce risk in the credit business while making better credit and trading decisions. Specifically, to identify risk trends within…

Re: SPARK CREATING EXTERNAL TABLE

2016-09-30 Thread Mich Talebzadeh
…org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params: …
Defined as external. HTH

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
…files, unless the compaction is done (a nightmare). HTH

Re: Treating NaN fields in Spark

2016-09-29 Thread Mich Talebzadeh
…
|Tesco PLC| TSCO| 26-Aug-11| -| -| -|365.60| 0|
|Tesco PLC| TSCO| 28-Apr-11| -| -| -|403.55| 0|
|Tesco PLC| TSCO| 21-Apr-11| -| -| -|395.30| 0|
|Tesco PLC| TSCO| 24-Dec-10| -| -| -|439.00| 0|
+---------+------+----------+----+----+---+------+------+

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
…

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Hi Ali, What is the business use case for this?

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
…as well, and from Flume to HBase. I would have thought that if one wanted to do real-time analytics with SS (Spark Streaming), then that would be a good fit with a real-time dashboard. What is not so clear is the business use case for this. HTH

Re: Treating NaN fields in Spark

2016-09-29 Thread Mich Talebzadeh
…cannot be directly destructured in method or function parameters. Either create a single parameter accepting the Tuple1, or consider a pattern-matching anonymous function: { case (param1, param2) => ... } val rs = df2.filter(isAllPostiveNumber("Open") => true) Thanks
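A minimal sketch of one way out, assuming the column is a string named Open and that isAllPostiveNumber is meant to test for a positive numeric value. The UDF below is a hypothetical reimplementation, not the code from the thread:

    import org.apache.spark.sql.functions.{col, udf}
    // true when the string parses as a number greater than zero
    val isPositiveNumber = udf((s: String) => scala.util.Try(s.toDouble).toOption.exists(_ > 0))
    val rs = df2.filter(isPositiveNumber(col("Open")))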

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Kafka into Spark Streaming, and that will be online or near real time (defined by your window). Then you have a serving layer to present data from both the speed layer (the one from SS) and the batch layer. HTH

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
…Tableau or Zeppelin to query data. You will also need Spark Streaming to query data online for the speed layer. That data could be stored in some transient fabric like Ignite or even Druid. HTH

Re: Issue with rogue data in csv file used in Spark application

2016-09-28 Thread Mich Talebzadeh
; spark.sql("SELECT cast(value as FLOAT) from lines").show() > > +-+ > |value| > +-+ > | null| > | 1.0 | > | null| > | 8.6 | > +-+ > > After it you may filter the DataFrame for values containing null. > > Regards, > -- > Bedrytski

Treating NaN fields in Spark

2016-09-28 Thread Mich Talebzadeh
…show
+-----+------+---------+----+----+---+-----+------+
|Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
+-----+------+---------+----+----+---+-----+------+
+-----+------+---------+----+----+---+-----+------+
Any suggestions? Thanks

Re: Issue with rogue data in csv file used in Spark application

2016-09-28 Thread Mich Talebzadeh
On 28 September 2016 at 04:07, Mike Metzger <m...@flexiblecreations.com> wrote:
> Hi Mich -
> Can you run a filter command on df1 prior to your map for any rows
> where p(3).toString != '-' then run y…

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mich Talebzadeh
…one check for rogue data in p(3)? Thanks

Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mich Talebzadeh
…Alternatively, do the clean-up before putting the csv in HDFS, but that becomes tedious and error-prone. Any ideas will be appreciated.

Re: read multiple files

2016-09-27 Thread Mich Talebzadeh
…it in a shell script. HTH

Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Mich Talebzadeh
…then be used for Complex Event Processing. HTH

Re: Is executor computing time affected by network latency?

2016-09-23 Thread Mich Talebzadeh
… HTH

Re: How to specify file

2016-09-23 Thread Mich Talebzadeh
You can do the following with option("delimiter"): val df = spark.read.option("header", false).option("delimiter","\t").csv("hdfs://rhes564:9000/tmp/nw_10124772.tsv") HTH

Re: sqoop Import and HBase ImportTsv issue with Failed: No enum constant mapreduce.JobCounter.MB_MILLIS_MAPS

2016-09-22 Thread Mich Talebzadeh
…trust that this explains it. Thanks

Re: Spark RDD and Memory

2016-09-22 Thread Mich Talebzadeh
…of Spark deploys a Least Recently Used (LRU) mechanism to flush unused data out of memory, much like RDBMS cache management. I know LLAP does that. HTH

sqoop Import and HBase ImportTsv issue with Failed: No enum constant mapreduce.JobCounter.MB_MILLIS_MAPS

2016-09-22 Thread Mich Talebzadeh
…how can I investigate further? I have attached the jar file. Thanks

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
…run(ToolRunner.java:84)
at org.apache.hadoop.hbase.mapreduce.ImportTsv.main(ImportTsv.java:684)

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
…: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
…,977 [myid:] - INFO [main:Job@1356] - Job job_1474455325627_0041 completed successfully
2016-09-21 19:11:15,138 [myid:] - ERROR [main:ImportTool@607] - Imported Failed: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
Any ideas? Thanks

Re: SPARK PERFORMANCE TUNING

2016-09-21 Thread Mich Talebzadeh
LOL, I think we should try the crystal ball to answer this question.

Re: driver OOM - need recommended memory for driver

2016-09-19 Thread Mich Talebzadeh
If you make your driver memory too low, it is likely you are going to hit an OOM error. You have not mentioned which Spark mode you are using (Local, Standalone, YARN, etc.). HTH

Anyone used Zoomdata visual dashboard with Spark

2016-09-19 Thread Mich Talebzadeh
Hi, Zoomdata <http://www.zoomdata.com/product/> is known to be a good tool for real-time dashboards. I am trying to have a look. Has anyone experienced it with Spark by any chance? https://demo.zoomdata.com/zoomdata/login Thanks

Re: Spark Job not failing

2016-09-19 Thread Mich Talebzadeh
I am not sure a commit or rollback by the RDBMS is acknowledged by Spark; hence it does not know what is going on. From my recollection this is an issue. The other alternative is to save it as a csv file and load it into the RDBMS using a form of bulk copy, as sketched below. HTH
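A sketch of the CSV-then-bulk-copy route; the output path and options are illustrative, not from the thread:

    // write the DataFrame out as CSV on HDFS
    df.write.mode("overwrite").option("header", "true").csv("hdfs://namenode:9000/tmp/rdbms_export")
    // then load the files with the RDBMS bulk utility (e.g. bcp or sqlldr)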

Re: Spark Job not failing

2016-09-19 Thread Mich Talebzadeh
As I understand it, you are inserting into an RDBMS from Spark and the insert is failing on the RDBMS side due to a duplicate primary key, but this is not acknowledged by Spark? Is this correct? HTH

Re: Finding unique across all columns in dataset

2016-09-19 Thread Mich Talebzadeh
something like this: df.filter('transactiontype > " ").filter(not('transactiontype === "DEB") && not('transactiontype === "BGC")).select('transactiontype).distinct.collect.foreach(println) HTH

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-19 Thread Mich Talebzadeh
Spark UI on port 4040 by default. HTH

Re: DataFrame defined within conditional IF ELSE statement

2016-09-18 Thread Mich Talebzadeh
.marketDataParquet").select('TIMECREATED,'SECURITY,'PRICE) df2 } case _ => { println ("No valid option provide") sys.exit } } For one reason or other the following case _ => sys.err(“no valid option provided”) Threw error! Dr Mich Talebzadeh L

Re: DataFrame defined within conditional IF ELSE statement

2016-09-18 Thread Mich Talebzadeh
any opinion on this please?

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Mich Talebzadeh
…layer. Cheers

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Mich Talebzadeh
…with a different port. Then of course one has to think about an adequate response in a concurrent environment. Cheers

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
…

DataFrame defined within conditional IF ELSE statement

2016-09-17 Thread Mich Talebzadeh
If I try to do df2.printSchema OUTSIDE of the loop, it comes back with an error: scala> df2.printSchema :31: error: not found: value df2. I can define a stub df2 before the IF ELSE statement. Is that the best way of dealing with it? Thanks
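An idiomatic alternative, sketched from the snippet quoted in the earlier message in this thread: make the IF ELSE a single expression and bind its result, so df2 is visible afterwards. Here option, spark and the table name are assumed from the thread:

    import spark.implicits._
    val df2 = option match {
      case 1 =>
        spark.table("test.marketDataParquet").select('TIMECREATED, 'SECURITY, 'PRICE)
      case _ =>
        println("No valid option provided"); sys.exit(1)
    }
    df2.printSchema   // now resolves, since df2 is bound in the enclosing scope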

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
Thanks Todd. I will have a look. Regards

Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
…o.org/>. Ideally I would like to utilise a concept like TimesTen. Can one distribute Hive table data (or any table data) across the nodes, cached? In that case we will be doing logical IO, which is about 20 times or more lighter compared to physical IO. Anyway, this is the concept. Thanks…

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread Mich Talebzadeh
…means that the other side has initiated a connection close, but the application on the local side has not yet closed the socket. Normally it should be LISTEN or ESTABLISHED. HTH

Re: Can not control bucket files number if it was specified

2016-09-17 Thread Mich Talebzadeh
….parquet("test.sales6") It may work. HTH

Re: Can not control bucket files number if it was specified

2016-09-17 Thread Mich Talebzadeh
It is difficult to guess what is happening with your data. First, when you say you use Spark to generate test data, are these selected randomly and then stored in a Hive/etc. table? HTH

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Mich Talebzadeh
Is your Hive Thrift Server up and running on port 10001 (the jdbc:hive2:// port)? Do the following: netstat -alnp | grep 10001 and see whether it is actually running. HTH

Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Mich Talebzadeh
…to consider all options. Thanks

Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Mich Talebzadeh
any ideas on this?

Re: Using Zeppelin with Spark FP

2016-09-15 Thread Mich Talebzadeh
Where are you reading data from, Chanh?

Best way to present data collected by Flume through Spark

2016-09-15 Thread Mich Talebzadeh
…for the users as well. I was wondering what would be the best strategy here: Druid, Hive, others? The business case here is that users may want to access older data, so a database of some sort would be a better solution? In all likelihood they want a week's data. Thanks

Re: Reading the most recent text files created by Spark streaming

2016-09-15 Thread Mich Talebzadeh
Yes, thanks. I already had Flume for Twitter, so I configured it to get data from a Kafka source and post it to HDFS. Cheers

Re: Using Zeppelin with Spark FP

2016-09-15 Thread Mich Talebzadeh
Thanks Chanh, I noticed one thing. If you put on a cron refresh, say every 30 seconds, after a while the job crashes with an OOM error. Then I stop and restart the Zeppelin daemon and it works again! Have you come across this? Cheers

Re: ACID transactions on data added from Spark not working

2016-09-14 Thread Mich Talebzadeh
…this is my experience. HTH

Reading the most recent text files created by Spark streaming

2016-09-14 Thread Mich Talebzadeh
…the most recent ones. However, this is looking cumbersome. I can create these files with any timestamp extension when persisting, but System.currentTimeMillis seems to be the most efficient. Any alternatives you can think of? Thanks

Re: Sqoop on Spark

2016-09-14 Thread Mich Talebzadeh
…doc <https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html> HTH

Re: Spark Interview questions

2016-09-14 Thread Mich Talebzadeh
…many other questions that one can think of. For example, someone like Jacek Laskowski could provide more programming questions, as he is a professional Spark trainer :) HTH

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Mich Talebzadeh
…minute for json looks excessive. Is your Spark on the same sub-net as your HDFS, if HDFS and Spark are not sharing the same hardware? HTH

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Mich Talebzadeh
…execution time? A textFile save is simply a one-to-one mapping from your DF to HDFS; I think it is pretty efficient. For myself, I would do something like below: myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output") HTH

Re: Spark SQL Thriftserver

2016-09-14 Thread Mich Talebzadeh
…to have upgraded the Beeline version from 1.2.1. It is a useful tool with Zeppelin. HTH

Re: Spark SQL Thriftserver

2016-09-13 Thread Mich Talebzadeh
…an order of magnitude faster compared to map-reduce. You can connect to beeline either from $HIVE_HOME/... or from $SPARK_HOME. HTH

Re: Spark SQL Thriftserver

2016-09-13 Thread Mich Talebzadeh
…${SPARK_HOME}/sbin/start-thriftserver.sh \
  --master \
  --hiveconf hive.server2.thrift.port=10055 \
and STS bypasses the Spark optimiser and uses the Hive optimizer and execution engine. You will see this in the hive.log file, so I don't think it is going to give you much difference. Unless the…

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Mich Talebzadeh
…5f9d
scala> val sc = new SparkContext(conf)
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@4888425d
HTH

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Mich Talebzadeh
…[2015-12-27,2015-12-27,3,102] [2015-12-27,2015-12-27,3,103] [2015-12-27,2015-12-27,3,104] [2015-12-27,2015-12-27,3,105] [2015-12-27,2015-12-27,4,101] [2015-12-27,2015-12-27,4,102] [2015-12-27,2015-12-27,4,103] [2015-12-27,2015-12-27,4,104] [2015-12-27,2015-12-27,4,105] [2015-12-27,2015-12-27,5,101] [2…

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Mich Talebzadeh
Hi Praseetha. :32: error: not found: value formate. Error occurred in an application involving default arguments. ("1", new java.sql.Date(formate.parse("2016-01-31").getTime)). What is that formate? Thanks

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Mich Talebzadeh
Hi Daan, you may find this link helpful: Re: Is "spark streaming" streaming or mini-batch? <https://www.mail-archive.com/user@spark.apache.org/msg55914.html>. This was a thread in this forum not long ago. HTH

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Mich Talebzadeh
Can you send the RDDs that just create those two dates? HTH

Any viable DATEDIFF function in Spark/Scala

2016-09-13 Thread Mich Talebzadeh
…want to find all the rows that were created in the past 15 minutes? In other words, something similar to this: DATEDIFF(date-part, date-expression1, date-expression2). Any available implementation? Thanks
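One hedged way to express this in Spark/Scala without a DATEDIFF function, assuming a timestamp column named TIMECREATED (the column name is illustrative):

    import org.apache.spark.sql.functions.{col, current_timestamp, unix_timestamp}
    // rows created within the last 15 minutes (difference measured in seconds)
    val recent = df.filter(unix_timestamp(current_timestamp()) - unix_timestamp(col("TIMECREATED")) <= 15 * 60)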

Re: Zeppelin patterns with the streaming data

2016-09-13 Thread Mich Talebzadeh
Hi Chanh, yes indeed. Apparently it is implemented through a class of its own. I have specified a refresh every 15 seconds. Obviously, if there is an issue then the cron will not be able to refresh, but you cannot sort out that problem from the web page anyway. Thanks

Re: Re: Selecting the top 100 records per group by?

2016-09-12 Thread Mich Talebzadeh
Hi, I don't understand why you need to add a row_number column when you can use rank or dense_rank. Why can one not use rank or dense_rank here? A sketch follows below. Thanks
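A short sketch of the dense_rank route for top-N per group; the partition and sort column names are illustrative:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, dense_rank, desc}
    val w = Window.partitionBy("group_col").orderBy(desc("score"))
    // keep the top 100 rows per group
    df.withColumn("rank", dense_rank().over(w)).filter(col("rank") <= 100)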

Re: Using Zeppelin with Spark FP

2016-09-12 Thread Mich Talebzadeh
Does Zeppelin work OK with Spark 2?

Using Zeppelin with Spark FP

2016-09-11 Thread Mich Talebzadeh
[image: Inline images 2] However, if I wrote that using functional programming I won't be able to plot it; the plot feature is not available. Is this correct, or am I missing something? Thanks

Re: Selecting the top 100 records per group by?

2016-09-11 Thread Mich Talebzadeh
You can of course do this using FP:
val wSpec = Window.partitionBy('price).orderBy(desc("price"))
df2.filter('security > " ").select(dense_rank().over(wSpec).as("rank"), 'TIMECREATED, 'SECURITY, substring('PRICE,1,7)).filter('rank <= 10).show
HTH

Re: Selecting the top 100 records per group by?

2016-09-11 Thread Mich Talebzadeh
…,Microsoft,99.99] [2,2016-09-09 22:53:49,Tate & Lyle,99.99] [3,2016-09-09 15:31:06,UNILEVER,99.985] HTH

Re: Spark_JDBC_Partitions

2016-09-10 Thread Mich Talebzadeh
…that table to DEV/TEST, add a sequence (like an IDENTITY column in Sybase), build a unique index on the sequence column, and do the partitioning there, as in the sketch below. HTH
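Once such a sequence column exists, a hedged sketch of the partitioned JDBC read; the URL, table name and bounds are illustrative placeholders:

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SID")
      .option("dbtable", "myschema.mytable")
      .option("partitionColumn", "seq_id")   // the added sequence column
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "10")
      .load()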

Re: Spark_JDBC_Partitions

2016-09-10 Thread Mich Talebzadeh
…will be bitmap indexes on the FACT table, so they can potentially be used. HTH

Re: Spark_JDBC_Partitions

2016-09-10 Thread Mich Talebzadeh
…ch.htm#BDCUG125> that can do it for you. With 404 columns it is difficult to suggest any alternative. Is this a FACT table? HTH

Re: Reading a TSV file

2016-09-10 Thread Mich Talebzadeh
Read header false, not true: val df2 = spark.read.option("header", false).option("delimiter","\t").csv("hdfs://rhes564:9000/tmp/nw_10124772.tsv")
