Thanks Jorn.
I wish we had these libraries somewhere :)
Dr Mich Talebzadeh
Thanks
Dr Mich Talebzadeh
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:40)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at org.apache.spark.sql.SQLContext.load(SQLCon
OK so you are disabling broadcasting although it is not obvious how this
helps in this case!
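For the record, the usual switch for this is the broadcast join threshold; setting it to -1 disables automatic broadcast joins, e.g.

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// or at launch time: --conf spark.sql.autoBroadcastJoinThreshold=-1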
Dr Mich Talebzadeh
OK so what is your full launch code now? I mean equivalent to spark-submit
Dr Mich Talebzadeh
\
--executor-memory 2G \
--master spark://IPAddress:7077 \
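For reference, a complete standalone-mode launch would look something like this (the class name and jar path are purely illustrative):

spark-submit \
--master spark://IPAddress:7077 \
--executor-memory 2G \
--total-executor-cores 2 \
--class com.example.MyApp \
/path/to/myapp.jar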
HTH
Dr Mich Talebzadeh
query by LIMIT on each underlying table does not make sense and will not be industry standard AFAIK.
HTH
Dr Mich Talebzadeh
Still does not work with Spark 2.0.0 on apache-phoenix-4.8.1-HBase-1.2-bin
thanks
Dr Mich Talebzadeh
st and put it into HDFS and then you can access it through Hive external tables etc.
A real-time load of data using Spark JDBC makes sense if the RDBMS table itself is pretty small. Most dimension tables should satisfy this. This approach is not advisable for FACT tables.
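A minimal sketch of such a JDBC load of a dimension table (URL, table and credentials are illustrative):

val dimDF = spark.read.format("jdbc").
  option("url", "jdbc:oracle:thin:@rdbmshost:1521:mydb").
  option("dbtable", "scratchpad.dim_table").
  option("user", "myuser").
  option("password", "xxxx").
  load()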
HTH
Dr Mich Talebzadeh
environment.
Dr Mich Talebzadeh
4. There is a Hive managed table with added optimisation/indexing (ORC)
There are a number of ways of doing it as usual.
Thanks
Dr Mich Talebzadeh
I assume that HBase is more of a columnar data store by virtue of it storing column data together.
Many interpretations of this are all over the place. However, it is not columnar in the sense of a column-based (as opposed to row-based) implementation of the relational model.
Dr Mich Talebzadeh
have an in-memory database (LLAP) so we can cache Hive tables in memory. That will be faster. Many people underestimate Hive, but I still believe it has a lot to offer besides serious ANSI-compliant SQL.
Regards
Mich
Dr Mich Talebzadeh
Ben,
Also look at Phoenix (Apache project) which provides a better (one of the best) SQL/JDBC layer on top of HBase: http://phoenix.apache.org/
I am afraid this does not work with Spark 2!
Dr Mich Talebzadeh
manufacturer, model and color"
How about using some analytics and windowing functions here? Spark supports all sorts of analytic functions.
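For instance, something along these lines, with illustrative column names:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// rank rows within each (manufacturer, model, color) group by price
val wSpec = Window.partitionBy("manufacturer", "model", "color").orderBy(desc("price"))
df.select(col("manufacturer"), col("model"), col("color"), col("price"),
  dense_rank().over(wSpec).as("rank")).filter(col("rank") <= 3).show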
HTH
Dr Mich Talebzadeh
+-------------------+---+--------+
|2016-10-16T18:44:57|S18|74.10128|
|2016-10-16T18:44:57|S07|66.13622|
|2016-10-16T18:44:57|S20|60.35727|
+-------------------+---+--------+
only showing top 10 rows
Dr Mich Talebzadeh
|2016-10-16T18:44:57|S18|74.10128|
|2016-10-16T18:44:57|S07|66.13622|
|2016-10-16T18:44:57|S20|60.35727|
+-------------------+---+--------+
only showing top 10 rows
Is this a workable solution?
Thanks
Dr Mich Talebzadeh
add --jars /spark-streaming-kafka_2.10-1.5.1.jar
(may need to download the jar file or any newer version)
to spark-shell.
I also have spark-streaming-kafka-assembly_2.10-1.6.1.jar as well on the --jars list.
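In other words the launch would be something like (jar paths as above):

spark-shell --jars /spark-streaming-kafka_2.10-1.5.1.jar,spark-streaming-kafka-assembly_2.10-1.6.1.jar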
HTH
Dr Mich Talebzadeh
directly for Spark 2 is not available, and even if it were, using a SQL skin for visualisation tools is better.
Sorry about this long monologue. Appreciate any feedback.
Dr Mich Talebzadeh
URLClassLoader@7b44e98e
Thanks
Dr Mich Talebzadeh
Hi Kevin,
What is the streaming interval (batch interval) above?
I do analytics on streaming trade data, but after manipulation of individual messages I store the selected ones in HBase. Very fast.
HTH
Dr Mich Talebzadeh
Thanks Ben
The thing is I am using Spark 2 and no stack from CDH!
Is this approach to reading/writing to HBase specific to Cloudera?
Dr Mich Talebzadeh
,/home/hduser/jars/hbase-common-1.2.3.jar,/home/hduser/jars/hbase-protocol-1.2.3.jar,/home/hduser/jars/htrace-core-3.0.4.jar,/home/hduser/jars/hive-hbase-handler-2.1.0.jar'
So any ideas will be appreciated.
Thanks
Dr Mich Talebzadeh
What will happen if you LIMIT the result set to 100 rows only, i.e. SELECT ... FROM ... ORDER BY field LIMIT 100? Will that work?
How about running the whole query WITHOUT the ORDER BY?
HTH
Dr Mich Talebzadeh
I have designed this prototype for a risk business. Here I would like to discuss issues with the batch layer. Apologies about being long-winded.
Business objective
Reduce risk in the credit business while making better credit and trading decisions. Specifically, to identify risk trends within
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
Defined as external.
HTH
Dr Mich Talebzadeh
files, unless the compaction is done (a nightmare)
HTH
Dr Mich Talebzadeh
|Tesco PLC| TSCO|26-Aug-11|   -|   -|  -|365.60|     0|
|Tesco PLC| TSCO|28-Apr-11|   -|   -|  -|403.55|     0|
|Tesco PLC| TSCO|21-Apr-11|   -|   -|  -|395.30|     0|
|Tesco PLC| TSCO|24-Dec-10|   -|   -|  -|439.00|     0|
+---------+-----+---------+----+----+---+------+------+
Dr Mich Talebzadeh
Hi Ali,
What is the business use case for this?
Dr Mich Talebzadeh
as well and from Flume to HBase.
I would have thought that if one wanted to do real-time analytics with SS (Spark Streaming), then that would be a good fit with a real-time dashboard.
What is not so clear is the business use case for this.
HTH
Dr Mich Talebzadeh
Tuples cannot be directly destructured in method or function parameters.
Either create a single parameter accepting the Tuple1,
or consider a pattern matching anonymous function: `{ case (param1, param1) => ... }`
val rs = df2.filter(isAllPostiveNumber("Open") => true)
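If the intent was to keep rows whose Open column holds a positive number, one sketch is an explicit UDF (isAllPostiveNumber is assumed here to be user code along these lines):

import org.apache.spark.sql.functions._
// hypothetical UDF: true when the string parses to a positive number
val isAllPostiveNumber = udf((s: String) =>
  s != null && scala.util.Try(s.toDouble).toOption.exists(_ > 0))
val rs = df2.filter(isAllPostiveNumber(col("Open")))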
Thanks
Kafka into Spark Streaming and that will be online or near real time (defined by your window).
Then you have a serving layer to present data from both the speed layer (the one from SS) and the batch layer.
HTH
Dr Mich Talebzadeh
Tableau or Zeppelin to query data.
You will also need Spark Streaming to query data online for the speed layer. That data could be stored in some transient fabric like Ignite or even Druid.
HTH
Dr Mich Talebzadeh
> spark.sql("SELECT cast(value as FLOAT) from lines").show()
>
> +-----+
> |value|
> +-----+
> | null|
> |  1.0|
> | null|
> |  8.6|
> +-----+
>
> After it you may filter the DataFrame for values containing null.
>
> Regards,
> --
> Bedrytski
show
+-----+------+---------+----+----+---+-----+------+
|Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
+-----+------+---------+----+----+---+-----+------+
+-----+------+---------+----+----+---+-----+------+
Any suggestions?
Thanks
Dr Mich Talebzadeh
On 28 September 2016 at 04:07, Mike Metzger <m...@flexiblecreations.com>
wrote:
> Hi Mich -
>
> Can you run a filter command on df1 prior to your map for any rows
> where p(3).toString != '-' then run y
one check for rogue data in p(3)?
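Presumably something along these lines, assuming df1 holds the split rows:

val cleaned = df1.filter(p => p(3).toString != "-")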
Thanks
Dr Mich Talebzadeh
Alternatively do the clean-up before putting the csv in HDFS, but that becomes tedious and error prone.
Any ideas will be appreciated.
Dr Mich Talebzadeh
t in a shell script.
HTH
Dr Mich Talebzadeh
then be used for Complex Event Processing.
HTH
Dr Mich Talebzadeh
HTH
Dr Mich Talebzadeh
You can do the following with option("delimiter"):
val df = spark.read.option("header",
false).option("delimiter","\t").csv("hdfs://rhes564:9000/tmp/nw_10124772.tsv")
HTH
Dr Mich Talebzadeh
I trust that this explains it.
Thanks
Dr Mich Talebzadeh
of Spark deploy a Least Recently Used (LRU) mechanism to flush unused data out of memory, much like RDBMS cache management. I know LLAP does that.
HTH
Dr Mich Talebzadeh
ng how can I investigate further?
I have attached the jar file
thanks
Dr Mich Talebzadeh
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.hbase.mapreduce.ImportTsv.main(ImportTsv.java:684)
Dr Mich Talebzadeh
: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
Dr Mich Talebzadeh
,977 [myid:] - INFO [main:Job@1356] - Job job_1474455325627_0041 completed successfully
2016-09-21 19:11:15,138 [myid:] - ERROR [main:ImportTool@607] - Imported Failed: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
Any ideas?
Thanks
Dr Mich Talebzadeh
LOL
I think we should try the crystal ball to answer this question.
Dr Mich Talebzadeh
If you make your driver memory too low it is likely you are going to hit an OOM error.
You have not mentioned which Spark mode you are using (Local, Standalone, YARN, etc.).
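If driver memory is the suspect, it can be raised at submit time, for example:

spark-submit --driver-memory 4G ...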
HTH
Dr Mich Talebzadeh
Hi,
Zoomdata <http://www.zoomdata.com/product/> is known to be a good tool for real-time dashboards. I am trying to have a look. Has anyone experience with it with Spark by any chance?
https://demo.zoomdata.com/zoomdata/login
Thanks
Dr Mich Talebzadeh
I am not sure a commit or rollback by the RDBMS is acknowledged by Spark. Hence it does not know what is going on. From my recollection this is an issue.
Another alternative is to save it as a CSV file and load it into the RDBMS using a form of bulk copy.
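A rough sketch of that route (the output path is illustrative); the resulting files can then be bulk-loaded with the RDBMS's native loader (bcp, sqlldr, LOAD DATA, etc.):

df.write.option("header", "true").csv("hdfs://host:9000/tmp/staging_csv")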
HTH
Dr Mich Talebzadeh
As I understand it, you are inserting into an RDBMS from Spark and the insert is failing on the RDBMS due to a duplicate primary key, but this is not acknowledged by Spark? Is this correct?
HTH
Dr Mich Talebzadeh
Something like this:
df.filter('transactiontype > " ").
  filter(not('transactiontype === "DEB") && not('transactiontype === "BGC")).
  select('transactiontype).distinct.
  collect.foreach(println)
HTH
Dr Mich Talebzadeh
Spark UI on port 4040 by default
HTH
Dr Mich Talebzadeh
.marketDataParquet").select('TIMECREATED, 'SECURITY, 'PRICE)
    df2
  }
  case _ => {
    println("No valid option provided")
    sys.exit
  }
}
For one reason or another the following
case _ => sys.err("no valid option provided")
threw an error! (Presumably because scala.sys has no err member; sys.error(...) is the method that aborts with a message.)
Dr Mich Talebzadeh
Any opinion on this, please?
Dr Mich Talebzadeh
layer.
Cheers
Dr Mich Talebzadeh
with a different port. Then of course one has to think about adequate response in a concurrent environment.
Cheers
Dr Mich Talebzadeh
When I try to do df2.printSchema OUTSIDE of the loop, it comes back
with an error:
scala> df2.printSchema
:31: error: not found: value df2
       df2.printSchema
       ^
I can define a stub df2 before the if/else statement. Is that the best way of
dealing with it?
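One common alternative to a stub is to make the if/else an expression and bind its result once, e.g. (the condition and sources are illustrative):

// df2 is bound in the outer scope, so it is visible after the branch
val df2 = if (useParquet) spark.read.parquet("path/a")
          else spark.read.option("header", "true").csv("path/b")
df2.printSchema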
Thanks
Dr Mich Talebzadeh
Thanks Todd.
I will have a look.
Regards
Dr Mich Talebzadeh
Ideally I would like to utilise a concept like TimesTen. Can one distribute Hive table data (or any table data) across the nodes, cached? In that case we would be doing logical IO, which is about 20 times or more lightweight compared to physical IO.
Anyway, this is the concept.
Thanks
means that the other side has initiated a
connection close, but the application on the local side has not yet closed
the socket
Normally it should be LISTEN or ESTABLISHED.
HTH
Dr Mich Talebzadeh
t;).parquet("test.sales6")
It may work.
HTH
Dr Mich Talebzadeh
It is difficult to guess what is happening with your data.
First, when you say you use Spark to generate test data, are these selected randomly and then stored in a Hive/etc. table?
HTH
Dr Mich Talebzadeh
Is your Hive Thrift Server up and running on port 10001 (jdbc:hive2://<host>:10001)?
Do the following
netstat -alnp |grep 10001
and see whether it is actually running
HTH
Dr Mich Talebzadeh
to consider all options
Thanks
Dr Mich Talebzadeh
Any ideas on this?
Dr Mich Talebzadeh
Where are you reading data from, Chanh?
Dr Mich Talebzadeh
for the users as well.
I was wondering what would be the best strategy here: Druid, Hive, others?
The business case here is that users may want to access older data, so a database of some sort would be a better solution? In all likelihood they want a week's data.
Thanks
Dr Mich Talebzadeh
Yes thanks. I had Flume already for Twitter, so I configured it to get data from a Kafka source and post it to HDFS.
Cheers
Dr Mich Talebzadeh
Thanks Chanh,
I noticed one thing. If you put on a cron refresh, say every 30 seconds, after a while the job crashes with an OOM error.
Then I stop and restart the Zeppelin daemon and it works again!
Have you come across it?
Cheers
Dr Mich Talebzadeh
this is my experience.
HTH
Dr Mich Talebzadeh
the most recent ones.
However, this looks cumbersome. I can create these files with any timestamp extension when persisting, but System.currentTimeMillis seems to be the most efficient.
Any alternatives you can think of?
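In other words roughly this (the base path is illustrative):

val stamp = System.currentTimeMillis
df.write.parquet(s"hdfs://host:9000/data/prices_$stamp")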
Thanks
Dr Mich Talebzadeh
doc
<https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html>
HTH
Dr Mich Talebzadeh
many other questions that one can think of. For example, someone like Jacek Laskowski can provide more programming questions, as he is a professional Spark trainer :)
HTH
Dr Mich Talebzadeh
ute for json looks excessive.
Is your Spark on the same subnet as your HDFS, if HDFS and Spark are not sharing the same hardware?
HTH
Dr Mich Talebzadeh
execution time?
A textFile saving is simply a one-to-one mapping from your DF to HDFS. I think it is pretty efficient.
For myself, I would do something like below
myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
HTH
Dr Mich Talebzadeh
to have upgraded the Beeline version from 1.2.1.
It is a useful tool with Zeppelin.
HTH
Dr Mich Talebzadeh
order of magnitude faster compared to map-reduce.
You can connect either to beeline from $HIVE_HOME/... or beeline from $SPARK_HOME.
HTH
Dr Mich Talebzadeh
${SPARK_HOME}/sbin/start-thriftserver.sh \
--master \
--hiveconf hive.server2.thrift.port=10055 \
and STS bypasses the Spark optimiser and uses the Hive optimizer and execution engine. You will see this in the hive.log file.
So I don't think it is going to give you much difference. Unless the
scala> val sc = new SparkContext(conf)
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@4888425d
HTH
Dr Mich Talebzadeh
[2015-12-27,2015-12-27,3,102]
[2015-12-27,2015-12-27,3,103]
[2015-12-27,2015-12-27,3,104]
[2015-12-27,2015-12-27,3,105]
[2015-12-27,2015-12-27,4,101]
[2015-12-27,2015-12-27,4,102]
[2015-12-27,2015-12-27,4,103]
[2015-12-27,2015-12-27,4,104]
[2015-12-27,2015-12-27,4,105]
[2015-12-27,2015-12-27,5,101]
[2
Hi Praseetha.
:32: error: not found: value formate
Error occurred in an application involving default arguments.
("1", new
java.sql.Date(formate.parse("2016-01-31").getTime)),
What is that formate?
Thanks
Dr Mich Talebzadeh
Hi Daan,
You may find this link Re: Is "spark streaming" streaming or mini-batch?
<https://www.mail-archive.com/user@spark.apache.org/msg55914.html>
helpful. This was a thread in this forum not long ago.
HTH
Dr Mich Talebzadeh
Can you send the RDDs that just create those two dates?
HTH
Dr Mich Talebzadeh
want to find all the rows that were created in the past 15 minutes?
In other words something similar to this:
DATEDIFF(date-part, date-expression1, date-expression2)
Any available implementation?
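One possible sketch in Spark, assuming a timestamp column named created_at:

import org.apache.spark.sql.functions._
// keep rows whose created_at falls within the last 15 minutes
val recent = df.filter(
  unix_timestamp(current_timestamp()) - unix_timestamp(col("created_at")) <= 15 * 60)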
Thanks
Dr Mich Talebzadeh
Hi Chanh,
Yes indeed. Apparently it is implemented through a class of its own. I have
specified a refresh of every 15 seconds.
Obviously if there is an issue then the cron will not be able to refresh, but you cannot sort out that problem from the web page anyway.
Thanks
Dr Mich Talebzadeh
Hi,
I don't understand why you need to add a column row_number when you can use rank or dense_rank?
Why can one not use rank or dense_rank here?
Thanks
Dr Mich Talebzadeh
Does Zeppelin work OK with Spark 2?
Dr Mich Talebzadeh
[image: Inline images 2]
However, if I wrote that using functional programming I won't be able to plot it; the plot feature is not available.
Is this correct, or am I missing something?
Thanks
Dr Mich Talebzadeh
You can of course do this using FP.
val wSpec = Window.partitionBy('price).orderBy(desc("price"))
df2.filter('security > " ").
  select(dense_rank().over(wSpec).as("rank"), 'TIMECREATED, 'SECURITY, substring('PRICE, 1, 7)).
  filter('rank <= 10).show
HTH
Dr Mich Talebzadeh
,Microsoft,99.99]
[2,2016-09-09 22:53:49,Tate & Lyle,99.99]
[3,2016-09-09 15:31:06,UNILEVER,99.985]
HTH
Dr Mich Talebzadeh
that table to
DEV/TEST, add a sequence (like an IDENTITY column in Sybase), build a
unique index on the sequence column and do the partitioning there.
HTH
Dr Mich Talebzadeh
will
be bitmap indexes on the FACT table so they can be potentially used.
HTH
Dr Mich Talebzadeh
ch.htm#BDCUG125> that can do it for you. With 404 columns it is difficult to suggest any alternative. Is this a FACT table?
HTH
Dr Mich Talebzadeh
Read header as false, not true:
val df2 = spark.read.option("header",
false).option("delimiter","\t").csv("hdfs://rhes564:9000/tmp/nw_10124772.tsv")
Dr Mich Talebzadeh