If you’re interested, here is the link to the development page for Kudu. It has the Spark code snippets using DataFrames.

http://kudu.apache.org/docs/developing.html
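In the meantime, here is a minimal sketch of the kind of thing those snippets do (untested as written; it assumes kudu-spark 1.0.0 on the classpath, and the master address and table name are placeholders):

import org.apache.kudu.spark.kudu._

// Read a Kudu table into a DataFrame (Spark 1.6 / kudu-spark 1.0.0 era API)
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu-master:7051", "kudu.table" -> "stocks"))
  .format("org.apache.kudu.spark.kudu")
  .load()

df.filter(df("ticker") === "TSCO").show(5)

// Write back through a KuduContext; upsertRows is the "fast upserts" mentioned below
val kuduContext = new KuduContext("kudu-master:7051")
kuduContext.upsertRows(df, "stocks")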
Cheers,
Ben

> On Oct 3, 2016, at 9:56 AM, ayan guha <guha.a...@gmail.com> wrote:
>
> That sounds interesting, would love to learn more about it.
>
> Mich: looks good. Lastly, I would suggest you think about whether you really need multiple column families.
>
> On 4 Oct 2016 02:57, "Benjamin Kim" <bbuil...@gmail.com> wrote:
> Lately, I’ve been experimenting with Kudu. It has been a much better experience than with HBase. Using it is much simpler, even from spark-shell.
>
> spark-shell --packages org.apache.kudu:kudu-spark_2.10:1.0.0
>
> It’s like going back to rudimentary DB systems, where tables have just a primary key and the columns. Additional benefits include a home-grown Spark package, fast upserts and table scans for analytics, just-introduced time-series support, and (my favorite) simpler configuration and administration. It has just gone to version 1.0.0, so I’m waiting for 1.0.1+ for some bugs to shake out before I propose it as our HBase replacement. All my performance tests have been stellar versus HBase, especially given its simplicity.
>
> Just a thought…
>
> Cheers,
> Ben
>
>> On Oct 3, 2016, at 8:40 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I decided to create a composite key, ticker-date, from the csv file, so I did some manipulation on it:
>>
>> export IFS=","; sed -i 1d tsco.csv; cat tsco.csv | while read a b c d e f; do echo "TSCO-$a,TESCO PLC,TSCO,$a,$b,$c,$d,$e,$f"; done > temp; mv -f temp tsco.csv
>>
>> This tells the shell that the field separator is IFS=",", drops the header, reads every field in every line (a, b, c, ...), creates the composite key TSCO-$a, and adds the stock name and ticker to each line of the csv file. The whole process can be automated and parameterised.
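The same rewrite can also be done with a small Spark job instead of the shell loop. A minimal sketch from spark-shell (untested; the paths and the hard-coded ticker are placeholders):

// Build the composite-key CSV in Spark: drop the header, prefix each line
// with a TICKER-date row key, and prepend the stock name and ticker.
val raw = sc.textFile("hdfs://rhes564:9000/data/stocks/tsco.csv")
val header = raw.first()                      // "Date,Open,High,Low,Close,Volume"
val keyed = raw.filter(_ != header).map { line =>
  val f = line.split(",")                     // f(0) = Date, f(1..5) = Open..Volume
  s"TSCO-${f(0)},TESCO PLC,TSCO,$line"
}
keyed.saveAsTextFile("hdfs://rhes564:9000/data/stocks/tsco_keyed")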
>> Once the csv file is in HDFS, I run the following command:
>>
>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
>>   -Dimporttsv.separator=',' \
>>   -Dimporttsv.columns="HBASE_ROW_KEY,stock_info:stock,stock_info:ticker,stock_daily:Date,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" \
>>   tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>
>> The HBase table is created as below:
>>
>> create 'tsco','stock_info','stock_daily'
>>
>> and this is the data (two rows, each with two column families and eight attributes):
>>
>> hbase(main):132:0> scan 'tsco', LIMIT => 2
>> ROW             COLUMN+CELL
>> TSCO-1-Apr-08   column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-08
>> TSCO-1-Apr-08   column=stock_daily:close, timestamp=1475507091676, value=405.25
>> TSCO-1-Apr-08   column=stock_daily:high, timestamp=1475507091676, value=406.75
>> TSCO-1-Apr-08   column=stock_daily:low, timestamp=1475507091676, value=379.25
>> TSCO-1-Apr-08   column=stock_daily:open, timestamp=1475507091676, value=380.00
>> TSCO-1-Apr-08   column=stock_daily:volume, timestamp=1475507091676, value=49664486
>> TSCO-1-Apr-08   column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>> TSCO-1-Apr-08   column=stock_info:ticker, timestamp=1475507091676, value=TSCO
>> TSCO-1-Apr-09   column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-09
>> TSCO-1-Apr-09   column=stock_daily:close, timestamp=1475507091676, value=333.30
>> TSCO-1-Apr-09   column=stock_daily:high, timestamp=1475507091676, value=334.60
>> TSCO-1-Apr-09   column=stock_daily:low, timestamp=1475507091676, value=326.50
>> TSCO-1-Apr-09   column=stock_daily:open, timestamp=1475507091676, value=331.10
>> TSCO-1-Apr-09   column=stock_daily:volume, timestamp=1475507091676, value=24877341
>> TSCO-1-Apr-09   column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>> TSCO-1-Apr-09   column=stock_info:ticker, timestamp=1475507091676, value=TSCO
>>
>> Any suggestions?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
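Once the composite key is in place, the TICKER- prefix makes range scans from Spark straightforward. A minimal read-back sketch (untested; it assumes the HBase client jars and hbase-site.xml are on the spark-shell classpath):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "tsco")
// Restrict the scan to row keys with the TSCO- prefix
conf.set(TableInputFormat.SCAN_ROW_START, "TSCO-")
conf.set(TableInputFormat.SCAN_ROW_STOP,  "TSCO.")   // '.' sorts just after '-'

val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// Pull out (row key, close price) pairs
val closes = rdd.map { case (_, r) =>
  (Bytes.toString(r.getRow),
   Bytes.toString(r.getValue(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"))))
}
closes.take(5).foreach(println)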
>> On 3 October 2016 at 14:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> Or maybe add ticker+date, something like this:
>>
>> <image.png>
>>
>> So the new row key would be TSCO-1-Apr-08, and this will be added as the row key. Both Date and ticker will stay as they are, as column family attributes?
>>
>> Dr Mich Talebzadeh
>>
>> On 3 October 2016 at 14:32, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> With ticker+date I can create something like below for the row key:
>>
>> TSCO_1-Apr-08
>>
>> or TSCO1-Apr-08
>>
>> if I understood you correctly.
>>
>> Dr Mich Talebzadeh
>>
>> On 3 October 2016 at 13:13, ayan guha <guha.a...@gmail.com> wrote:
>> Hi
>>
>> Looks like you are saving to new.csv but still loading tsco.csv? It's definitely the header.
>>
>> Suggestion: ticker+date as the row key has the following benefits (see the sketch after this list):
>>
>> 1. Using ticker+date as the row key will enable you to hold multiple tickers in this single HBase table (think composite primary key).
>> 2. Using the date alone as the row key will lead to hotspots (look up hotspotting due to monotonically increasing row keys). To distribute the load, it is suggested to use salting; the ticker can serve as a natural salt in this case.
>> 3. Also, you may want to hash the row key value to make it a little more flexible (think surrogate key).
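A minimal sketch of what points 2 and 3 look like in practice (illustrative only; the helper names are made up):

import java.security.MessageDigest

// Point 2: the ticker acts as a natural salt, spreading writes across regions
def saltedKey(ticker: String, date: String): String = s"$ticker-$date"

// Point 3: a short hash prefix decouples the distribution from the ticker
// names themselves (hypothetical helper; any stable hash would do)
def hashedKey(ticker: String, date: String): String = {
  val md5 = MessageDigest.getInstance("MD5")
  val prefix = md5.digest(ticker.getBytes("UTF-8")).take(2).map("%02x".format(_)).mkString
  s"$prefix-$ticker-$date"
}

// hashedKey("TSCO", "1-Apr-08") gives something like "ab12-TSCO-1-Apr-08"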
>> On Mon, Oct 3, 2016 at 10:17 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> Hi Ayan,
>>
>> Sounds like the row key has to be unique, much like a primary key in an RDBMS.
>>
>> This is what I download as a csv for the stock from Google Finance:
>>
>> Date       Open   High    Low    Close   Volume
>> 27-Sep-16  177.4  177.75  172.5  177.75  24117196
>>
>> So what I do is add the stock and ticker myself to the end of each row via a shell script and get rid of the header:
>>
>> sed -i 1d tsco.csv; cat tsco.csv | awk '{print $0",TESCO PLC,TSCO"}' > new.csv
>>
>> This creates a new csv file with two additional columns appended to the end of each line. The new table has two column families, stock_daily and stock_info, and the date as the row key (one row per date).
>>
>> Then I run the following command:
>>
>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
>>   -Dimporttsv.separator=',' \
>>   -Dimporttsv.columns="HBASE_ROW_KEY,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume,stock_info:stock,stock_info:ticker" \
>>   tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>
>> This is in the HBase table for a given day:
>>
>> hbase(main):090:0> scan 'tsco', LIMIT => 10
>> ROW        COLUMN+CELL
>> 1-Apr-08   column=stock_daily:close, timestamp=1475492248665, value=405.25
>> 1-Apr-08   column=stock_daily:high, timestamp=1475492248665, value=406.75
>> 1-Apr-08   column=stock_daily:low, timestamp=1475492248665, value=379.25
>> 1-Apr-08   column=stock_daily:open, timestamp=1475492248665, value=380.00
>> 1-Apr-08   column=stock_daily:volume, timestamp=1475492248665, value=49664486
>> 1-Apr-08   column=stock_info:stock, timestamp=1475492248665, value=TESCO PLC
>> 1-Apr-08   column=stock_info:ticker, timestamp=1475492248665, value=TSCO
>>
>> But I also have this at the bottom:
>>
>> Date       column=stock_daily:close, timestamp=1475491189158, value=Close
>> Date       column=stock_daily:high, timestamp=1475491189158, value=High
>> Date       column=stock_daily:low, timestamp=1475491189158, value=Low
>> Date       column=stock_daily:open, timestamp=1475491189158, value=Open
>> Date       column=stock_daily:volume, timestamp=1475491189158, value=Volume
>> Date       column=stock_info:stock, timestamp=1475491189158, value=TESCO PLC
>> Date       column=stock_info:ticker, timestamp=1475491189158, value=TSCO
>>
>> Sounds like the table header?
>>
>> Dr Mich Talebzadeh
>>
>> On 3 October 2016 at 11:24, ayan guha <guha.a...@gmail.com> wrote:
>> I am not well versed with importtsv, but you can create a CSV file using a simple Spark program to make the first column ticker+tradedate (a sketch appears earlier in this thread). I remember doing similar manipulation to create a row key format in Pig.
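As for the stray header row already sitting in the table, a one-off delete clears it. A sketch (untested; plain HBase 1.x client API):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete}
import org.apache.hadoop.hbase.util.Bytes

// Remove the "Date" row that the un-stripped CSV header created
val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("tsco"))
table.delete(new Delete(Bytes.toBytes("Date")))
table.close()
conn.close()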
>> On 3 Oct 2016 20:40, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>> Thanks Ayan,
>>
>> How do you specify ticker+tradedate as the row key in the below?
>>
>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
>>   -Dimporttsv.separator=',' \
>>   -Dimporttsv.columns="HBASE_ROW_KEY,stock_daily:ticker,stock_daily:tradedate,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" \
>>   tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>
>> I always thought that HBase takes the first column as the row key, so it takes the stock as the row key, which is "Tesco PLC" for every row!
>>
>> Does the row key need to be unique?
>>
>> Cheers
>>
>> Dr Mich Talebzadeh
>>
>> On 3 October 2016 at 10:30, ayan guha <guha.a...@gmail.com> wrote:
>> Hi Mich
>>
>> It is more to do with HBase than Spark.
>>
>> The row key can be anything, yes, but essentially what you are doing is inserting into and updating the single "Tesco PLC" row. Given your schema, ticker+tradedate seems to be a good row key.
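That insert-and-update behaviour is easy to demonstrate: two puts against the same row key leave exactly one row. A sketch (untested; HBase 1.x client API):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("tsco"))

// Two puts, same row key: the second becomes the newest cell version, and
// with VERSIONS => 1 the older value is discarded at the next compaction.
val p1 = new Put(Bytes.toBytes("Tesco PLC"))
p1.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"), Bytes.toBytes("325.25"))
table.put(p1)

val p2 = new Put(Bytes.toBytes("Tesco PLC"))
p2.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"), Bytes.toBytes("330.00"))
table.put(p2)

// scan 'tsco' now shows a single row with close=330.00
table.close(); conn.close()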
>> On 3 Oct 2016 18:25, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>> Thanks again.
>>
>> I added that jar file to the classpath, and that part worked. I was using spark-shell, so I will have to use spark-submit for it to be able to interact with the map-reduce job.
>>
>> BTW, when I use the command-line utility ImportTsv to load a file into an HBase table with the following format:
>>
>> describe 'marketDataHbase'
>> Table marketDataHbase is ENABLED
>> marketDataHbase
>> COLUMN FAMILIES DESCRIPTION
>> {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
>> 1 row(s) in 0.0930 seconds
>>
>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
>>   -Dimporttsv.separator=',' \
>>   -Dimporttsv.columns="HBASE_ROW_KEY,stock_daily:ticker,stock_daily:tradedate,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" \
>>   tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>
>> There are 1200 rows in the csv file, but it only loads the first row!
>>
>> scan 'tsco'
>> ROW         COLUMN+CELL
>> Tesco PLC   column=stock_daily:close, timestamp=1475447365118, value=325.25
>> Tesco PLC   column=stock_daily:high, timestamp=1475447365118, value=332.00
>> Tesco PLC   column=stock_daily:low, timestamp=1475447365118, value=324.00
>> Tesco PLC   column=stock_daily:open, timestamp=1475447365118, value=331.75
>> Tesco PLC   column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
>> Tesco PLC   column=stock_daily:tradedate, timestamp=1475447365118, value= 3-Jan-06
>> Tesco PLC   column=stock_daily:volume, timestamp=1475447365118, value=46935045
>> 1 row(s) in 0.0390 seconds
>>
>> Is this because the HBASE_ROW_KEY, "Tesco PLC", is the same for all of them? I thought that the row key could be anything.
>>
>> Dr Mich Talebzadeh
>>
>> On 3 October 2016 at 07:44, Benjamin Kim <bbuil...@gmail.com> wrote:
>> We installed Apache Spark 1.6.0 alongside CDH 5.4.8 because Cloudera only had Spark 1.3.0 at the time, and we wanted to use Spark 1.6.0's features. We borrowed the /etc/spark/conf/spark-env.sh file that Cloudera generated, because it was customized to add jars first from the paths listed in /etc/spark/conf/classpath.txt. So we entered the path for the htrace jar into /etc/spark/conf/classpath.txt, and then it worked: we could read/write to HBase.
>>
>>> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Thanks Ben
>>>
>>> The thing is, I am using Spark 2 and no stack from CDH!
>>>
>>> Is this approach to reading/writing to HBase specific to Cloudera?
>>>
>>> Dr Mich Talebzadeh
>>>
>>> On 1 October 2016 at 23:39, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> Mich,
>>>
>>> I know that up until CDH 5.4 we had to add the HTrace jar to the classpath to make it work, using the command below. But after upgrading to CDH 5.7, it became unnecessary.
>>>
>>> echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar" >> /etc/spark/conf/classpath.txt
>>>
>>> Hope this helps.
>>>
>>> Cheers,
>>> Ben
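On a vanilla (non-CDH) install, a quick probe from spark-shell shows whether the HTrace jar is actually visible (a sketch; the class name assumes htrace-core 3.2.x):

// Probe the classpath for HTrace; if this fails, pass the jar explicitly
// (e.g. via --jars or spark.driver.extraClassPath when launching the shell)
try {
  Class.forName("org.apache.htrace.Trace")
  println("htrace-core is on the classpath")
} catch {
  case _: ClassNotFoundException =>
    println("htrace-core is missing; add the jar and restart the shell")
}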
>>>> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>> Trying a bulk load using HFiles in Spark, as in the example below:
>>>>
>>>> import org.apache.spark._
>>>> import org.apache.spark.rdd.NewHadoopRDD
>>>> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
>>>> import org.apache.hadoop.hbase.client.HBaseAdmin
>>>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>> import org.apache.hadoop.fs.Path
>>>> import org.apache.hadoop.hbase.HColumnDescriptor
>>>> import org.apache.hadoop.hbase.util.Bytes
>>>> import org.apache.hadoop.hbase.client.Put
>>>> import org.apache.hadoop.hbase.client.HTable
>>>> import org.apache.hadoop.hbase.mapred.TableOutputFormat
>>>> import org.apache.hadoop.mapred.JobConf
>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>> import org.apache.hadoop.mapreduce.Job
>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>>>> import org.apache.hadoop.hbase.KeyValue
>>>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
>>>> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
>>>>
>>>> So far no issues. Then I do:
>>>>
>>>> val conf = HBaseConfiguration.create()
>>>> conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
>>>> val tableName = "testTable"
>>>> tableName: String = testTable
> ...
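For reference, the remaining bulk-load steps usually take roughly this shape with the pre-1.0 API those imports suggest. A sketch continuing the session above, not the author's actual code (conf, tableName and the imports are already in scope; the column family, column and paths are placeholders):

// HFiles need (rowkey, KeyValue) pairs in row-key order
val rows = sc.parallelize(Seq(("r1", "v1"), ("r2", "v2")))
val kvs = rows.sortByKey().map { case (k, v) =>
  (new ImmutableBytesWritable(Bytes.toBytes(k)),
   new KeyValue(Bytes.toBytes(k), Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(v)))
}

// Configure the job from the target table, write HFiles, then bulk-load them
val job = Job.getInstance(conf)
val table = new HTable(conf, tableName)
HFileOutputFormat.configureIncrementalLoad(job, table)
kvs.saveAsNewAPIHadoopFile("/tmp/hfiles", classOf[ImmutableBytesWritable],
  classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration)
new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table)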