If you’re interested, here is the link to the development page for Kudu. It has 
the Spark code snippets using DataFrames.

http://kudu.apache.org/docs/developing.html
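
For example, a minimal read in spark-shell looks roughly like this (the master 
address and table name are placeholders, not from any real cluster):

import org.apache.kudu.spark.kudu._

// Read a Kudu table into a Spark DataFrame (kudu-spark 1.0.0, Spark 1.6)
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudumaster:7051", "kudu.table" -> "stocks"))
  .kudu
df.show()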

Cheers,
Ben

> On Oct 3, 2016, at 9:56 AM, ayan guha <guha.a...@gmail.com> wrote:
> 
> That sounds interesting; I would love to learn more about it.
> 
> Mitch: looks good. Lastly, I would suggest you think about whether you really 
> need multiple column families.
> 
> On 4 Oct 2016 02:57, "Benjamin Kim" <bbuil...@gmail.com> wrote:
> Lately, I’ve been experimenting with Kudu. It has been a much better 
> experience than with HBase. Using it is much simpler, even from spark-shell.
> 
> spark-shell --packages org.apache.kudu:kudu-spark_2.10:1.0.0
> 
> It’s like going back to rudimentary DB systems where tables have just a 
> primary key and the columns. Additional benefits include a home-grown Spark 
> package, fast upserts and table scans for analytics, newly introduced 
> time-series support, and (my favorite) simpler configuration and 
> administration. It has just reached version 1.0.0, so I’m waiting for 1.0.1+ 
> for some bugs to shake out before I propose it as our HBase replacement. All 
> my performance tests versus HBase have been stellar, especially given its 
> simplicity.
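> 
> And a rough sketch of the upsert path (the master address and table name are 
> placeholders; df is any DataFrame matching the table's schema):
> 
> import org.apache.kudu.spark.kudu._
> 
> // KuduContext handles writes; upsertRows inserts or updates on the primary key
> val kuduContext = new KuduContext("kudumaster:7051")
> kuduContext.upsertRows(df, "stocks")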
> 
> Just a thought…
> 
> Cheers,
> Ben
> 
> 
>> On Oct 3, 2016, at 8:40 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> 
>> Hi,
>> 
>> I decided to create a composite ticker-date key from the csv file, so I did 
>> some manipulation on the CSV file:
>> 
>> export IFS=","        # split fields on commas in the read below
>> sed -i 1d tsco.csv    # drop the header line
>> cat tsco.csv | while read a b c d e f; do
>>   echo "TSCO-$a,TESCO PLC,TSCO,$a,$b,$c,$d,$e,$f"
>> done > temp
>> mv -f temp tsco.csv
>> 
>> This basically takes the csv file, tells the shell that the field separator 
>> is IFS=",", drops the header, reads every field in every line (a, b, c, ...), 
>> creates the composite key TSCO-$a, and adds the stock name and ticker to each 
>> line. The whole process can be automated and parameterised.
>> 
>> Once the csv file is put into HDFS, I run the following command:
>> 
>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv 
>> -Dimporttsv.separator=',' 
>> -Dimporttsv.columns="HBASE_ROW_KEY,stock_info:stock,stock_info:ticker,stock_daily:Date,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume"
>>  tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>> 
>> The HBase table is created as below:
>> 
>> create 'tsco','stock_info','stock_daily'
>> 
>> and this is the data (2 rows, each with 2 column families and 8 attributes):
>> 
>> hbase(main):132:0> scan 'tsco', LIMIT => 2
>> ROW                                                    COLUMN+CELL
>>  TSCO-1-Apr-08                                         
>> column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-08
>>  TSCO-1-Apr-08                                         
>> column=stock_daily:close, timestamp=1475507091676, value=405.25
>>  TSCO-1-Apr-08                                         
>> column=stock_daily:high, timestamp=1475507091676, value=406.75
>>  TSCO-1-Apr-08                                         
>> column=stock_daily:low, timestamp=1475507091676, value=379.25
>>  TSCO-1-Apr-08                                         
>> column=stock_daily:open, timestamp=1475507091676, value=380.00
>>  TSCO-1-Apr-08                                         
>> column=stock_daily:volume, timestamp=1475507091676, value=49664486
>>  TSCO-1-Apr-08                                         
>> column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>>  TSCO-1-Apr-08                                         
>> column=stock_info:ticker, timestamp=1475507091676, value=TSCO
>>  
>>  TSCO-1-Apr-09                                         
>> column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-09
>>  TSCO-1-Apr-09                                         
>> column=stock_daily:close, timestamp=1475507091676, value=333.30
>>  TSCO-1-Apr-09                                         
>> column=stock_daily:high, timestamp=1475507091676, value=334.60
>>  TSCO-1-Apr-09                                         
>> column=stock_daily:low, timestamp=1475507091676, value=326.50
>>  TSCO-1-Apr-09                                         
>> column=stock_daily:open, timestamp=1475507091676, value=331.10
>>  TSCO-1-Apr-09                                         
>> column=stock_daily:volume, timestamp=1475507091676, value=24877341
>>  TSCO-1-Apr-09                                         
>> column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>>  TSCO-1-Apr-09                                         
>> column=stock_info:ticker, timestamp=1475507091676, value=TSCO
>> 
>> Any suggestions
>> 
>> Thanks
>> 
>> Dr Mich Talebzadeh
>> 
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> 
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction. 
>> 
>> On 3 October 2016 at 14:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> or maybe add ticker+date, something like this:
>> 
>> 
>> <image.png>
>> 
>> So the new row key would be TSCO-1-Apr-08, 
>> 
>> and this would be added as the row key. Both Date and ticker would stay as 
>> they are, as column family attributes?
>> 
>> 
>> On 3 October 2016 at 14:32, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> With ticker+date I can create something like below for the row key:
>> 
>> TSCO_1-Apr-08
>> 
>> or TSCO1-Apr-08
>> 
>> if I understood you correctly.
>> 
>> 
>> On 3 October 2016 at 13:13, ayan guha <guha.a...@gmail.com> wrote:
>> Hi
>> 
>> Looks like you are saving to new.csv but still loading tsco.csv? It's 
>> definitely the header.
>> 
>> Suggestion: ticker+date as the row key has the following benefits:
>> 
>> 1. Using ticker+date as the row key will enable you to hold multiple tickers 
>> in this single HBase table (think composite primary key).
>> 2. Using the date alone as the row key will lead to hotspots (look up 
>> hotspotting due to monotonically increasing row keys). To distribute the 
>> load, it is suggested to use salting; the ticker can serve as a natural salt 
>> in this case.
>> 3. Also, you may want to hash the row key value to make it a little more 
>> flexible (think surrogate key); see the sketch below.
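>> 
>> As a rough Scala illustration of points 1-3 (the helper and the key layout 
>> are hypothetical, just to show the idea):
>> 
>> import java.security.MessageDigest
>> 
>> // Prefix the natural ticker+date key with a short hash so that writes
>> // spread across regions instead of hotspotting one region server.
>> def makeRowKey(ticker: String, tradeDate: String): String = {
>>   val digest = MessageDigest.getInstance("MD5")
>>     .digest(s"$ticker-$tradeDate".getBytes("UTF-8"))
>>   val salt = digest.map("%02x".format(_)).mkString.take(4)
>>   s"$salt-$ticker-$tradeDate"   // e.g. "<salt>-TSCO-1-Apr-08"
>> }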
>> 
>> 
>> 
>> On Mon, Oct 3, 2016 at 10:17 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> Hi Ayan,
>> 
>> Sounds like the row key has to be unique, much like a primary key in an RDBMS.
>> 
>> This is what I download as a CSV for the stock from Google Finance:
>> 
>>   Date       Open    High    Low     Close   Volume
>> 27-Sep-16    177.4   177.75  172.5   177.75  24117196
>> 
>> 
>> So what I do is add the stock name and ticker myself to the end of each row 
>> via a shell script and get rid of the header:
>> 
>> sed -i 1d tsco.csv; cat tsco.csv|awk '{print $0,",TESCO PLC,TSCO"}' > new.csv
>> 
>> This creates a new csv file with two additional columns appended to the end 
>> of each line.
>> 
>> The new table has two column families, stock_daily and stock_info, and the 
>> row key is the date (one row per date).
>> 
>> Then I run the following command 
>> 
>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv 
>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY, 
>> stock_daily:open, stock_daily:high, stock_daily:low, stock_daily:close, 
>> stock_daily:volume, stock_info:stock, stock_info:ticker" tsco 
>> hdfs://rhes564:9000/data/stocks/tsco.csv
>> 
>> This is in the HBase table for a given day:
>> 
>> hbase(main):090:0> scan 'tsco', LIMIT => 10
>> ROW                                                    COLUMN+CELL
>>  1-Apr-08                                              
>> column=stock_daily:close, timestamp=1475492248665, value=405.25
>>  1-Apr-08                                              
>> column=stock_daily:high, timestamp=1475492248665, value=406.75
>>  1-Apr-08                                              
>> column=stock_daily:low, timestamp=1475492248665, value=379.25
>>  1-Apr-08                                              
>> column=stock_daily:open, timestamp=1475492248665, value=380.00
>>  1-Apr-08                                              
>> column=stock_daily:volume, timestamp=1475492248665, value=49664486
>>  1-Apr-08                                              
>> column=stock_info:stock, timestamp=1475492248665, value=TESCO PLC
>>  1-Apr-08                                              
>> column=stock_info:ticker, timestamp=1475492248665, value=TSCO
>> 
>>   
>> But I also have this at the bottom
>> 
>>   Date                                                  
>> column=stock_daily:close, timestamp=1475491189158, value=Close
>>  Date                                                  
>> column=stock_daily:high, timestamp=1475491189158, value=High
>>  Date                                                  
>> column=stock_daily:low, timestamp=1475491189158, value=Low
>>  Date                                                  
>> column=stock_daily:open, timestamp=1475491189158, value=Open
>>  Date                                                  
>> column=stock_daily:volume, timestamp=1475491189158, value=Volume
>>  Date                                                  
>> column=stock_info:stock, timestamp=1475491189158, value=TESCO PLC
>>  Date                                                  
>> column=stock_info:ticker, timestamp=1475491189158, value=TSCO
>> 
>> Sounds like the table header?
>> 
>> 
>> On 3 October 2016 at 11:24, ayan guha <guha.a...@gmail.com> wrote:
>> I am not well versed with importtsv, but you can create a CSV file using a 
>> simple Spark program that makes the first column ticker+tradedate. I remember 
>> doing a similar manipulation to create the row key format in Pig; something 
>> like the sketch below.
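>> 
>> A minimal sketch for spark-shell (untested; the output path is an assumption, 
>> the input path is the one used earlier in this thread):
>> 
>> // Prepend ticker+tradedate as the first column so ImportTsv can map it to
>> // HBASE_ROW_KEY, and append the stock name and ticker columns as before.
>> val raw = sc.textFile("hdfs://rhes564:9000/data/stocks/tsco.csv")
>> val keyed = raw.filter(line => !line.startsWith("Date"))  // drop the header
>>   .map { line =>
>>     val date = line.split(",")(0)        // Date,Open,High,Low,Close,Volume
>>     s"TSCO-$date,TESCO PLC,TSCO,$line"
>>   }
>> keyed.saveAsTextFile("hdfs://rhes564:9000/data/stocks/tsco_keyed")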
>> 
>> On 3 Oct 2016 20:40, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>> Thanks Ayan,
>> 
>> How do you specify ticker+tradedate as the row key in the command below?
>> 
>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' 
>> -Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:ticker, 
>> stock_daily:tradedate, 
>> stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume"
>>  tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>> 
>> I always thought that HBase takes the first column as the row key, so here it 
>> takes the stock name as the row key, which is "Tesco PLC" for every row!
>> 
>> Does row key need to be unique?
>> 
>> cheers
>> 
>> 
>> 
>> On 3 October 2016 at 10:30, ayan guha <guha.a...@gmail.com> wrote:
>> Hi Mitch
>> 
>> It is more to do with HBase than Spark.
>> 
>> The row key can be anything, yes, but essentially what you are doing is 
>> inserting into and updating the same "Tesco PLC" row. Given your schema, 
>> ticker+tradedate seems to be a good row key.
>> 
>> On 3 Oct 2016 18:25, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>> Thanks again.
>> 
>> I added that jar file to the classpath and that part worked.
>> 
>> I was using spark-shell, so I will have to use spark-submit for it to be able 
>> to interact with the MapReduce job.
>> 
>> BTW, when I use the command-line utility ImportTsv to load a file into HBase 
>> with the following table format
>> 
>> describe 'marketDataHbase'
>> Table marketDataHbase is ENABLED
>> marketDataHbase
>> COLUMN FAMILIES DESCRIPTION
>> {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 
>> 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL 
>> => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', 
>> BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
>> 1 row(s) in 0.0930 seconds
>> 
>> 
>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' 
>> -Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:ticker, 
>> stock_daily:tradedate, 
>> stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume"
>>  tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>> 
>> There are 1200 rows in the csv file, but it only loads one row!
>> 
>> scan 'tsco'
>> ROW                                                    COLUMN+CELL
>>  Tesco PLC                                             
>> column=stock_daily:close, timestamp=1475447365118, value=325.25
>>  Tesco PLC                                             
>> column=stock_daily:high, timestamp=1475447365118, value=332.00
>>  Tesco PLC                                             
>> column=stock_daily:low, timestamp=1475447365118, value=324.00
>>  Tesco PLC                                             
>> column=stock_daily:open, timestamp=1475447365118, value=331.75
>>  Tesco PLC                                             
>> column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
>>  Tesco PLC                                             
>> column=stock_daily:tradedate, timestamp=1475447365118, value= 3-Jan-06
>>  Tesco PLC                                             
>> column=stock_daily:volume, timestamp=1475447365118, value=46935045
>> 1 row(s) in 0.0390 seconds
>> 
>> Is this because the HBASE_ROW_KEY --> Tesco PLC is the same for every row? I 
>> thought that the row key could be anything.
>> 
>> 
>> On 3 October 2016 at 07:44, Benjamin Kim <bbuil...@gmail.com> wrote:
>> We installed Apache Spark 1.6.0 alongside CDH 5.4.8 because Cloudera only had 
>> Spark 1.3.0 at the time, and we wanted to use Spark 1.6.0’s features. We 
>> borrowed the /etc/spark/conf/spark-env.sh file that Cloudera generated, 
>> because it was customized to add jars first from the paths listed in 
>> /etc/spark/conf/classpath.txt. So we entered the path to the htrace jar into 
>> /etc/spark/conf/classpath.txt, and then it worked: we could read/write to 
>> HBase.
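>> 
>> For reference, a minimal sketch of such a read, using the same 
>> TableInputFormat approach as the bulk-load snippet further down this thread 
>> (the table name is a placeholder):
>> 
>> import org.apache.hadoop.hbase.HBaseConfiguration
>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>> import org.apache.hadoop.hbase.client.Result
>> 
>> // Scan the whole table into an RDD of (row key, row contents) pairs
>> val hconf = HBaseConfiguration.create()
>> hconf.set(TableInputFormat.INPUT_TABLE, "testTable")
>> val hbaseRDD = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
>>   classOf[ImmutableBytesWritable], classOf[Result])
>> println(hbaseRDD.count())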
>> 
>>> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> 
>>> Thanks Ben
>>> 
>>> The thing is, I am using Spark 2 and no stack from CDH!
>>> 
>>> Is this approach to reading/writing to HBase specific to Cloudera?
>>> 
>>> 
>>> On 1 October 2016 at 23:39, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> Mich,
>>> 
>>> I know up until CDH 5.4 we had to add the HTrace jar to the classpath to 
>>> make it work using the command below. But after upgrading to CDH 5.7, it 
>>> became unnecessary.
>>> 
>>> echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar" >> 
>>> /etc/spark/conf/classpath.txt
>>> 
>>> Hope this helps.
>>> 
>>> Cheers,
>>> Ben
>>> 
>>> 
>>>> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> 
>>>> Trying a bulk load using HFiles in Spark, as in the example below:
>>>> 
>>>> import org.apache.spark._
>>>> import org.apache.spark.rdd.NewHadoopRDD
>>>> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
>>>> import org.apache.hadoop.hbase.client.HBaseAdmin
>>>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.hbase.HColumnDescriptor
>>>> import org.apache.hadoop.hbase.util.Bytes
>>>> import org.apache.hadoop.hbase.client.Put;
>>>> import org.apache.hadoop.hbase.client.HTable;
>>>> import org.apache.hadoop.hbase.mapred.TableOutputFormat
>>>> import org.apache.hadoop.mapred.JobConf
>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>> import org.apache.hadoop.mapreduce.Job
>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>>>> import org.apache.hadoop.hbase.KeyValue
>>>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
>>>> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
>>>> 
>>>> So far no issues.
>>>> 
>>>> Then I do
>>>> 
>>>> val conf = HBaseConfiguration.create()
>>>> conf: org.apache.hadoop.conf.Configuration = Configuration: 
>>>> core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, 
>>>> yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
>>>> val tableName = "testTable"
>>>> tableName: String = testTable
> ...
