I am not well versed in ImportTsv, but you can create a CSV file with a simple Spark program that makes the first column ticker+tradedate; a sketch follows. I remember doing a similar manipulation to build a row-key format in Pig.
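For instance, a minimal sketch of that preprocessing step (a hedged illustration, assuming the CSV layout described further down the thread: stock name, ticker, tradedate, open, high, low, close, volume; the output path is made up):

import org.apache.spark.sql.SparkSession

// Prepend a composite ticker_tradedate key as the first CSV column,
// so ImportTsv can map HBASE_ROW_KEY to a value that is unique per row.
val spark = SparkSession.builder().appName("CompositeKeyCsv").getOrCreate()

val keyed = spark.sparkContext
  .textFile("hdfs://rhes564:9000/data/stocks/tsco.csv")
  .map { line =>
    val cols = line.split(",", -1)
    // cols(0) = stock name, cols(1) = ticker, cols(2) = tradedate
    cols(1) + "_" + cols(2) + "," + line
  }

keyed.saveAsTextFile("hdfs://rhes564:9000/data/stocks/tsco_keyed")

The keyed file can then be loaded with the same ImportTsv invocation, mapping HBASE_ROW_KEY to the new first column and shifting the remaining column mappings along by one.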
On 3 Oct 2016 20:40, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:

> Thanks Ayan,
>
> How do you specify ticker+tradedate as the row key in the below?
>
> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,stock_daily:ticker,stock_daily:tradedate,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>
> I always thought that HBase takes the first column as the row key, so it takes the stock name as the row key, which is "Tesco PLC" for every row!
>
> Does the row key need to be unique?
>
> Cheers
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 3 October 2016 at 10:30, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> It is more to do with HBase than Spark.
>>
>> The row key can be anything, yes, but essentially what you are doing is inserting and then updating the same Tesco PLC row. Given your schema, ticker+tradedate seems to be a good row key.
>>
>> On 3 Oct 2016 18:25, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>>
>>> Thanks again.
>>>
>>> I added that jar file to the classpath and that part worked.
>>>
>>> I was using the Spark shell, so I have to use spark-submit for it to be able to interact with the map-reduce job.
>>>
>>> BTW, when I use the command-line utility ImportTsv to load a file into HBase with the following table format:
>>>
>>> describe 'marketDataHbase'
>>> Table marketDataHbase is ENABLED
>>> marketDataHbase
>>> COLUMN FAMILIES DESCRIPTION
>>> {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
>>> 1 row(s) in 0.0930 seconds
>>>
>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,stock_daily:ticker,stock_daily:tradedate,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>>
>>> There are 1200 rows in the csv file, *but it only loads the first row!*
>>>
>>> scan 'tsco'
>>> ROW         COLUMN+CELL
>>>  Tesco PLC  column=stock_daily:close, timestamp=1475447365118, value=325.25
>>>  Tesco PLC  column=stock_daily:high, timestamp=1475447365118, value=332.00
>>>  Tesco PLC  column=stock_daily:low, timestamp=1475447365118, value=324.00
>>>  Tesco PLC  column=stock_daily:open, timestamp=1475447365118, value=331.75
>>>  Tesco PLC  column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
>>>  Tesco PLC  column=stock_daily:tradedate, timestamp=1475447365118, value=3-Jan-06
>>>  Tesco PLC  column=stock_daily:volume, timestamp=1475447365118, value=46935045
>>> 1 row(s) in 0.0390 seconds
>>>
>>> Is this because the HBASE_ROW_KEY --> "Tesco PLC" is the same for all rows? I thought that the row key can be anything.
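On the uniqueness question: as ayan says above, a Put to an existing row key is an update, and with VERSIONS => '1' only the latest cell survives, so 1200 lines all keyed "Tesco PLC" collapse into a single row. A short sketch of the effect with the HBase 1.2 client API (hedged; the table and values are illustrative):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("tsco"))

// Two puts with the SAME row key: the second overwrites the first,
// so a subsequent scan shows one row, not two.
val p1 = new Put(Bytes.toBytes("Tesco PLC"))
p1.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"), Bytes.toBytes("325.25"))
table.put(p1)

val p2 = new Put(Bytes.toBytes("Tesco PLC"))
p2.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"), Bytes.toBytes("326.00"))
table.put(p2)

// A composite key such as "TSCO_3-Jan-06" keeps each trading day distinct.
val p3 = new Put(Bytes.toBytes("TSCO_3-Jan-06"))
p3.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"), Bytes.toBytes("325.25"))
table.put(p3)

table.close()
connection.close()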
>>>
>>> On 3 October 2016 at 07:44, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>>> We installed Apache Spark 1.6.0 alongside CDH 5.4.8 because Cloudera only had Spark 1.3.0 at the time, and we wanted to use Spark 1.6.0's features. We borrowed the /etc/spark/conf/spark-env.sh file that Cloudera generated, because it was customized to add jars first from the paths listed in /etc/spark/conf/classpath.txt. So we entered the path of the htrace jar into /etc/spark/conf/classpath.txt, and then it worked: we could read/write to HBase.
>>>>
>>>> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>> Thanks Ben
>>>>
>>>> The thing is I am using Spark 2 and no stack from CDH!
>>>>
>>>> Is this approach to reading/writing to HBase specific to Cloudera?
>>>>
>>>> On 1 October 2016 at 23:39, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>
>>>>> Mich,
>>>>>
>>>>> I know that up until CDH 5.4 we had to add the HTrace jar to the classpath to make it work, using the command below. But after upgrading to CDH 5.7, it became unnecessary.
>>>>>
>>>>> echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar" >> /etc/spark/conf/classpath.txt
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Cheers,
>>>>> Ben
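Whichever distribution is in play, a quick check from the Spark shell confirms whether the HTrace classes are actually visible (a hedged aside: org.apache.htrace.Trace is the class named in the stack trace below, and htrace-core 3.0.4 predates HTrace's move to Apache, so its classes live under org.htrace rather than org.apache.htrace; a 3.1.0-incubating or later jar is needed to provide the class HBase 1.2 looks for):

// Throws ClassNotFoundException if no jar on the driver classpath
// provides the class; returns the Class object if one does.
Class.forName("org.apache.htrace.Trace")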
>>>>>
>>>>> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Trying a bulk load using HFiles in Spark, as in the example below:
>>>>>
>>>>> import org.apache.spark._
>>>>> import org.apache.spark.rdd.NewHadoopRDD
>>>>> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
>>>>> import org.apache.hadoop.hbase.client.HBaseAdmin
>>>>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>>> import org.apache.hadoop.fs.Path
>>>>> import org.apache.hadoop.hbase.HColumnDescriptor
>>>>> import org.apache.hadoop.hbase.util.Bytes
>>>>> import org.apache.hadoop.hbase.client.Put
>>>>> import org.apache.hadoop.hbase.client.HTable
>>>>> import org.apache.hadoop.hbase.mapred.TableOutputFormat
>>>>> import org.apache.hadoop.mapred.JobConf
>>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>>> import org.apache.hadoop.mapreduce.Job
>>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>>>>> import org.apache.hadoop.hbase.KeyValue
>>>>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
>>>>> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
>>>>>
>>>>> So far no issues.
>>>>>
>>>>> Then I do:
>>>>>
>>>>> val conf = HBaseConfiguration.create()
>>>>> conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
>>>>> val tableName = "testTable"
>>>>> tableName: String = testTable
>>>>>
>>>>> But this one fails:
>>>>>
>>>>> scala> val table = new HTable(conf, tableName)
>>>>> java.io.IOException: java.lang.reflect.InvocationTargetException
>>>>>   at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:431)
>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:424)
>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:302)
>>>>>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:185)
>>>>>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:151)
>>>>>   ... 52 elided
>>>>> Caused by: java.lang.reflect.InvocationTargetException: java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>>>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>>>   at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
>>>>>   ... 57 more
>>>>> Caused by: java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>>>>>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:216)
>>>>>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:419)
>>>>>   at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>>>>>   at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:105)
>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:905)
>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:648)
>>>>>   ... 62 more
>>>>> Caused by: java.lang.ClassNotFoundException: org.apache.htrace.Trace
>>>>>
>>>>> I have got all the jar files in spark-defaults.conf:
>>>>>
>>>>> spark.driver.extraClassPath /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar
>>>>> spark.executor.extraClassPath /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar
>>>>>
>>>>> and also in the Spark shell where I test the code:
>>>>>
>>>>> --jars /home/hduser/jars/hbase-client-1.2.3.jar,/home/hduser/jars/hbase-server-1.2.3.jar,/home/hduser/jars/hbase-common-1.2.3.jar,/home/hduser/jars/hbase-protocol-1.2.3.jar,/home/hduser/jars/htrace-core-3.0.4.jar,/home/hduser/jars/hive-hbase-handler-2.1.0.jar
>>>>>
>>>>> So any ideas will be appreciated.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Dr Mich Talebzadeh
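For completeness, a sketch of how the HFile bulk load could proceed once the htrace jar is sorted out, using the Connection API rather than the deprecated HTable constructor (hedged: the toy RDD, table name and HFile path are illustrative, and sc is the Spark shell's SparkContext):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val tableName = TableName.valueOf("testTable")
val table = connection.getTable(tableName)
val regionLocator = connection.getRegionLocator(tableName)

// Configure the job so HFileOutputFormat2 writes HFiles aligned with
// the table's region boundaries.
val job = Job.getInstance(conf)
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

// KeyValues must arrive sorted by row key; the single-pair RDD is a stand-in.
val kvs = sc.parallelize(Seq(("TSCO_3-Jan-06", "325.25")))
  .sortByKey()
  .map { case (rowKey, close) =>
    (new ImmutableBytesWritable(Bytes.toBytes(rowKey)),
     new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes("stock_daily"),
       Bytes.toBytes("close"), Bytes.toBytes(close)))
  }

val hfileDir = "hdfs://rhes564:9000/tmp/tsco_hfiles"
kvs.saveAsNewAPIHadoopFile(hfileDir, classOf[ImmutableBytesWritable],
  classOf[KeyValue], classOf[HFileOutputFormat2], job.getConfiguration)

// Hand the generated HFiles over to the region servers.
new LoadIncrementalHFiles(conf).doBulkLoad(new Path(hfileDir),
  connection.getAdmin, table, regionLocator)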