Re: pyspark get column family and qualifier names from hbase table

2014-11-12 Thread freedafeng
Hi, 

This is my code:

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.api.python.Converter

/**
 * JF: convert a Result object into a string with column family and
 * qualifier names. Sth like
 * 'columnfamily1:columnqualifier1:value1;columnfamily2:columnqualifier2:value2'.
 * Cells (k-v pairs) are separated by ';'; the family, qualifier, and value
 * within each cell are separated by ':'.
 * Notice that we don't need the row key here, because it has already been
 * converted by ImmutableBytesWritableToStringConverter.
 */
class CustomHBaseResultToStringConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = {
    val result = obj.asInstanceOf[Result]

    result.rawCells().map(cell =>
      List(Bytes.toString(CellUtil.cloneFamily(cell)),
           Bytes.toString(CellUtil.cloneQualifier(cell)),
           Bytes.toString(CellUtil.cloneValue(cell))).mkString(":")
    ).mkString(";")
  }
}

I recommend using different delimiters (to replace ':' or ';') if your data
contains those characters. I am not a seasoned Scala programmer, so there may
be a more flexible solution; for example, the delimiters could be made
dynamically assignable.
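On the pyspark side, the delimiter concern can also be softened when parsing the converter's output: splitting each cell with a maxsplit keeps a value that happens to contain ':' intact, as long as family and qualifier names never do. A minimal sketch, assuming the 'family:qualifier:value;...' format described above (the function name is mine, not part of the Spark examples):

```python
def parse_cells(converted, cell_sep=";", field_sep=":"):
    """Parse 'family:qualifier:value;...' into (family, qualifier, value)
    tuples. maxsplit=2 keeps field_sep characters inside the value intact;
    a value containing cell_sep would still break this scheme."""
    return [tuple(cell.split(field_sep, 2)) for cell in converted.split(cell_sep)]

# A value containing ':' survives because of the maxsplit:
print(parse_cells("f1:q1:a:b;f2:q2:value2"))
# [('f1', 'q1', 'a:b'), ('f2', 'q2', 'value2')]
```

The delimiters are plain keyword arguments here, which is one way to make them dynamically assignable as suggested above.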

I will try to open a PR probably later today.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18744.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: pyspark get column family and qualifier names from hbase table

2014-11-12 Thread freedafeng
Hi Nick,

I saw that the HBase API has gone through a lot of changes. If I remember
correctly, the default HBase version in Spark 1.1.0 is 0.94.6, while the one I
am using is 0.98.1. To get the column family names and qualifier names, we
need to call different methods for these two versions. I don't know how to do
that... sorry...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18749.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
Hello there,

I am wondering how to get the column family names and column qualifier names
when using pyspark to read an hbase table with multiple column families.

I have a hbase table as follows,
hbase(main):007:0> scan 'data1'
ROW   COLUMN+CELL   
 row1 column=f1:, timestamp=1411078148186, value=value1 
 row1 column=f2:, timestamp=1415732470877, value=value7 
 row2 column=f2:, timestamp=1411078160265, value=value2 

when I ran the examples/hbase_inputformat.py code:

conf2 = {"hbase.zookeeper.quorum": "localhost",
         "hbase.mapreduce.inputtable": "data1"}
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf2)
output = hbase_rdd.collect()
for (k, v) in output:
    print (k, v)
I only see 
(u'row1', u'value1')
(u'row2', u'value2')

What I really want is (row_id, column family:column qualifier, value)
tuples. Any comments? Thanks!
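Once a custom converter emits one 'family:qualifier:value;...' string per row (as in the converter posted later in this thread), the collected pairs can be flattened into the desired (row_id, 'family:qualifier', value) tuples on the pyspark side. A hedged sketch; the helper name and the input format are assumptions, not part of the Spark examples:

```python
def to_row_tuples(row_key, converted):
    """Expand one (row_key, 'family:qualifier:value;...') pair into
    (row_key, 'family:qualifier', value) tuples."""
    tuples = []
    for cell in converted.split(";"):
        # maxsplit=2 so a ':' inside the value does not break parsing
        family, qualifier, value = cell.split(":", 2)
        tuples.append((row_key, "%s:%s" % (family, qualifier), value))
    return tuples

# Applied to collected (key, value) pairs; empty qualifiers, as in the
# 'data1' table above, come out as 'f1:':
rows = [("row1", "f1::value1;f2::value7"), ("row2", "f2::value2")]
flat = [t for (k, v) in rows for t in to_row_tuples(k, v)]
print(flat)
# [('row1', 'f1:', 'value1'), ('row1', 'f2:', 'value7'), ('row2', 'f2:', 'value2')]
```

On the RDD itself this would be something like hbase_rdd.flatMap(lambda kv: to_row_tuples(*kv)) before collecting.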



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
I checked the source and found the following:

class HBaseResultToStringConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = {
val result = obj.asInstanceOf[Result]
Bytes.toStringBinary(result.value())
  }
}

I feel using 'result.value()' here is a big limitation: it returns only the
value of the first column. Converting from the 'list()' of the 'Result' is
more general and easier to use.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18619.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
I just wrote a custom converter in Scala to replace
HBaseResultToStringConverter. Just a couple of lines of code.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18639.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread alaa
Hey freedafeng, I'm exactly where you are. I want the output to show the
rowkey and all column qualifiers that correspond to it. How did you write
HBaseResultToStringConverter to do what you wanted it to do?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18650.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread Nick Pentreath
Feel free to add that converter as an option in the Spark examples via a PR :)

—
Sent from Mailbox

On Wed, Nov 12, 2014 at 3:27 AM, alaa contact.a...@gmail.com wrote:

 Hey freedafeng, I'm exactly where you are. I want the output to show the
 rowkey and all column qualifiers that correspond to it. How did you write
 HBaseResultToStringConverter to do what you wanted it to do?