Re: Spark Plugin Exception - java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row

2015-09-23 Thread Josh Mahonin
Hi Babar,

Can you file a JIRA for this? I suspect this is something to do with the Spark 
1.5 data frame API data structures, perhaps they've gone and changed them again!

Can you try with previous Spark versions to see if there's a difference? Also, 
you may have luck interfacing with the RDDs directly instead of the data frames.
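
For what it's worth, the RDD interface looks roughly like this (a quick,
untested sketch; the TABLE1 / ID / COL1 schema and the localhost ZooKeeper
quorum are placeholders from the docs, so adjust them to your setup):

import org.apache.spark.SparkContext
import org.apache.phoenix.spark._   // adds phoenixTableAsRDD to SparkContext

val sc = new SparkContext("local", "phoenix-rdd-test")

// Loads TABLE1 as an RDD[Map[String, AnyRef]] rather than going through the DataFrame API
val rdd = sc.phoenixTableAsRDD("TABLE1", Seq("ID", "COL1"), zkUrl = Some("localhost:2181"))

println(rdd.count())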

Thanks!

Josh

From: Babar Tareen
Reply-To: "user@phoenix.apache.org"
Date: Tuesday, September 22, 2015 at 5:47 PM
To: "user@phoenix.apache.org"
Subject: Spark Plugin Exception - java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
org.apache.spark.sql.Row

Hi,

I am trying to run the Spark plugin DataFrame sample code available here
(https://phoenix.apache.org/phoenix_spark.html) and am getting the following
exception. I am running the code against HBase 1.1.1, Spark 1.5.0 and Phoenix
4.5.2. HBase is running in standalone mode, locally on OS X. Any ideas what
might be causing this exception?
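
For reference, the sample I'm running is essentially the one from that page;
a rough reconstruction (the TABLE1 / ID / COL1 schema and the localhost
ZooKeeper quorum come from the docs and my local setup):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "phoenix-df-test")
val sqlContext = new SQLContext(sc)

// Load TABLE1 (ID, COL1) through the Phoenix data source
val df = sqlContext.load("org.apache.phoenix.spark",
  Map("table" -> "TABLE1", "zkUrl" -> "localhost:2181"))

df.filter(df("COL1") === "test_row_1" && df("ID") === 1L)
  .select(df("ID"))
  .show

// An aggregate such as count() goes through the TungstenAggregate path in the trace below
println(df.count())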


java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
org.apache.spark.sql.Row
at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:439) 
~[spark-sql_2.11-1.5.0.jar:1.5.0]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) 
~[scala-library-2.11.4.jar:na]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) 
~[scala-library-2.11.4.jar:na]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) 
~[scala-library-2.11.4.jar:na]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:366)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
 ~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.scheduler.Task.run(Task.scala:88) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_45]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]

Thanks,
Babar


Re: BulkloadTool issue even after successful HfileLoads

2015-09-23 Thread Gabriel Reid
Hi Dhruv,

This is a bug in Phoenix, although it appears that your hadoop
configuration is also somewhat unusual.

As far as I can see, your hadoop configuration is set up to use the
local filesystem, and not hdfs. You can test this by running the
following command:

hadoop dfs -ls /

If that command lists the contents of the root directory on your local
machine, then hadoop is set up to use your local filesystem, and not
HDFS.

The bulk load tool currently uses the file system that is configured
for Hadoop for deleting the (temporary) output directory. However,
because your setup uses the local filesystem by default, but you're
using HDFS via the file paths that you supply to the tool, deleting
the temporary output directory fails.

The quick fix (if feasible) is to set up your hadoop command to use
HDFS as the default file system. However, could you also log this as a
bug in JIRA (https://issues.apache.org/jira/browse/PHOENIX)?
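
For reference, the default filesystem is normally configured via fs.defaultFS
in core-site.xml; a sketch along these lines (substitute your cluster's
actual NameNode host and port):

<!-- core-site.xml: make HDFS the default filesystem for the hadoop command -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://cdh3.st.comany.org:8020</value>
</property>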

- Gabriel


On Wed, Sep 23, 2015 at 2:45 PM, Dhruv Gohil  wrote:
> Hi,
> I am able to successfully use BulkLoadTool to load millions of rows in
> phoenixTable, but at end of each execution following error occurs. Need your
> inputs to make runs full green.
>
> Following is minimal reproduction using EXAMPLE given in documentation.
> Env:
> CDH 5.3
> Hbase 0.98
> Phoenix 4.2.2
>
>> Running BulkloadTool from one of the HDFS-CDH cluster machine.
>  Exactly following instructions :
> https://phoenix.apache.org/bulk_dataload.html
>
> HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar
> phoenix-4.2.2-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool
> --table EXAMPLE --input
> hdfs://cdh3.st.comany.org:8020/data/example.csv --output
> hdfs://cdh3.st.comany.org:8020/user/ --zookeeper 10.10.10.1
>
>
> CsvBulkLoadTool - Import job on table=EXAMPLE failed due to
> exception:java.lang.IllegalArgumentException: Wrong FS:
> hdfs://cdh3.st.comany.org:8020/user/EXAMPLE, expected: file:///
>
> Job output log:
>
> 2015-09-23 17:54:09,633 [phoenix-2-thread-0] INFO
> org.apache.hadoop.mapreduce.Job - Job job_local1620809330_0001 completed
> successfully
> 2015-09-23 17:54:09,663 [phoenix-2-thread-0] INFO
> org.apache.hadoop.mapreduce.Job - Counters: 33
> File System Counters
> FILE: Number of bytes read=34018940
> FILE: Number of bytes written=34812676
> FILE: Number of read operations=0
> FILE: Number of large read operations=0
> FILE: Number of write operations=0
> HDFS: Number of bytes read=68
> HDFS: Number of bytes written=1241
> HDFS: Number of read operations=15
> HDFS: Number of large read operations=0
> HDFS: Number of write operations=5
> Map-Reduce Framework
> Map input records=2
> Map output records=6
> Map output bytes=330
> Map output materialized bytes=348
> Input split bytes=117
> Combine input records=0
> Combine output records=0
> Reduce input groups=2
> Reduce shuffle bytes=0
> Reduce input records=6
> Reduce output records=6
> Spilled Records=12
> Shuffled Maps =0
> Failed Shuffles=0
> Merged Map outputs=0
> GC time elapsed (ms)=0
> CPU time spent (ms)=0
> Physical memory (bytes) snapshot=0
> Virtual memory (bytes) snapshot=0
> Total committed heap usage (bytes)=736100352
> Phoenix MapReduce Import
> Upserts Done=2
> File Input Format Counters
> Bytes Read=34
> File Output Format Counters
> Bytes Written=1241
> 2015-09-23 17:54:09,664 [phoenix-2-thread-0] INFO
> org.apache.phoenix.mapreduce.CsvBulkLoadTool - Loading HFiles from
> hdfs://cdh3.st.comany.org:8020/user/EXAMPLE
> 2015-09-23 17:54:12,428 [phoenix-2-thread-0] WARN
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles - Skipping
> non-directory hdfs://cdh3.st.comany.org:8020/user/EXAMPLE/_SUCCESS
> 2015-09-23 17:54:16,233 [LoadIncrementalHFiles-0] INFO
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles - Trying to load
> hfile=hdfs://cdh3.st.comany.org:8020/user/EXAMPLE/M/1fe6f08c1a2f431ba87e08ce27ec813d
> first=\x80\x00\x00\x00\x00\x0009 last=\x80\x00\x00\x00\x00\x01\x092
> 2015-09-23 17:54:17,898 [phoenix-2-thread-0] INFO
> org.apache.phoenix.mapreduce.CsvBulkLoadTool - Incremental load complete for
> table=EXAMPLE
> 2015-09-23 17:54:17,899 [phoenix-2-thread-0] INFO
> org.apache.phoenix.mapreduce.CsvBulkLoadTool - Removing output directory
> hdfs://cdh3.st.comany.org:8020/user/EXAMPLE
> 2015-09-23 17:54:17,900 [phoenix-2-thread-0] ERROR
> org.apache.phoenix.mapreduce.CsvBulkLoadTool - Import job on table=EXAMPLE
> failed due to exception:java.lang.IllegalArgumentException: Wrong FS:
> hdfs://cdh3.st.comany.org:8020/user/EXAMPLE, expected: file:///
>
> --
> Dhruv


BulkloadTool issue even after successful HfileLoads

2015-09-23 Thread Dhruv Gohil

Hi,
I am able to successfully use the BulkLoadTool to load millions of rows
into a Phoenix table, but at the end of each execution the following error
occurs. I need your input to get the runs fully green.


Following is minimal reproduction using EXAMPLE given in documentation.
Env:
CDH 5.3
Hbase 0.98
Phoenix 4.2.2

> Running the BulkloadTool from one of the HDFS-CDH cluster machines,
exactly following the instructions at:
https://phoenix.apache.org/bulk_dataload.html


HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar 
phoenix-4.2.2-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table 
EXAMPLE --input
hdfs://cdh3.st.comany.org:8020/data/example.csv --output
hdfs://cdh3.st.comany.org:8020/user/ --zookeeper 10.10.10.1


CsvBulkLoadTool - Import job on table=EXAMPLE failed due to 
exception:java.lang.IllegalArgumentException: Wrong FS: 
hdfs://cdh3.st.comany.org:8020/user/EXAMPLE, expected: file:///


Job output log:
2015-09-23 17:54:09,633 [phoenix-2-thread-0] INFO 
org.apache.hadoop.mapreduce.Job - Job job_local1620809330_0001 
completed successfully
2015-09-23 17:54:09,663 [phoenix-2-thread-0] INFO 
org.apache.hadoop.mapreduce.Job - Counters: 33

File System Counters
FILE: Number of bytes read=34018940
FILE: Number of bytes written=34812676
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=68
HDFS: Number of bytes written=1241
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Map-Reduce Framework
Map input records=2
Map output records=6
Map output bytes=330
Map output materialized bytes=348
Input split bytes=117
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=0
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=0
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=736100352
Phoenix MapReduce Import
Upserts Done=2
File Input Format Counters
Bytes Read=34
File Output Format Counters
Bytes Written=1241
2015-09-23 17:54:09,664 [phoenix-2-thread-0] INFO 
org.apache.phoenix.mapreduce.CsvBulkLoadTool - Loading HFiles from 
hdfs://cdh3.st.comany.org:8020/user/EXAMPLE
2015-09-23 17:54:12,428 [phoenix-2-thread-0] WARN 
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles - Skipping 
non-directory hdfs://cdh3.st.comany.org:8020/user/EXAMPLE/_SUCCESS
2015-09-23 17:54:16,233 [LoadIncrementalHFiles-0] INFO 
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles - Trying to 
load 
hfile=hdfs://cdh3.st.comany.org:8020/user/EXAMPLE/M/1fe6f08c1a2f431ba87e08ce27ec813d 
first=\x80\x00\x00\x00\x00\x0009 last=\x80\x00\x00\x00\x00\x01\x092
2015-09-23 17:54:17,898 [phoenix-2-thread-0] INFO 
org.apache.phoenix.mapreduce.CsvBulkLoadTool - Incremental load 
complete for table=EXAMPLE
2015-09-23 17:54:17,899 [phoenix-2-thread-0] INFO 
org.apache.phoenix.mapreduce.CsvBulkLoadTool - Removing output 
directory hdfs://cdh3.st.comany.org:8020/user/EXAMPLE
2015-09-23 17:54:17,900 [phoenix-2-thread-0] ERROR 
org.apache.phoenix.mapreduce.CsvBulkLoadTool - Import job on 
table=EXAMPLE failed due to 
exception:java.lang.IllegalArgumentException: Wrong FS: 
hdfs://cdh3.st.comany.org:8020/user/EXAMPLE, expected: file:///

--
Dhruv


Re: BulkloadTool issue even after successful HfileLoads

2015-09-23 Thread Dhruv Gohil

Thanks Gabriel for the quick response.
You are right about my Hadoop config; it's somehow using the local
filesystem. (Let me find out how to change this in CDH.)


For the tool, "--output" is an optional argument, so it should work without
that argument.
But in that case we are not able to load the HFiles at all. The tool gets
stuck at that point for 15-20 minutes and then fails without ANY records in
Phoenix.

See the following logs:
2015-09-23 19:09:57,618 [phoenix-2-thread-0] INFO 
org.apache.phoenix.mapreduce.CsvBulkLoadTool - Loading HFiles from 
/tmp/a87eaa26-8cf4-4589-9ac9-6662e86f8547/EXAMPLE
2015-09-23 19:09:59,261 [phoenix-2-thread-0] WARN 
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles - Skipping 
non-directory 
file:/tmp/a87eaa26-8cf4-4589-9ac9-6662e86f8547/EXAMPLE/_SUCCESS
2015-09-23 19:10:01,469 [LoadIncrementalHFiles-0] INFO 
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles - Trying to 
load 
hfile=file:/tmp/a87eaa26-8cf4-4589-9ac9-6662e86f8547/EXAMPLE/M/4854e10c013944e4ab1195c70ff530ee 
first=\x80\x00\x00\x00\x00\x0009 last=\x80\x00\x00\x00\x00\x01\x092


  That's why we explicitly provided --output as an hdfs:// path, and things
at least worked.


--
Dhruv
On Wednesday 23 September 2015 06:54 PM, Gabriel Reid wrote:

Hi Dhruv,

This is a bug in Phoenix, although it appears that your hadoop
configuration is also somewhat unusual.

As far as I can see, your hadoop configuration is set up to use the
local filesystem, and not hdfs. You can test this by running the
following command:

 hadoop dfs -ls /

If that command lists the contents of the root directory on your local
machine, then hadoop is set up to use your local filesystem, and not
HDFS.

The bulk load tool currently uses the file system that is configured
for Hadoop for deleting the (temporary) output directory. However,
because your setup uses the local filesystem by default, but you're
using HDFS via the file paths that you supply to the tool, deleting
the temporary output directory fails.

The quick fix (if feasible) is to set up your hadoop command to use
HDFS as the default file system. However, could you also log this as a
bug in the JIRA (https://issues.apache.org/jira/browse/PHOENIX)

- Gabriel


On Wed, Sep 23, 2015 at 2:45 PM, Dhruv Gohil  wrote:

Hi,
 I am able to successfully use BulkLoadTool to load millions of rows in
phoenixTable, but at end of each execution following error occurs. Need your
inputs to make runs full green.

 Following is minimal reproduction using EXAMPLE given in documentation.
Env:
 CDH 5.3
 Hbase 0.98
 Phoenix 4.2.2


Running BulkloadTool from one of the HDFS-CDH cluster machine.

  Exactly following instructions :
https://phoenix.apache.org/bulk_dataload.html

HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar
phoenix-4.2.2-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool
--table EXAMPLE --input
hdfs://cdh3.st.comany.org:8020/data/example.csv --output
hdfs://cdh3.st.comany.org:8020/user/ --zookeeper 10.10.10.1


CsvBulkLoadTool - Import job on table=EXAMPLE failed due to
exception:java.lang.IllegalArgumentException: Wrong FS:
hdfs://cdh3.st.comany.org:8020/user/EXAMPLE, expected: file:///

Job output log:

2015-09-23 17:54:09,633 [phoenix-2-thread-0] INFO
org.apache.hadoop.mapreduce.Job - Job job_local1620809330_0001 completed
successfully
2015-09-23 17:54:09,663 [phoenix-2-thread-0] INFO
org.apache.hadoop.mapreduce.Job - Counters: 33
File System Counters
FILE: Number of bytes read=34018940
FILE: Number of bytes written=34812676
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=68
HDFS: Number of bytes written=1241
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Map-Reduce Framework
Map input records=2
Map output records=6
Map output bytes=330
Map output materialized bytes=348
Input split bytes=117
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=0
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=0
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=736100352
Phoenix MapReduce Import
Upserts Done=2
File Input Format Counters
Bytes Read=34
File Output Format Counters
Bytes Written=1241
2015-09-23 17:54:09,664 [phoenix-2-thread-0] INFO
org.apache.phoenix.mapreduce.CsvBulkLoadTool - Loading HFiles from
hdfs://cdh3.st.comany.org:8020/user/EXAMPLE
2015-09-23 17:54:12,428 [phoenix-2-thread-0] WARN
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles - Skipping
non-directory hdfs://cdh3.st.comany.org:8020/user/EXAMPLE/_SUCCESS
2015-09-23 17:54:16,233 [LoadIncrementalHFiles-0] INFO
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles - Trying to load

Re: BulkloadTool issue even after successful HfileLoads

2015-09-23 Thread Gabriel Reid
That's correct: without the output written to HDFS, HBase won't
be able to load the HFiles unless HBase is also hosted on the local
filesystem.

In any case, properly configuring Hadoop to use the same HDFS setup
everywhere will be the easiest way to get things working.

- Gabriel

On Wed, Sep 23, 2015 at 3:47 PM, Dhruv Gohil  wrote:
> Thanks Gabriel for quick response,
> You are right about my Hadoop Config, its somehow using local
> filesystem. (let me find in CDH on how to change this)
>
> So for tool "--output" is optional arg, so should work without that
> argument.
> But in that case we are not at all able to loadHfiles. Tool get stuck at
> that for 15-20 mins and then fails without ANY record in Phoenix.
> See following logs..
>>
>> 2015-09-23 19:09:57,618 [phoenix-2-thread-0] INFO
>> org.apache.phoenix.mapreduce.CsvBulkLoadTool - Loading HFiles from
>> /tmp/a87eaa26-8cf4-4589-9ac9-6662e86f8547/EXAMPLE
>> 2015-09-23 19:09:59,261 [phoenix-2-thread-0] WARN
>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles - Skipping
>> non-directory
>> file:/tmp/a87eaa26-8cf4-4589-9ac9-6662e86f8547/EXAMPLE/_SUCCESS
>> 2015-09-23 19:10:01,469 [LoadIncrementalHFiles-0] INFO
>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles - Trying to load
>> hfile=file:/tmp/a87eaa26-8cf4-4589-9ac9-6662e86f8547/EXAMPLE/M/4854e10c013944e4ab1195c70ff530ee
>> first=\x80\x00\x00\x00\x00\x0009 last=\x80\x00\x00\x00\x00\x01\x092
>
>
>   That's why we explicitly provided ---output as hdfs:// and things atleast
> worked.
>
> --
> Dhruv
>
> On Wednesday 23 September 2015 06:54 PM, Gabriel Reid wrote:
>>
>> Hi Dhruv,
>>
>> This is a bug in Phoenix, although it appears that your hadoop
>> configuration is also somewhat unusual.
>>
>> As far as I can see, your hadoop configuration is set up to use the
>> local filesystem, and not hdfs. You can test this by running the
>> following command:
>>
>>  hadoop dfs -ls /
>>
>> If that command lists the contents of the root directory on your local
>> machine, then hadoop is set up to use your local filesystem, and not
>> HDFS.
>>
>> The bulk load tool currently uses the file system that is configured
>> for Hadoop for deleting the (temporary) output directory. However,
>> because your setup uses the local filesystem by default, but you're
>> using HDFS via the file paths that you supply to the tool, deleting
>> the temporary output directory fails.
>>
>> The quick fix (if feasible) is to set up your hadoop command to use
>> HDFS as the default file system. However, could you also log this as a
>> bug in the JIRA (https://issues.apache.org/jira/browse/PHOENIX)
>>
>> - Gabriel
>>
>>
>> On Wed, Sep 23, 2015 at 2:45 PM, Dhruv Gohil 
>> wrote:
>>>
>>> Hi,
>>>  I am able to successfully use BulkLoadTool to load millions of rows
>>> in
>>> phoenixTable, but at end of each execution following error occurs. Need
>>> your
>>> inputs to make runs full green.
>>>
>>>  Following is minimal reproduction using EXAMPLE given in
>>> documentation.
>>> Env:
>>>  CDH 5.3
>>>  Hbase 0.98
>>>  Phoenix 4.2.2
>>>
 Running BulkloadTool from one of the HDFS-CDH cluster machine.
>>>
>>>   Exactly following instructions :
>>> https://phoenix.apache.org/bulk_dataload.html
>>>
>>> HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop
>>> jar
>>> phoenix-4.2.2-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool
>>> --table EXAMPLE --input
>>> hdfs://cdh3.st.comany.org:8020/data/example.csv --output
>>> hdfs://cdh3.st.comany.org:8020/user/ --zookeeper 10.10.10.1
>>>
>>>
>>> CsvBulkLoadTool - Import job on table=EXAMPLE failed due to
>>> exception:java.lang.IllegalArgumentException: Wrong FS:
>>> hdfs://cdh3.st.comany.org:8020/user/EXAMPLE, expected: file:///
>>>
>>> Job output log:
>>>
>>> 2015-09-23 17:54:09,633 [phoenix-2-thread-0] INFO
>>> org.apache.hadoop.mapreduce.Job - Job job_local1620809330_0001 completed
>>> successfully
>>> 2015-09-23 17:54:09,663 [phoenix-2-thread-0] INFO
>>> org.apache.hadoop.mapreduce.Job - Counters: 33
>>> File System Counters
>>> FILE: Number of bytes read=34018940
>>> FILE: Number of bytes written=34812676
>>> FILE: Number of read operations=0
>>> FILE: Number of large read operations=0
>>> FILE: Number of write operations=0
>>> HDFS: Number of bytes read=68
>>> HDFS: Number of bytes written=1241
>>> HDFS: Number of read operations=15
>>> HDFS: Number of large read operations=0
>>> HDFS: Number of write operations=5
>>> Map-Reduce Framework
>>> Map input records=2
>>> Map output records=6
>>> Map output bytes=330
>>> Map output materialized bytes=348
>>> Input split bytes=117
>>> Combine input records=0
>>> Combine output records=0
>>> Reduce input groups=2
>>> Reduce shuffle bytes=0
>>> Reduce input records=6
>>> Reduce output records=6
>>> Spilled Records=12
>>> Shuffled Maps =0
>>> Failed Shuffles=0
>>> Merged Map outputs=0
>>> GC time elapsed (ms)=0
>>> CPU 

Re: Spark Plugin Exception - java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row

2015-09-23 Thread Babar Tareen
I have filed PHOENIX-2287 for this. And the code works fine with Spark 1.4.1.

Thanks

On Wed, Sep 23, 2015 at 6:06 AM Josh Mahonin  wrote:

> Hi Babar,
>
> Can you file a JIRA for this? I suspect this is something to do with the
> Spark 1.5 data frame API data structures, perhaps they've gone and changed
> them again!
>
> Can you try with previous Spark versions to see if there's a difference?
> Also, you may have luck interfacing with the RDDs directly instead of the
> data frames.
>
> Thanks!
>
> Josh
>
> From: Babar Tareen
> Reply-To: "user@phoenix.apache.org"
> Date: Tuesday, September 22, 2015 at 5:47 PM
> To: "user@phoenix.apache.org"
> Subject: Spark Plugin Exception - java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast
> to org.apache.spark.sql.Row
>
> Hi,
>
> I am trying to run the spark plugin DataFrame sample code available here (
> https://phoenix.apache.org/phoenix_spark.html) and getting following
> exception.  I am running the code against hbase-1.1.1, spark 1.5.0 and
> phoenix  4.5.2. HBase is running in standalone mode, locally on OS X.  Any
> ideas what might be causing this exception?
>
>
> java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast
> to org.apache.spark.sql.Row
> at
> org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:439)
> ~[spark-sql_2.11-1.5.0.jar:1.5.0]
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
> ~[scala-library-2.11.4.jar:na]
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
> ~[scala-library-2.11.4.jar:na]
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
> ~[scala-library-2.11.4.jar:na]
> at
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:366)
> ~[spark-sql_2.11-1.5.0.jar:1.5.0]
> at
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> ~[spark-sql_2.11-1.5.0.jar:1.5.0]
> at
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$
> 1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> ~[spark-sql_2.11-1.5.0.jar:1.5.0]
> at
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> ~[spark-sql_2.11-1.5.0.jar:1.5.0]
> at
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> ~[spark-sql_2.11-1.5.0.jar:1.5.0]
> at
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> ~[spark-core_2.11-1.5.0.jar:1.5.0]
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> ~[spark-core_2.11-1.5.0.jar:1.5.0]
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> ~[spark-core_2.11-1.5.0.jar:1.5.0]
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> ~[spark-core_2.11-1.5.0.jar:1.5.0]
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> ~[spark-core_2.11-1.5.0.jar:1.5.0]
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> ~[spark-core_2.11-1.5.0.jar:1.5.0]
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> ~[spark-core_2.11-1.5.0.jar:1.5.0]
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> ~[spark-core_2.11-1.5.0.jar:1.5.0]
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> ~[spark-core_2.11-1.5.0.jar:1.5.0]
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> ~[spark-core_2.11-1.5.0.jar:1.5.0]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [na:1.8.0_45]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [na:1.8.0_45]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
>
> Thanks,
> Babar
>


Re: Setting a TTL in an upsert

2015-09-23 Thread James Taylor
Hi Alex,
I can think of a couple of ways to support this:
1) Surface support for per-cell TTLs (HBASE-10560) in Phoenix
(PHOENIX-1335). This could have the kind of syntax you mentioned (or
alternatively rely on a connection property, in which case no syntactic
change would be necessary). Then, in MutationState (where Phoenix produces
HBase Mutations), you'd need to use the HBase API to set the TTLs. You'd
also need to deal with setting secondary index rows to have the same TTLs
as their data rows.
2) Use the CurrentSCN property at connection time for UPSERT calls to
future date the cell timestamp. You'd also need to set the CurrentSCN
property for readers above any value you used at UPSERT time as otherwise
you wouldn't see the data you wrote.
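
A rough, untested sketch of what (2) could look like from the client side,
using your example "test" table; the localhost quorum and the one-year
offset are just placeholders:

import java.sql.DriverManager
import java.util.Properties

// Future-date the write by roughly one year via the CurrentSCN connection property
val props = new Properties()
props.setProperty("CurrentSCN",
  (System.currentTimeMillis() + 365L * 24 * 3600 * 1000).toString)

val conn = DriverManager.getConnection("jdbc:phoenix:localhost", props)
conn.createStatement().executeUpdate("UPSERT INTO test VALUES (1, 2, 3)")
conn.commit()
conn.close()

// As noted above, readers would need their own CurrentSCN above this value
// to see the row.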

If you're up for it, (1) would be a nice contribution and definitely a
viable feature.

Thanks,
James

On Wed, Sep 23, 2015 at 9:08 AM, Alex Loffler 
wrote:

> Hi,
>
> Thanks for the response – would this be a viable feature request? We’re
> moving from using raw HBase to Phoenix and would like to use this
> ‘countdown’ feature to allow for different rows in the same table to have
> different retention times. Instead of having to index a user created TTL
> column and create a script to manually garbage collect the stale rows, we
> could continue to leverage HBase’s TTL mechanism to automatically exclude
> the rows and physically delete them on the next major compaction.
>
> From the documentation, Phoenix supports TTL on secondary indexes as long
> as they are created with the same value as the base table, which would be
> perfect!
>
> Thanks,
>
> -Alex.
>
> *From:* Yuhao Bi [mailto:byh0...@gmail.com ]
> *Sent:* September 23, 2015 00:31
> *To:* user
> *Subject:* Re: Setting a TTL in an upsert
>
> Hi,
>
> As I know, we can only set a ttl in create table stage corresponding to
> HBase table ttl.
>
> CREATE TABLE IF NOT EXISTS my_schema.my_table (
> org_id CHAR(15), entity_id CHAR(15), payload binary(1000),
> CONSTRAINT pk PRIMARY KEY (org_id, entity_id) )
> TTL=86400
>
> See *http://phoenix.apache.org/language/index.html#create_table* for more grammar detail.
>
> Thanks,
>
> 2015-09-23 15:11 GMT+08:00 Alex Loffler <*alex.loff...@telus.com*
> >:
>
> Hi,
>
>
>
> Is it possible to define the TTL of a row (or even each cell in the row)
> during an upsert e.g:
>
>
>
> upsert into test values(1,2,3) TTL=1442988643355;
>
>
>
> Assuming the table has a TTL this would allow per-row retention policies
> (with automatic garbage-collection by HBase) by e.g. setting the upsert TTL
> to a time in the future.
>
>
>
> For example if the TTL on the table is set to 60 (seconds), a row with a
> desired retention policy of 1 year could be upserted with a TTL=now() + 1
> year.
>
>
>
> Thanks in advance,
>
> -Alex.
>
>


Re: Setting a TTL in an upsert

2015-09-23 Thread James Taylor
Also, for more information on (2), see
https://phoenix.apache.org/faq.html#Can_phoenix_work_on_tables_with_arbitrary_timestamp_as_flexible_as_HBase_API

On Wed, Sep 23, 2015 at 10:55 AM, James Taylor 
wrote:

> Hi Alex,
> I can think of a couple of ways to support this:
> 1) Surface support for per Cell TTLs (HBASE-10560) in Phoenix
> (PHOENIX-1335). This could have the kind of syntax you mentioned (or
> alternatively rely on a connection property and no syntactic change would
> be necessary, and then in MutationState (where Phoenix produces HBase
> Mutations), you'd need to use the HBase API to set the TTLs. You'd also
> need to deal with setting secondary index rows to have the same TTLs as
> their data rows.
> 2) Use the CurrentSCN property at connection time for UPSERT calls to
> future date the cell timestamp. You'd also need to set the CurrentSCN
> property for readers above any value you used at UPSERT time as otherwise
> you wouldn't see the data you wrote.
>
> If you're up for it, (1) would be a nice contribution and definitely a
> viable feature.
>
> Thanks,
> James
>
> On Wed, Sep 23, 2015 at 9:08 AM, Alex Loffler 
> wrote:
>
>> Hi,
>>
>> Thanks for the response – would this be a viable feature request? We’re
>> moving from using raw HBase to Phoenix and would like to use this
>> ‘countdown’ feature to allow for different rows in the same table to have
>> different retention times. Instead of having to index a user created TTL
>> column and create a script to manually garbage collect the stale rows,
>> we could continue to leverage HBase’s TTL mechanism to automatically
>> exclude the rows and physically delete them on the next major compaction.
>>
>> From the documentation, Phoenix supports TTL on secondary indexes as long
>> as they are created with the same value as the base table, which would be
>> perfect!
>>
>> Thanks,
>>
>> -Alex.
>>
>> *From:* Yuhao Bi [mailto:byh0...@gmail.com ]
>> *Sent:* September 23, 2015 00:31
>> *To:* user
>> *Subject:* Re: Setting a TTL in an upsert
>>
>> Hi,
>>
>> As I know, we can only set a ttl in create table stage corresponding to
>> HBase table ttl.
>>
>> CREATE TABLE IF NOT EXISTS my_schema.my_table (
>> org_id CHAR(15), entity_id CHAR(15), payload binary(1000),
>> CONSTRAINT pk PRIMARY KEY (org_id, entity_id) )
>> TTL=86400
>>
>> See *http://phoenix.apache.org/language/index.html#create_table* for more grammar detail.
>>
>> Thanks,
>>
>> 2015-09-23 15:11 GMT+08:00 Alex Loffler <*alex.loff...@telus.com*
>> >:
>>
>> Hi,
>>
>>
>>
>> Is it possible to define the TTL of a row (or even each cell in the row)
>> during an upsert e.g:
>>
>>
>>
>> upsert into test values(1,2,3) TTL=1442988643355;
>>
>>
>>
>> Assuming the table has a TTL this would allow per-row retention policies
>> (with automatic garbage-collection by HBase) by e.g. setting the upsert TTL
>> to a time in the future.
>>
>>
>>
>> For example if the TTL on the table is set to 60 (seconds), a row with a
>> desired retention policy of 1 year could be upserted with a TTL=now() + 1
>> year.
>>
>>
>>
>> Thanks in advance,
>>
>> -Alex.
>>
>>
>


Re: Spark Plugin Exception - java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row

2015-09-23 Thread Josh Mahonin
I've got a patch attached to the ticket that I think should fix your issue.

If you're able to try it out and let us know how it goes, it'd be much 
appreciated.

From: Babar Tareen
Reply-To: "user@phoenix.apache.org"
Date: Wednesday, September 23, 2015 at 1:14 PM
To: "user@phoenix.apache.org"
Subject: Re: Spark Plugin Exception - java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
org.apache.spark.sql.Row

I have filed PHOENIX-2287 for this. And the code works fine with Spark 1.4.1.

Thanks

On Wed, Sep 23, 2015 at 6:06 AM Josh Mahonin 
> wrote:
Hi Babar,

Can you file a JIRA for this? I suspect this is something to do with the Spark 
1.5 data frame API data structures, perhaps they've gone and changed them again!

Can you try with previous Spark versions to see if there's a difference? Also, 
you may have luck interfacing with the RDDs directly instead of the data frames.

Thanks!

Josh

From: Babar Tareen
Reply-To: "user@phoenix.apache.org"
Date: Tuesday, September 22, 2015 at 5:47 PM
To: "user@phoenix.apache.org"
Subject: Spark Plugin Exception - java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
org.apache.spark.sql.Row

Hi,

I am trying to run the spark plugin DataFrame sample code available here 
(https://phoenix.apache.org/phoenix_spark.html) and getting following 
exception.  I am running the code against hbase-1.1.1, spark 1.5.0 and phoenix  
4.5.2. HBase is running in standalone mode, locally on OS X.  Any ideas what 
might be causing this exception?


java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
org.apache.spark.sql.Row
at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:439) 
~[spark-sql_2.11-1.5.0.jar:1.5.0]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) 
~[scala-library-2.11.4.jar:na]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) 
~[scala-library-2.11.4.jar:na]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) 
~[scala-library-2.11.4.jar:na]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:366)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
 ~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.scheduler.Task.run(Task.scala:88) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_45]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]

Thanks,
Babar


RE: Setting a TTL in an upsert

2015-09-23 Thread Alex Loffler
Hi James,

 

Thank you for the info/validation. What I had in mind was the ability to define 
the (HBase per cell) timestamp arbitrarily for each upsert statement, but more 
broadly there seems to be at least three levels of granularity:

 

1)  Per cell/column timestamp – where each column in the upsert statement 
could define a separate timestamp (this may make the syntax quite unwieldy, but 
offers the most flexibility). At first glance this hurts my head especially wrt 
honouring the timestamps on primary keys & secondary indexes that are defined 
over multiple columns, where parts of the index/row could disappear at
different times.

2)  Per row timestamp – my original use-case where all of the cells in the 
row receive the same timestamp.

3)  Per connection – As you say, probably the simplest/least impacting as 
this could be passed as a connection property. I believe it will also be 
possible to emulate (2) by using multiple connections, one per retention policy 
duration, so maybe this is a good starting point.
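
To make (3) concrete, a rough (untested) sketch of what I mean by one
connection per retention policy; the quorum address and the durations are
placeholders:

import java.sql.{Connection, DriverManager}
import java.util.Properties

// One connection per retention policy, each future-dated by its policy
// duration via CurrentSCN
def connectionFor(retentionMs: Long): Connection = {
  val props = new Properties()
  props.setProperty("CurrentSCN",
    (System.currentTimeMillis() + retentionMs).toString)
  DriverManager.getConnection("jdbc:phoenix:localhost", props)
}

val oneYearConn   = connectionFor(365L * 24 * 3600 * 1000)
val thirtyDayConn = connectionFor(30L * 24 * 3600 * 1000)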

 

I’m new to the project so will dive into the code to get my bearings before 
pulling together a plan of attack.

 

Thanks again,

-Alex.

 

From: James Taylor [mailto:jamestay...@apache.org] 
Sent: September 23, 2015 10:59
To: James Taylor
Cc: user
Subject: Re: Setting a TTL in an upsert

 

Also, for more information on (2), see 
https://phoenix.apache.org/faq.html#Can_phoenix_work_on_tables_with_arbitrary_timestamp_as_flexible_as_HBase_API

 

On Wed, Sep 23, 2015 at 10:55 AM, James Taylor  wrote:

Hi Alex,

I can think of a couple of ways to support this:

1) Surface support for per Cell TTLs (HBASE-10560) in Phoenix (PHOENIX-1335). 
This could have the kind of syntax you mentioned (or alternatively rely on a 
connection property and no syntactic change would be necessary, and then in 
MutationState (where Phoenix produces HBase Mutations), you'd need to use the 
HBase API to set the TTLs. You'd also need to deal with setting secondary index 
rows to have the same TTLs as their data rows.

2) Use the CurrentSCN property at connection time for UPSERT calls to future 
date the cell timestamp. You'd also need to set the CurrentSCN property for 
readers above any value you used at UPSERT time as otherwise you wouldn't see 
the data you wrote.

 

If you're up for it, (1) would be a nice contribution and definitely a viable 
feature.

 

Thanks,

James

 

On Wed, Sep 23, 2015 at 9:08 AM, Alex Loffler  wrote:

Hi,

Thanks for the response – would this be a viable feature request? We’re moving 
from using raw HBase to Phoenix and would like to use this ‘countdown’ feature 
to allow for different rows in the same table to have different retention 
times. Instead of having to index a user created TTL column and create a script 
to manually garbage collect the stale rows, we could continue to leverage 
HBase’s TTL mechanism to automatically exclude the rows and physically delete 
them on the next major compaction.

From the documentation, Phoenix supports TTL on secondary indexes as long as
they are created with the same value as the base table, which would be perfect!

Thanks,

-Alex.

From: Yuhao Bi [mailto:byh0...@gmail.com]
Sent: September 23, 2015 00:31
To: user
Subject: Re: Setting a TTL in an upsert

Hi,

As I know, we can only set a ttl in create table stage corresponding to HBase 
table ttl.

CREATE TABLE IF NOT EXISTS my_schema.my_table (
org_id CHAR(15), entity_id CHAR(15), payload binary(1000),
CONSTRAINT pk PRIMARY KEY (org_id, entity_id) )
TTL=86400

See http://phoenix.apache.org/language/index.html#create_table for more grammar detail.

Thanks,

2015-09-23 15:11 GMT+08:00 Alex Loffler :

Hi,

 

Is it possible to define the TTL of a row (or even each cell in the row) during 
an upsert e.g:

 

upsert into test values(1,2,3) TTL=1442988643355;

 

Assuming the table has a TTL this would allow per-row retention policies (with 
automatic garbage-collection by HBase) by e.g. setting the upsert TTL to a time 
in the future.

 

For example if the TTL on the table is set to 60 (seconds), a row with a 
desired retention policy of 1 year could be upserted with a TTL=now() + 1 year.

 

Thanks in advance,

-Alex.

 

 





Re: Spark Plugin Exception - java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row

2015-09-23 Thread Babar Tareen
I tried the patch; it resolves this issue. Thanks.

On Wed, Sep 23, 2015 at 11:23 AM Josh Mahonin  wrote:

> I've got a patch attached to the ticket that I think should fix your
> issue.
>
> If you're able to try it out and let us know how it goes, it'd be much
> appreciated.
>
> From: Babar Tareen
> Reply-To: "user@phoenix.apache.org"
> Date: Wednesday, September 23, 2015 at 1:14 PM
> To: "user@phoenix.apache.org"
> Subject: Re: Spark Plugin Exception - java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast
> to org.apache.spark.sql.Row
>
> I have filed PHOENIX-2287 for this. And the code works fine with Spark 1.4.1.
>
> Thanks
>
> On Wed, Sep 23, 2015 at 6:06 AM Josh Mahonin 
> wrote:
>
>> Hi Babar,
>>
>> Can you file a JIRA for this? I suspect this is something to do with the
>> Spark 1.5 data frame API data structures, perhaps they've gone and changed
>> them again!
>>
>> Can you try with previous Spark versions to see if there's a difference?
>> Also, you may have luck interfacing with the RDDs directly instead of the
>> data frames.
>>
>> Thanks!
>>
>> Josh
>>
>> From: Babar Tareen
>> Reply-To: "user@phoenix.apache.org"
>> Date: Tuesday, September 22, 2015 at 5:47 PM
>> To: "user@phoenix.apache.org"
>> Subject: Spark Plugin Exception - java.lang.ClassCastException:
>> org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast
>> to org.apache.spark.sql.Row
>>
>> Hi,
>>
>> I am trying to run the spark plugin DataFrame sample code available here (
>> https://phoenix.apache.org/phoenix_spark.html) and getting following
>> exception.  I am running the code against hbase-1.1.1, spark 1.5.0 and
>> phoenix  4.5.2. HBase is running in standalone mode, locally on OS X.  Any
>> ideas what might be causing this exception?
>>
>>
>> java.lang.ClassCastException:
>> org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast
>> to org.apache.spark.sql.Row
>> at
>> org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:439)
>> ~[spark-sql_2.11-1.5.0.jar:1.5.0]
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
>> ~[scala-library-2.11.4.jar:na]
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
>> ~[scala-library-2.11.4.jar:na]
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
>> ~[scala-library-2.11.4.jar:na]
>> at
>> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:366)
>> ~[spark-sql_2.11-1.5.0.jar:1.5.0]
>> at
>> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>> ~[spark-sql_2.11-1.5.0.jar:1.5.0]
>> at
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$
>> 1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
>> ~[spark-sql_2.11-1.5.0.jar:1.5.0]
>> at
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>> ~[spark-sql_2.11-1.5.0.jar:1.5.0]
>> at
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>> ~[spark-sql_2.11-1.5.0.jar:1.5.0]
>> at
>> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>> ~[spark-core_2.11-1.5.0.jar:1.5.0]
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> ~[spark-core_2.11-1.5.0.jar:1.5.0]
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> ~[spark-core_2.11-1.5.0.jar:1.5.0]
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> ~[spark-core_2.11-1.5.0.jar:1.5.0]
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> ~[spark-core_2.11-1.5.0.jar:1.5.0]
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> ~[spark-core_2.11-1.5.0.jar:1.5.0]
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> ~[spark-core_2.11-1.5.0.jar:1.5.0]
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> ~[spark-core_2.11-1.5.0.jar:1.5.0]
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> ~[spark-core_2.11-1.5.0.jar:1.5.0]
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> ~[spark-core_2.11-1.5.0.jar:1.5.0]
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> [na:1.8.0_45]
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> [na:1.8.0_45]
>> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
>>
>> Thanks,
>> Babar
>>
>


Setting a TTL in an upsert

2015-09-23 Thread Alex Loffler
Hi,

 

Is it possible to define the TTL of a row (or even each cell in the row)
during an upsert, e.g.:

 

upsert into test values(1,2,3) TTL=1442988643355;

 

Assuming the table has a TTL this would allow per-row retention policies
(with automatic garbage-collection by HBase) by e.g. setting the upsert TTL
to a time in the future.

 

For example if the TTL on the table is set to 60 (seconds), a row with a
desired retention policy of 1 year could be upserted with a TTL=now() + 1
year.

 

Thanks in advance,

-Alex.





Re: Setting a TTL in an upsert

2015-09-23 Thread James Taylor
Thanks, Alex. I agree - starting with (3) would be best, as we wouldn't need
any non-standard SQL syntax.

On Wed, Sep 23, 2015 at 12:06 PM, Alex Loffler 
wrote:

> Hi James,
>
>
>
> Thank you for the info/validation. What I had in mind was the ability to
> define the (HBase per cell) timestamp arbitrarily for each upsert
> statement, but more broadly there seems to be at least three levels of
> granularity:
>
>
>
> 1)  Per cell/column timestamp – where each column in the upsert
> statement could define a separate timestamp (this may make the syntax quite
> unwieldy, but offers the most flexibility). At first glance this hurts my
> head especially wrt honouring the timestamps on primary keys & secondary
> indexes that are defined over multiple columns where a parts of the
> index/row could disappear at different times.
>
> 2)  Per row timestamp – my original use-case where all of the cells
> in the row receive the same timestamp.
>
> 3)  Per connection – As you say, probably the simplest/least
> impacting as this could be passed as a connection property. I believe it
> will also be possible to emulate (2) by using multiple connections, one per
> retention policy duration, so maybe this is a good starting point.
>
>
>
> I’m new to the project so will dive into the code to get my bearings
> before pulling together a plan of attack.
>
>
>
> Thanks again,
>
> -Alex.
>
>
>
> *From:* James Taylor [mailto:jamestay...@apache.org]
> *Sent:* September 23, 2015 10:59
> *To:* James Taylor
> *Cc:* user
>
> *Subject:* Re: Setting a TTL in an upsert
>
>
>
> Also, for more information on (2), see
> https://phoenix.apache.org/faq.html#Can_phoenix_work_on_tables_with_arbitrary_timestamp_as_flexible_as_HBase_API
>
>
>
> On Wed, Sep 23, 2015 at 10:55 AM, James Taylor 
> wrote:
>
> Hi Alex,
>
> I can think of a couple of ways to support this:
>
> 1) Surface support for per Cell TTLs (HBASE-10560) in Phoenix
> (PHOENIX-1335). This could have the kind of syntax you mentioned (or
> alternatively rely on a connection property and no syntactic change would
> be necessary, and then in MutationState (where Phoenix produces HBase
> Mutations), you'd need to use the HBase API to set the TTLs. You'd also
> need to deal with setting secondary index rows to have the same TTLs as
> their data rows.
>
> 2) Use the CurrentSCN property at connection time for UPSERT calls to
> future date the cell timestamp. You'd also need to set the CurrentSCN
> property for readers above any value you used at UPSERT time as otherwise
> you wouldn't see the data you wrote.
>
>
>
> If you're up for it, (1) would be a nice contribution and definitely a
> viable feature.
>
>
>
> Thanks,
>
> James
>
>
>
> On Wed, Sep 23, 2015 at 9:08 AM, Alex Loffler 
> wrote:
>
> Hi,
>
> Thanks for the response – would this be a viable feature request? We’re
> moving from using raw HBase to Phoenix and would like to use this
> ‘countdown’ feature to allow for different rows in the same table to have
> different retention times. Instead of having to index a user created TTL
> column and create a script to manually garbage collect the stale rows, we
> could continue to leverage HBase’s TTL mechanism to automatically exclude
> the rows and physically delete them on the next major compaction.
>
> From the documentation, Phoenix supports TTL on secondary indexes as long
> as they are created with the same value as the base table, which would be
> perfect!
>
> Thanks,
>
> -Alex.
>
> *From:* Yuhao Bi [mailto:byh0...@gmail.com ]
> *Sent:* September 23, 2015 00:31
> *To:* user
> *Subject:* Re: Setting a TTL in an upsert
>
> Hi,
>
> As I know, we can only set a ttl in create table stage corresponding to
> HBase table ttl.
>
> CREATE TABLE IF NOT EXISTS my_schema.my_table (
> org_id CHAR(15), entity_id CHAR(15), payload binary(1000),
> CONSTRAINT pk PRIMARY KEY (org_id, entity_id) )
> TTL=86400
>
> See http://phoenix.apache.org/language/index.html#create_table for more grammar detail.
>
> Thanks,
>
> 2015-09-23 15:11 GMT+08:00 Alex Loffler :
>
> Hi,
>
>
>
> Is it possible to define the TTL of a row (or even each cell in the row)
> during an upsert e.g:
>
>
>
> upsert into test values(1,2,3) TTL=1442988643355;
>
>
>
> Assuming the table has a TTL this would allow per-row retention policies
> (with automatic garbage-collection by HBase) by e.g. setting the upsert TTL
> to a time in the future.
>
>
>
> For example if the TTL on the table is set to 60 (seconds), a row with a
> desired retention policy of 1 year could be upserted with a TTL=now() + 1
> year.
>
>
>
> Thanks in advance,
>
> -Alex.
>
>
>
>
>