Re: [gdal-dev] filter out wanted messages in a grib2 file

2019-03-20 Thread Zhan Zhang - NOAA Affiliate
Thanks, Even! Can gdalinfo also report all the band info that
GDALGetRasterCount() + GDALGetRasterBand() + GDALGetMetadata() +
GDALGetDescription() provide when used together?

> On Mar 19, 2019, at 12:44 PM, Even Rouault  wrote:
> 
>> On Sunday, March 10, 2019 at 21:19:30 CET, Zhan Zhang - NOAA Affiliate wrote:
>> I have a grib2 file which contains many messages, and those messages define
>> different products on different surfaces (like a z axis). For
>> instance, some messages define "soil temperature" (product name) on a
>> surface called "depth below land surface" (surface name); other
>> messages define "geopotential height" (product name) on a "pressure
>> surface" (surface name); etc. May I ask how I can filter out all those
>> messages that define "soil temperature" (product name) on a surface called
>> "depth below land surface" (surface name)? Are there some GRIB2 tools
>> provided in the GDAL APIs that I can use? Thanks!
> 
> Zhan,
> 
> Multiple GRIB messages in a single file are seen by GDAL as a multi-band 
> dataset.
> So you need to iterate over the bands ( GDALGetRasterCount() + 
> GDALGetRasterBand() for the C API ) and fetch each band's metadata ( 
> GDALGetMetadata() 
> ) and description ( GDALGetDescription() ) to see if the band is of interest 
> to you.
> 
> The output of gdalinfo should help you understand this.
> 
> If using the Python API, you may also analyze the result of
> gdal.Info('my.grb2', format = 'json')
> which is a Python dictionary, to retrieve the band you are interested in.
> 
> Best regards,
> 
> Even
> 
> -- 
> Spatialys - Geospatial professional services
> http://www.spatialys.com
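For reference, a minimal Python sketch of the band-iteration approach Even describes (assuming GDAL's Python bindings; the metadata keys such as GRIB_ELEMENT and GRIB_COMMENT and the exact band description text vary between files, so check gdalinfo output on your own file first):

# Minimal sketch: list GRIB2 bands whose product/surface match a filter.
# The specific metadata keys and description wording are assumptions to verify
# against gdalinfo output for the file in question.
from osgeo import gdal

def find_bands(path, product="soil temperature", surface="depth below land surface"):
    ds = gdal.Open(path)
    matches = []
    for i in range(1, ds.RasterCount + 1):
        band = ds.GetRasterBand(i)
        md = band.GetMetadata()        # e.g. GRIB_ELEMENT, GRIB_COMMENT, GRIB_UNIT
        desc = band.GetDescription()   # usually encodes the surface/level
        text = (" ".join(md.values()) + " " + desc).lower()
        if product in text and surface in text:
            matches.append(i)
    return matches

print(find_bands("my.grb2"))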
___
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev

Re: [gdal-dev] filter out wanted messages in a grib2 file

2019-03-11 Thread Zhan Zhang - NOAA Affiliate
When I say "filter out", I mean "retrieve".

On Sun, Mar 10, 2019 at 9:19 PM Zhan Zhang - NOAA Affiliate <
zhan.j.zh...@noaa.gov> wrote:

> I have a grib2 file which contains many messages, and those messages
> define different products on different surfaces (like a z axis). For
> instance, some messages define "soil temperature" (product name) on a
> surface called "depth below land surface" (surface name); other
> messages define "geopotential height" (product name) on a "pressure
> surface" (surface name); etc. May I ask how I can filter out all those
> messages that define "soil temperature" (product name) on a surface called
> "depth below land surface" (surface name)? Are there some GRIB2 tools
> provided in the GDAL APIs that I can use? Thanks!
>

[gdal-dev] filter out wanted messages in a grib2 file

2019-03-10 Thread Zhan Zhang - NOAA Affiliate
I have a grib2 file which contains many messages, and those messages define
different products on different surfaces (like a z axis). For
instance, some messages define "soil temperature" (product name) on a
surface called "depth below land surface" (surface name); other
messages define "geopotential height" (product name) on a "pressure
surface" (surface name); etc. May I ask how I can filter out all those
messages that define "soil temperature" (product name) on a surface called
"depth below land surface" (surface name)? Are there some GRIB2 tools
provided in the GDAL APIs that I can use? Thanks!

Re: [gdal-dev] grib driver

2019-02-28 Thread Zhan Zhang - NOAA Affiliate
May I ask whether the GDAL GRIB driver supports changing the grid in a GRIB2
message? For instance, given a GRIB2 message, I want to change from a polar
stereographic grid projection to a Lambert conformal grid projection, with the
data section changed accordingly, and then output this to a new GRIB2 message.
I asked the MDL degrib group and they told me degrib cannot do this. It sounds
likely that the GDAL GRIB driver cannot do this either. However, I am not 100%
sure and want to seek some comments here. Thanks!
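For what it's worth, the reprojection itself is usually attempted with GDAL's warp API; whether the result can be written back as a GRIB2 message depends on the GDAL version (newer releases added GRIB2 write support), so treat the output format as something to verify. A minimal Python sketch, with made-up Lambert conformal parameters:

# Sketch only: reproject a GRIB2 dataset to Lambert Conformal Conic with gdal.Warp.
# GeoTIFF output is used as a safe fallback; writing GRIB2 directly may or may
# not be possible with a given GDAL build.
from osgeo import gdal

lcc = ("+proj=lcc +lat_1=25 +lat_2=25 +lat_0=25 +lon_0=-95 "
       "+x_0=0 +y_0=0 +datum=WGS84 +units=m")  # example parameters, not NCEP's

gdal.Warp("reprojected.tif", "input.grb2",
          dstSRS=lcc,
          resampleAlg="bilinear",
          format="GTiff")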

On Thu, Feb 28, 2019 at 10:27 AM Even Rouault 
wrote:

> On Thursday, February 28, 2019 at 09:29:41 CET, Zhan Zhang - NOAA Affiliate wrote:
> > I am new to the GDAL APIs and am interested in getting some knowledge of
> > the GRIB driver. May I ask whether it basically provides the same
> > functionality as degrib from MDL/NWS/NOAA? Thanks! --Zhan
>
> The GDAL GRIB driver strongly relies on degrib+g2clib. It exposes their
> functionality through the general GDAL raster API, so that a GRIB dataset
> can be manipulated like any other GDAL raster.
>
> Even
>
> --
> Spatialys - Geospatial professional services
> http://www.spatialys.com
>

[gdal-dev] grib driver

2019-02-28 Thread Zhan Zhang - NOAA Affiliate
I am new to the GDAL APIs and am interested in getting some knowledge of
the GRIB driver. May I ask whether it basically provides the same
functionality as degrib from MDL/NWS/NOAA? Thanks! --Zhan

[jira] [Created] (SPARK-23306) Race condition in TaskMemoryManager

2018-02-01 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-23306:
--

 Summary: Race condition in TaskMemoryManager
 Key: SPARK-23306
 URL: https://issues.apache.org/jira/browse/SPARK-23306
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Zhan Zhang


There is a race condition in TaskMemoryManager, which may cause OOM.

The memory released may be taken by another task, because there is a gap between 
releaseMemory and acquireMemory (e.g., in UnifiedMemoryManager), causing the OOM 
if the current task is the only one that can perform the spill. It can happen to 
BytesToBytesMap, as it only spills the required bytes.
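A simplified, hypothetical illustration of the release/acquire gap (plain Python, not Spark code; the MemoryPool class is invented for the sketch):

{code}
# Hypothetical sketch of the release-then-acquire gap described above.
# Task A frees memory hoping to re-acquire it, but task B grabs it first.
import threading

class MemoryPool:
    def __init__(self, capacity):
        self.free = capacity
        self.lock = threading.Lock()

    def release(self, n):
        with self.lock:
            self.free += n

    def acquire(self, n):
        with self.lock:
            if self.free >= n:
                self.free -= n
                return True
            return False  # in Spark this path can surface as an OOM

pool = MemoryPool(capacity=100)
pool.acquire(100)           # task A holds all the memory
pool.release(100)           # task A spills, releasing memory...
# ...gap: another task can acquire here before task A re-acquires...
pool.acquire(60)            # task B sneaks in
ok = pool.acquire(100)      # task A's re-acquire now fails -> OOM-like failure
print("task A re-acquire succeeded:", ok)
{code}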



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21505) A dynamic join operator to improve the join reliability

2017-11-02 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236365#comment-16236365
 ] 

Zhan Zhang commented on SPARK-21505:


Any comments on this feature? Do you think the design is OK? If so, we are 
going to submit a PR.

> A dynamic join operator to improve the join reliability
> ---
>
> Key: SPARK-21505
> URL: https://issues.apache.org/jira/browse/SPARK-21505
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 3.0.0
>Reporter: Lin
>Priority: Major
>  Labels: features
>
> As we know, hash join is more efficient than sort-merge join. But today hash 
> join is not so widely used because it may fail with an OutOfMemory (OOM) error 
> due to limited memory resources, data skew, statistics mis-estimation and so 
> on. For example, if we apply shuffled hash join to an unevenly distributed 
> dataset, some partitions might be so large that we cannot build a hash table 
> for that particular partition, causing an OOM error. When OOM happens, current 
> Spark will throw an exception, resulting in job failure. On the 
> other hand, if sort-merge join is used, there will be shuffle, sorting and 
> extra spill, causing degradation of the join. Considering the efficiency 
> of hash join, we want to propose a fallback mechanism to dynamically use hash 
> join or sort-merge join at runtime, at the task level, to provide a more 
> reliable join operation.
> This new dynamic join operator internally implements the logic of HashJoin, 
> Iterator Reconstruct, Sort, and MergeJoin. We show the process of this 
> dynamic join method as follows:
> HashJoin: We start by building a hash table on one side of the join partitions. 
> If the hash table is built successfully, it works the same as the current 
> ShuffledHashJoin operator. 
> Sort: If we fail to build the hash table due to the large partition size, we do 
> SortMergeJoin only on this partition, but we first need to rebuild the 
> partition. When OOM happens, a hash table corresponding to a partial part of 
> this partition has been built successfully (e.g. the first 4000 rows of the 
> RDD), and the iterator of this partition is now pointing to the 4001st row of 
> the partition. We reuse this hash table to reconstruct the iterator for the 
> first 4000 rows and concatenate it with the remaining rows of this partition so 
> that we can rebuild this partition completely. On this rebuilt partition, we 
> apply sorting based on key values.
> MergeJoin: After getting two sorted iterators, we perform a regular merge join 
> against them and emit the records to downstream operators.
> Iterator Reconstruct: BytesToBytesMap has to be spilled to disk to release 
> the memory for other operators, such as Sort, Join, etc. In addition, it has 
> to be converted to an iterator, so that it can be concatenated with the 
> remaining items in the original iterator that is used to build the hash table.
> Metadata Population: Necessary metadata, such as sorting keys, join type, 
> etc., has to be populated so that it can be used by the potential Sort and 
> MergeJoin operators.
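To make the fallback idea concrete, a toy, stand-alone Python sketch (not Spark code; it only mirrors the hash-join-then-sort-merge fallback described above, without spilling or iterator reconstruction):

{code}
# Toy per-partition fallback join: try a hash join; if building the hash table
# "fails" (simulated memory limit), fall back to a sort-merge join instead.
def hash_join(build, probe, limit):
    table = {}
    for k, v in build:
        if len(table) >= limit:          # stand-in for an OOM while building
            raise MemoryError("hash table too large")
        table.setdefault(k, []).append(v)
    return [(k, bv, pv) for k, pv in probe for bv in table.get(k, [])]

def sort_merge_join(left, right):
    left, right = sorted(left), sorted(right)
    out, j = [], 0
    for k, lv in left:
        while j < len(right) and right[j][0] < k:
            j += 1
        i = j
        while i < len(right) and right[i][0] == k:
            out.append((k, lv, right[i][1]))
            i += 1
    return out

def dynamic_join(build, probe, limit=1000):
    try:
        return hash_join(build, probe, limit)
    except MemoryError:
        # fallback path: sort both sides of this partition and merge-join them
        return sort_merge_join(build, probe)
{code}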






[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin

2017-07-20 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095425#comment-16095425
 ] 

Zhan Zhang commented on SPARK-21492:


Root cause: in SortMergeJoin (inner/leftOuter/rightOuter), one side's sorted 
iterator may not be exhausted; that chunk of memory thus cannot be released, 
causing a memory leak and performance degradation.

> Memory leak in SortMergeJoin
> 
>
> Key: SPARK-21492
> URL: https://issues.apache.org/jira/browse/SPARK-21492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>    Reporter: Zhan Zhang
>
> In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak 
> caused by the Sort. The memory is not released until the task ends, and cannot 
> be used by other operators, causing a performance drop or OOM.






[jira] [Created] (SPARK-21492) Memory leak in SortMergeJoin

2017-07-20 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-21492:
--

 Summary: Memory leak in SortMergeJoin
 Key: SPARK-21492
 URL: https://issues.apache.org/jira/browse/SPARK-21492
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Zhan Zhang


In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak 
caused by the Sort. The memory is not released until the task ends, and cannot 
be used by other operators, causing a performance drop or OOM.






[jira] [Issue Comment Deleted] (SPARK-20215) ReuseExchange is broken in SparkSQL

2017-04-14 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-20215:
---
Comment: was deleted

(was: Seems to be fixed in SPARK-20229)

> ReuseExchange is broken in SparkSQL
> --
>
> Key: SPARK-20215
> URL: https://issues.apache.org/jira/browse/SPARK-20215
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>    Reporter: Zhan Zhang
>Priority: Minor
>
> Currently, if we have a query like "A join B union A join C..." with the same 
> join key, table A will be scanned multiple times in SQL. It is because the 
> MetastoreRelation is not shared by the two joins and the ExprIds are different, 
> so canonicalized in Expression will not be able to unify them, and the two 
> Exchanges will not be compatible and cannot be reused.






[jira] [Commented] (SPARK-20215) ReuseExchange is broken in SparkSQL

2017-04-14 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968734#comment-15968734
 ] 

Zhan Zhang commented on SPARK-20215:


Seems to be fixed in SPARK-20229

> ReuseExchange is broken in SparkSQL
> --
>
> Key: SPARK-20215
> URL: https://issues.apache.org/jira/browse/SPARK-20215
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>    Reporter: Zhan Zhang
>Priority: Minor
>
> Currently, if we have a query like "A join B union A join C..." with the same 
> join key, table A will be scanned multiple times in SQL. It is because the 
> MetastoreRelation is not shared by the two joins and the ExprIds are different, 
> so canonicalized in Expression will not be able to unify them, and the two 
> Exchanges will not be compatible and cannot be reused.






[jira] [Created] (SPARK-20215) ReuseExchange is broken in SparkSQL

2017-04-04 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-20215:
--

 Summary: ReuseExchange is broken in SparkSQL
 Key: SPARK-20215
 URL: https://issues.apache.org/jira/browse/SPARK-20215
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Zhan Zhang
Priority: Minor


Currently, if we have a query like "A join B union A join C..." with the same 
join key, table A will be scanned multiple times in SQL. It is because the 
MetastoreRelation is not shared by the two joins and the ExprIds are different, 
so canonicalized in Expression will not be able to unify them, and the two 
Exchanges will not be compatible and cannot be reused.
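For illustration, the query shape being described, as a hypothetical PySpark sketch (table and column names are made up); explain() can be used to check whether a ReusedExchange shows up in the plan:

{code}
# Hypothetical query shape: A joined to B and to C on the same key, unioned.
# Under the issue described above, table A ends up being scanned twice because
# the two Exchanges on A are not recognized as reusable.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
a = spark.table("A")
b = spark.table("B")
c = spark.table("C")

result = a.join(b, "k").select("k").union(a.join(c, "k").select("k"))
result.explain()   # inspect whether ReusedExchange appears in the plan
{code}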






[jira] [Comment Edited] (SPARK-20006) Separate threshold for broadcast and shuffled hash join

2017-03-17 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931054#comment-15931054
 ] 

Zhan Zhang edited comment on SPARK-20006 at 3/18/17 4:42 AM:
-

The default ShuffledHashJoin threshold can fall back to the broadcast one. A 
separate configuration does provide us with opportunities to optimize the join 
dramatically. It would be great if the CBO could automatically find the best 
strategy, but perhaps I am missing something. Currently the CBO does not collect 
the right statistics, especially for partitioned tables. I have opened a JIRA for 
that issue as well: https://issues.apache.org/jira/browse/SPARK-19890


was (Author: zhzhan):
The default ShuffledHashJoin threshold can fall back to the broadcast one. A 
separate configuration does provide us with opportunities to optimize the join 
dramatically. It would be great if the CBO could automatically find the best 
strategy, but perhaps I am missing something. Currently the CBO does not collect 
the right statistics, especially for partitioned tables. 
https://issues.apache.org/jira/browse/SPARK-19890

> Separate threshold for broadcast and shuffled hash join
> ---
>
> Key: SPARK-20006
> URL: https://issues.apache.org/jira/browse/SPARK-20006
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>    Reporter: Zhan Zhang
>Priority: Minor
>
> Currently both canBroadcast and canBuildLocalHashMap use the same 
> configuration: AUTO_BROADCASTJOIN_THRESHOLD. 
> But the memory model may be different. For broadcast, currently the hash map 
> is always built on heap. For shuffledHashJoin, the hash map may be built on 
> heap (longHash) or off heap (other maps, if off heap is enabled). The same 
> configuration makes it hard to tune (how to allocate memory 
> on-heap/off-heap). We propose to use different configurations. Please comment on 
> whether it is reasonable.
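For context, a PySpark sketch of tuning the single threshold being discussed; spark.sql.autoBroadcastJoinThreshold is the SQL key behind AUTO_BROADCASTJOIN_THRESHOLD, while a separate shuffled-hash-join threshold is exactly what this issue proposes and does not exist at this point:

{code}
# Sketch: tuning the one threshold that (per this issue) governs both broadcast
# joins and local hash maps for shuffled hash joins.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 10 MB threshold; -1 disables broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

# Preferring shuffled hash join over sort-merge join is controlled separately.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
{code}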






[jira] [Commented] (SPARK-20006) Separate threshold for broadcast and shuffled hash join

2017-03-17 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931054#comment-15931054
 ] 

Zhan Zhang commented on SPARK-20006:


The default ShuffledHashJoin threshold can fall back to the broadcast one. A 
separate configuration does provide us with opportunities to optimize the join 
dramatically. It would be great if the CBO could automatically find the best 
strategy, but perhaps I am missing something. Currently the CBO does not collect 
the right statistics, especially for partitioned tables. 
https://issues.apache.org/jira/browse/SPARK-19890

> Separate threshold for broadcast and shuffled hash join
> ---
>
> Key: SPARK-20006
> URL: https://issues.apache.org/jira/browse/SPARK-20006
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>    Reporter: Zhan Zhang
>Priority: Minor
>
> Currently both canBroadcast and canBuildLocalHashMap use the same 
> configuration: AUTO_BROADCASTJOIN_THRESHOLD. 
> But the memory model may be different. For broadcast, currently the hash map 
> is always built on heap. For shuffledHashJoin, the hash map may be built on 
> heap (longHash) or off heap (other maps, if off heap is enabled). The same 
> configuration makes it hard to tune (how to allocate memory 
> on-heap/off-heap). We propose to use different configurations. Please comment on 
> whether it is reasonable.






[jira] [Updated] (SPARK-20006) Separate threshold for broadcast and shuffled hash join

2017-03-17 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-20006:
---
Description: 
Currently both canBroadcast and canBuildLocalHashMap use the same 
configuration: AUTO_BROADCASTJOIN_THRESHOLD. 

But the memory model may be different. For broadcast, currently the hash map is 
always built on heap. For shuffledHashJoin, the hash map may be built on 
heap (longHash) or off heap (other maps, if off heap is enabled). The same 
configuration makes it hard to tune (how to allocate memory 
on-heap/off-heap). We propose to use different configurations. Please comment on 
whether it is reasonable.

  was:
Currently both canBroadcast and canBuildLocalHashMap use the same 
configuration: AUTO_BROADCASTJOIN_THRESHOLD. 

But the memory model may be different. For broadcast, currently the hash map is 
always built on heap. For shuffledHashJoin, the hash map may be built on 
heap (longHash) or off heap (other maps, if off heap is enabled). The same 
configuration makes it hard to tune (how to allocate memory 
on-heap/off-heap). We propose to use different configurations.


> Separate threshold for broadcast and shuffled hash join
> ---
>
> Key: SPARK-20006
> URL: https://issues.apache.org/jira/browse/SPARK-20006
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>    Reporter: Zhan Zhang
>Priority: Minor
>
> Currently both canBroadcast and canBuildLocalHashMap use the same 
> configuration: AUTO_BROADCASTJOIN_THRESHOLD. 
> But the memory model may be different. For broadcast, currently the hash map 
> is always built on heap. For shuffledHashJoin, the hash map may be built on 
> heap (longHash) or off heap (other maps, if off heap is enabled). The same 
> configuration makes it hard to tune (how to allocate memory 
> on-heap/off-heap). We propose to use different configurations. Please comment on 
> whether it is reasonable.






[jira] [Created] (SPARK-20006) Separate threshold for broadcast and shuffled hash join

2017-03-17 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-20006:
--

 Summary: Separate threshold for broadcast and shuffled hash join
 Key: SPARK-20006
 URL: https://issues.apache.org/jira/browse/SPARK-20006
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Zhan Zhang
Priority: Minor


Currently both canBroadcast and canBuildLocalHashMap use the same 
configuration: AUTO_BROADCASTJOIN_THRESHOLD. 

But the memory model may be different. For broadcast, currently the hash map is 
always built on heap. For shuffledHashJoin, the hash map may be built on 
heap (longHash) or off heap (other maps, if off heap is enabled). The same 
configuration makes it hard to tune (how to allocate memory 
on-heap/off-heap). We propose to use different configurations.






[jira] [Created] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.

2017-03-10 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-19908:
--

 Summary: Direct buffer memory OOM should not cause stage retries.
 Key: SPARK-19908
 URL: https://issues.apache.org/jira/browse/SPARK-19908
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 2.1.0
Reporter: Zhan Zhang
Priority: Minor


Currently, if there is a java.lang.OutOfMemoryError: Direct buffer memory, the 
exception will be changed into a FetchFailedException, causing stage retries.

org.apache.spark.shuffle.FetchFailedException: Direct buffer memory
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692)
at 
org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854)
at 
org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887)
at 
org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
at 
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
at 
io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
at 
io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511

[jira] [Updated] (SPARK-19890) Make MetastoreRelation statistics estimation more accurate

2017-03-09 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-19890:
---
Description: 
Currently the MetastoreRelation statistics are retrieved in the analyze phase, 
and the size is based on the whole table. But for a partitioned table, these 
statistics are not useful, as the table size may be 100x+ larger than the input 
partition size. As a result, the join optimization techniques are not applicable.

It would be great if we could postpone the statistics collection to the 
optimization phase, to get partition information, but before the physical plan 
generation phase, so that JoinSelection can choose a better join method 
(broadcast, shuffled join, or sort-merge join).

Although the MetastoreRelation is not associated with partitions, through 
PhysicalOperation we can get the partition info for the table. Multiple plans 
can use the same MetastoreRelation, but the estimation is still much better 
than the table size. This way, retrieving statistics is straightforward.

Another possible way is to have another data structure associating the 
metastore relation and partitions with the plan to get the most accurate estimation.

  was:
Currently the MetastoreRelation statistics are retrieved in the analyze phase, 
and the size is based on the whole table. But for a partitioned table, these 
statistics are not useful, as the table size may be 100x+ larger than the input 
partition size. As a result, the join optimization techniques are not applicable.

It would be great if we could postpone the statistics collection to the 
optimization phase, to get partition information, but before the physical plan 
generation phase, so that JoinSelection can choose a better join method 
(broadcast, shuffled join, or sort-merge join).

Although the MetastoreRelation is not associated with partitions, through 
PhysicalOperation we can get the partition info for the table. Although 
multiple plans can use the same MetastoreRelation, the estimation is still much 
better than the table size. This way, retrieving statistics is straightforward.

Another possible way is to have another data structure associating the 
metastore relation and partitions with the plan to get the most accurate estimation.


> Make MetastoreRelation statistics estimation more accurate
> 
>
> Key: SPARK-19890
> URL: https://issues.apache.org/jira/browse/SPARK-19890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>    Reporter: Zhan Zhang
>Priority: Minor
>
> Currently the MetastoreRelation statistics are retrieved in the analyze phase, 
> and the size is based on the whole table. But for a partitioned table, these 
> statistics are not useful, as the table size may be 100x+ larger than the input 
> partition size. As a result, the join optimization techniques are not 
> applicable.
> It would be great if we could postpone the statistics collection to the 
> optimization phase, to get partition information, but before the physical plan 
> generation phase, so that JoinSelection can choose a better join method 
> (broadcast, shuffled join, or sort-merge join).
> Although the MetastoreRelation is not associated with partitions, through 
> PhysicalOperation we can get the partition info for the table. 
> Multiple plans can use the same MetastoreRelation, but the estimation is still 
> much better than the table size. This way, retrieving statistics is 
> straightforward.
> Another possible way is to have another data structure associating the 
> metastore relation and partitions with the plan to get the most accurate 
> estimation.






[jira] [Created] (SPARK-19890) Make MetastoreRelation statistics estimation more accurate

2017-03-09 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-19890:
--

 Summary: Make MetastoreRelation statistics estimation more accurate
 Key: SPARK-19890
 URL: https://issues.apache.org/jira/browse/SPARK-19890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Zhan Zhang
Priority: Minor


Currently the MetastoreRelation statistics are retrieved in the analyze phase, 
and the size is based on the whole table. But for a partitioned table, these 
statistics are not useful, as the table size may be 100x+ larger than the input 
partition size. As a result, the join optimization techniques are not applicable.

It would be great if we could postpone the statistics collection to the 
optimization phase, to get partition information, but before the physical plan 
generation phase, so that JoinSelection can choose a better join method 
(broadcast, shuffled join, or sort-merge join).

Although the MetastoreRelation is not associated with partitions, through 
PhysicalOperation we can get the partition info for the table. Although 
multiple plans can use the same MetastoreRelation, the estimation is still much 
better than the table size. This way, retrieving statistics is straightforward.

Another possible way is to have another data structure associating the 
metastore relation and partitions with the plan to get the most accurate estimation.






[jira] [Commented] (SPARK-19839) Fix memory leak in BytesToBytesMap

2017-03-06 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897975#comment-15897975
 ] 

Zhan Zhang commented on SPARK-19839:


When BytesToBytesMap spills, its longArray should be released. Otherwise, it 
may not be released until the task completes. This array may take a significant 
amount of memory, which cannot be used by later operators, such as 
UnsafeShuffleExternalSorter, resulting in more frequent spills in the sorter. This 
patch releases the array, as the destructive iterator will not use it anymore.

> Fix memory leak in BytesToBytesMap
> --
>
> Key: SPARK-19839
> URL: https://issues.apache.org/jira/browse/SPARK-19839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>    Reporter: Zhan Zhang
>







[jira] [Created] (SPARK-19839) Fix memory leak in BytesToBytesMap

2017-03-06 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-19839:
--

 Summary: Fix memory leak in BytesToBytesMap
 Key: SPARK-19839
 URL: https://issues.apache.org/jira/browse/SPARK-19839
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Zhan Zhang









[jira] [Commented] (SPARK-19815) Not orderable should be applied to right key instead of left key

2017-03-03 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895522#comment-15895522
 ] 

Zhan Zhang commented on SPARK-19815:


I thought about the logic again. On the surface, the logic may be correct, since 
in the join the left and right keys should be of the same type. Please close this 
JIRA.

> Not orderable should be applied to right key instead of left key
> 
>
> Key: SPARK-19815
> URL: https://issues.apache.org/jira/browse/SPARK-19815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>    Reporter: Zhan Zhang
>Priority: Minor
>
> When generating ShuffledHashJoinExec, the orderable condition should be 
> applied to right key instead of left key.






[jira] [Updated] (SPARK-19815) Not orderable should be applied to right key instead of left key

2017-03-03 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-19815:
---
Summary: Not orderable should be applied to right key instead of left key  
(was: Not order able should be applied to right key instead of left key)

> Not orderable should be applied to right key instead of left key
> 
>
> Key: SPARK-19815
> URL: https://issues.apache.org/jira/browse/SPARK-19815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>    Reporter: Zhan Zhang
>Priority: Minor
>
> When generating ShuffledHashJoinExec, the orderable condition should be 
> applied to right key instead of left key.






[jira] [Created] (SPARK-19815) Not order able should be applied to right key instead of left key

2017-03-03 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-19815:
--

 Summary: Not order able should be applied to right key instead of 
left key
 Key: SPARK-19815
 URL: https://issues.apache.org/jira/browse/SPARK-19815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Zhan Zhang
Priority: Minor


When generating ShuffledHashJoinExec, the orderable condition should be applied 
to right key instead of left key.






[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-02-11 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862588#comment-15862588
 ] 

Zhan Zhang commented on SPARK-19354:


This fix is actually critical. In production, we found that this behavior can 
cause job retries and failures, especially when speculation is enabled.

/cc [~rxin]

Specifically, we observe that:
When the sorter spills to disk, the task is killed. Then an InterruptedException is 
thrown. Then an OOM is thrown, which causes an unhandled exception in the 
executor and eventually shuts the executor down. This happens a lot with speculative 
tasks, with healthy tasks in the same executor marked as failed as well. 
Retries will happen. Even worse, such retries may fail again for the same 
reason, eventually causing job failure.

17/02/11 15:39:38 ERROR TaskMemoryManager: error while calling spill() on 
org.apache.spark.shuffle.sort.ShuffleExternalSorter@714b17b0
java.nio.channels.ClosedByInterruptException
at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
at 
org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:178)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:186)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:171)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245)
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:359)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:382)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:241)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:162)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.shuffle.sort.ShuffleExternalSorter@714b17b0 : null
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:180)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245)
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:359)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:382)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:241)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:162)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

> Killed tasks are getting marked as FAILED
> -
>
> Key: SPARK-19354
> URL: https://issues.apache.org/jira/browse/SPARK-19354
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Devaraj K
>Priority: Minor
>
> When we enable speculation, we can see there are multiple attempts running 
> for the same task when the first task's progress is slow. If any of the task 
> attempts succeeds, then the other attempts will be killed; while killing those 
> attempts, they are getting marked as failed due to the below error. 
> We need to handle this error and mark the at

[jira] [Commented] (SPARK-13450) SortMergeJoin will OOM when join rows have lot of same keys

2017-01-09 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812550#comment-15812550
 ] 

Zhan Zhang commented on SPARK-13450:


ExternalAppendOnlyMap estimates the size of the data saved. In SortMergeJoin, I 
think we can leverage UnsafeExternalSorter to get more accurate and 
controllable behavior.

> SortMergeJoin will OOM when join rows have lot of same keys
> ---
>
> Key: SPARK-13450
> URL: https://issues.apache.org/jira/browse/SPARK-13450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.2, 2.1.0
>Reporter: Hong Shen
> Attachments: heap-dump-analysis.png
>
>
>   When I run a SQL query with a join, the task throws java.lang.OutOfMemoryError 
> and the SQL query fails. I have set spark.executor.memory to 4096m.
>   SortMergeJoin uses an ArrayBuffer[InternalRow] to store bufferedMatches; if 
> the join rows have a lot of the same key, it will throw OutOfMemoryError.
> {code}
>   /** Buffered rows from the buffered side of the join. This is empty if 
> there are no matches. */
>   private[this] val bufferedMatches: ArrayBuffer[InternalRow] = new 
> ArrayBuffer[InternalRow]
> {code}
>   Here is the stackTrace:
> {code}
> org.xerial.snappy.SnappyNative.arrayCopy(Native Method)
> org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84)
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
> java.io.DataInputStream.readFully(DataInputStream.java:195)
> java.io.DataInputStream.readLong(DataInputStream.java:416)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84)
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:300)
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.bufferMatchingRows(SortMergeJoin.scala:329)
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:229)
> org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:105)
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741)
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:301)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:301)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:89)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
> {code}






[jira] [Commented] (HBASE-15335) Add composite key support in row key

2016-12-05 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722901#comment-15722901
 ] 

Zhan Zhang commented on HBASE-15335:


[~tedyu] It seems that I do not have permission to reassign. Could you help on 
this? Thanks.

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, 
> HBASE-15335-6.patch, HBASE-15335-7.patch, HBASE-15335-8.patch, 
> HBASE-15335-9.patch
>
>
> Add composite key filter support in the connector.





[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706961#comment-15706961
 ] 

Zhan Zhang commented on SPARK-18637:


[~hvanhovell] It is an annotation.

/**
 * UDFType annotations are used to describe properties of a UDF. This gives
 * important information to the optimizer.
 * If the UDF is not deterministic, or if it is stateful, it is necessary to
 * annotate it as such for correctness.
 *
 */
@Public
@Evolving
@Target(ElementType.TYPE)
@Retention(RetentionPolicy.RUNTIME)
@Inherited
public @interface UDFType {
  /**
   * Certain optimizations should not be applied if UDF is not deterministic.
   * Deterministic UDF returns same result each time it is invoked with a
   * particular input. This determinism just needs to hold within the context of
   * a query.
   *
   * @return true if the UDF is deterministic
   */
  boolean deterministic() default true;

  /**
   * If a UDF stores state based on the sequence of records it has processed, it
   * is stateful. A stateful UDF cannot be used in certain expressions such as
   * case statement and certain optimizations such as AND/OR short circuiting
   * don't apply for such UDFs, as they need to be invoked for each record.
   * row_sequence is an example of stateful UDF. A stateful UDF is considered to
   * be non-deterministic, irrespective of what deterministic() returns.
   *
   * @return true
   */
  boolean stateful() default false;

  /**
   * A UDF is considered distinctLike if the UDF can be evaluated on just the
   * distinct values of a column. Examples include min and max UDFs. This
   * information is used by metadata-only optimizer.
   *
   * @return true if UDF is distinctLike
   */
  boolean distinctLike() default false;

  /**
   * Using in analytical functions to specify that UDF implies an ordering
   *
   * @return true if the function implies order
   */
  boolean impliesOrder() default false;
}


> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Zhan Zhang
>
> If the UDFType annotation of a UDF is stateful, it should be considered as 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.






[jira] [Comment Edited] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706961#comment-15706961
 ] 

Zhan Zhang edited comment on SPARK-18637 at 11/29/16 11:52 PM:
---

[~hvanhovell] It is an annotation.

/**
 * UDFType annotations are used to describe properties of a UDF. This gives
 * important information to the optimizer.
 * If the UDF is not deterministic, or if it is stateful, it is necessary to
 * annotate it as such for correctness.
 *
 */


was (Author: zhzhan):
[~hvanhovell] It is an annotation.

/**
 * UDFType annotations are used to describe properties of a UDF. This gives
 * important information to the optimizer.
 * If the UDF is not deterministic, or if it is stateful, it is necessary to
 * annotate it as such for correctness.
 *
 */
@Public
@Evolving
@Target(ElementType.TYPE)
@Retention(RetentionPolicy.RUNTIME)
@Inherited
public @interface UDFType {
  /**
   * Certain optimizations should not be applied if UDF is not deterministic.
   * Deterministic UDF returns same result each time it is invoked with a
   * particular input. This determinism just needs to hold within the context of
   * a query.
   *
   * @return true if the UDF is deterministic
   */
  boolean deterministic() default true;

  /**
   * If a UDF stores state based on the sequence of records it has processed, it
   * is stateful. A stateful UDF cannot be used in certain expressions such as
   * case statement and certain optimizations such as AND/OR short circuiting
   * don't apply for such UDFs, as they need to be invoked for each record.
   * row_sequence is an example of stateful UDF. A stateful UDF is considered to
   * be non-deterministic, irrespective of what deterministic() returns.
   *
   * @return true
   */
  boolean stateful() default false;

  /**
   * A UDF is considered distinctLike if the UDF can be evaluated on just the
   * distinct values of a column. Examples include min and max UDFs. This
   * information is used by metadata-only optimizer.
   *
   * @return true if UDF is distinctLike
   */
  boolean distinctLike() default false;

  /**
   * Using in analytical functions to specify that UDF implies an ordering
   *
   * @return true if the function implies order
   */
  boolean impliesOrder() default false;
}


> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Zhan Zhang
>
> If the UDFType annotation of a UDF is stateful, it should be considered as 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.






[jira] [Updated] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-18637:
---
Component/s: SQL

> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Zhan Zhang
>
> If the UDFType annotation of a UDF is stateful, it should be considered as 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.






[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706905#comment-15706905
 ] 

Zhan Zhang commented on SPARK-18637:


Here are the comments from UDFType:
  /**
   * If a UDF stores state based on the sequence of records it has processed, it
   * is stateful. A stateful UDF cannot be used in certain expressions such as
   * case statement and certain optimizations such as AND/OR short circuiting
   * don't apply for such UDFs, as they need to be invoked for each record.
   * row_sequence is an example of stateful UDF. A stateful UDF is considered to
   * be non-deterministic, irrespective of what deterministic() returns.
   *
   * @return true
   */
  boolean stateful() default false;

> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>    Reporter: Zhan Zhang
>
> If the UDFType annotation of a UDF is stateful, it should be considered as 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.






[jira] [Created] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-18637:
--

 Summary: Stateful UDF should be considered as nondeterministic
 Key: SPARK-18637
 URL: https://issues.apache.org/jira/browse/SPARK-18637
 Project: Spark
  Issue Type: Bug
Reporter: Zhan Zhang


If the UDFType annotation of a UDF is stateful, it should be considered as 
non-deterministic. Otherwise, Catalyst may optimize the plan and return the 
wrong result.
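For comparison, a PySpark sketch of explicitly marking a UDF non-deterministic so the optimizer treats it conservatively, which is analogous to what this issue asks for stateful Hive UDFs (asNondeterministic() is an assumption about later Spark releases, not something available in the version discussed here):

{code}
# Sketch: a stateful, order-dependent function marked non-deterministic so the
# optimizer will not reorder or eliminate it. asNondeterministic() exists only
# in newer Spark releases.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

counter = {"n": 0}
def row_sequence():
    counter["n"] += 1      # stateful: the result depends on call order
    return counter["n"]

row_seq = udf(row_sequence, LongType()).asNondeterministic()
spark.range(5).withColumn("seq", row_seq()).show()
{code}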






[jira] [Commented] (SPARK-18550) Make the queue capacity of LiveListenerBus configurable.

2016-11-22 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15688361#comment-15688361
 ] 

Zhan Zhang commented on SPARK-18550:


I was not aware it has been fixed already. Please help to close it.

> Make the queue capacity of LiveListenerBus configurable.
> 
>
> Key: SPARK-18550
> URL: https://issues.apache.org/jira/browse/SPARK-18550
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>    Reporter: Zhan Zhang
>Priority: Minor
>
> We hit issues where the driver listener bus cannot keep up with the speed of incoming 
> events. The current value is fixed at 1000. This value should be configurable per 
> job. Otherwise, when events are dropped, the UI is totally useless.
> Bus: Dropping SparkListenerEvent because no remaining room in event queue. 






[jira] [Created] (SPARK-18550) Make the queue capacity of LiveListenerBus configurable.

2016-11-22 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-18550:
--

 Summary: Make the queue capacity of LiveListenerBus configurable.
 Key: SPARK-18550
 URL: https://issues.apache.org/jira/browse/SPARK-18550
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Zhan Zhang
Priority: Minor


We hit issues where the driver listener bus cannot keep up with the speed of incoming 
events. The current value is fixed at 1000. This value should be configurable per 
job. Otherwise, when events are dropped, the UI is totally useless.

Bus: Dropping SparkListenerEvent because no remaining room in event queue. 
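
A minimal sketch of the idea, assuming the capacity is read from SparkConf; the configuration key below is hypothetical and only illustrates the shape of the change:

import java.util.concurrent.LinkedBlockingQueue
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Hypothetical key; today the capacity is a hard-coded constant.
val capacity = conf.getInt("spark.scheduler.listenerbus.eventqueue.size", 1000)
val eventQueue = new LinkedBlockingQueue[AnyRef](capacity)

def post(event: AnyRef): Unit = {
  if (!eventQueue.offer(event)) {
    // Same failure mode as today, but the operator can now raise the capacity per job.
    System.err.println("Dropping SparkListenerEvent because no remaining room in event queue.")
  }
}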



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-11-22 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687474#comment-15687474
 ] 

Zhan Zhang commented on SPARK-17637:


[~hvanhovell] Thanks. PR is updated with conflicts resolved.

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
>Priority: Minor
>
> Currently the Spark scheduler implements round-robin scheduling of tasks to 
> executors, which is great as it distributes the load evenly across the 
> cluster, but this leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



SparkPlan/Shuffle stage reuse with Dataset/DataFrame

2016-10-18 Thread Zhan Zhang
Hi Folks,

We have some Dataset/DataFrame use cases that will benefit from reusing the
SparkPlan and shuffle stage.

For example, consider the following case. Because the query optimization and
SparkPlan are generated by Catalyst when the query is executed, the underlying
RDD lineage is regenerated for dataset1. Thus, the shuffle stage will be
executed multiple times.

val dataset1 = dataset.groupBy("key").agg(sum("value"))
dataset1.registerTempTable("tmpTable")
spark.sql("select * from tmpTable where condition").collect()
spark.sql("select * from tmpTable where condition1").collect()

On the one hand, we get an optimized query plan, but on the other hand, we
cannot reuse the data generated by the shuffle stage.

Currently, to reuse dataset1, we have to use persist to cache the data.
It is helpful but sometimes not what we want, as it has some side effects.
For example, we cannot release an executor that has an active cache in it even
if it is idle and dynamic allocation is enabled.

In other words, we only want to reuse the shuffle data as much as possible,
without caching, in a long pipeline with multiple shuffle stages.

I am wondering whether it makes sense to add a new feature to Dataset/DataFrame
to act as a barrier and prevent query optimization from happening across the
barrier.

For example, in the above case, we want Catalyst to treat tmpTable as a barrier
and stop optimization across it, so that we can reuse the underlying RDD
lineage of dataset1.

The prototype code to make it work is quite small, and we tried it in-house
with a new API, Dataset.cacheShuffle, to make this happen (a usage sketch
follows below).

But I want some feedback from the community before opening a JIRA, as in some
sense it does stop the optimization earlier. Any comments?
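
For illustration, a hypothetical usage sketch of the proposed API; Dataset.cacheShuffle is not an existing Spark method, and its name and semantics are taken from the description above:

import org.apache.spark.sql.functions.sum

// Mark dataset1 as a barrier so Catalyst does not re-plan across it, letting the
// shuffle output of the groupBy be reused by both downstream queries.
val dataset1 = dataset.groupBy("key").agg(sum("value")).cacheShuffle()
dataset1.registerTempTable("tmpTable")

// Both queries would then reuse the shuffle files produced for dataset1 instead of
// re-executing the shuffle stage, without persisting the data in executor memory.
spark.sql("select * from tmpTable where condition").collect()
spark.sql("select * from tmpTable where condition1").collect()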





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkPlan-Shuffle-stage-reuse-with-Dataset-DataFrame-tp19502.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-09-23 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516851#comment-15516851
 ] 

Zhan Zhang commented on SPARK-17637:


[~jerryshao] The idea is straightforward. Instead of doing round robin on the 
executors with available cores, the new scheduling will try to allocate tasks 
to the executors with the least available cores. As a result, the executors that 
have more free resources may not get new tasks allocated. With dynamic 
allocation enabled, these executors may be released so that other jobs can get 
the required resources from the underlying resource manager.

It is not specifically bound to dynamic allocation, but that is an easy way to 
understand the gains of the new scheduler. In addition, in the patch (soon to 
be sent out) there is also another scheduler which does exactly the opposite, 
allocating tasks to the executors with the most available cores in order to balance 
the workload across all executors.
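
A minimal sketch of the packed strategy described here, with a made-up ExecutorSlot model; this is illustrative only, not the actual Spark scheduler code:

case class ExecutorSlot(executorId: String, freeCores: Int)

// "Packed": prefer the executor with the fewest free cores that can still run the
// task, so lightly loaded executors stay idle and can be released under dynamic
// allocation. The opposite "balanced" scheduler would simply use maxBy instead.
def pickExecutor(executors: Seq[ExecutorSlot], coresPerTask: Int): Option[ExecutorSlot] =
  executors.filter(_.freeCores >= coresPerTask) match {
    case Seq()      => None
    case candidates => Some(candidates.minBy(_.freeCores))
  }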

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>    Reporter: Zhan Zhang
>Priority: Minor
>
> Currently the Spark scheduler implements round-robin scheduling of tasks to 
> executors, which is great as it distributes the load evenly across the 
> cluster, but this leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-09-22 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15515247#comment-15515247
 ] 

Zhan Zhang commented on SPARK-17637:


cc [~rxin]
A quick prototype shows that, for a tested pipeline, the job can save around 
45% of the reserved CPU and memory when dynamic allocation is enabled.

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>    Reporter: Zhan Zhang
>Priority: Minor
>
> Currently the Spark scheduler implements round-robin scheduling of tasks to 
> executors, which is great as it distributes the load evenly across the 
> cluster, but this leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-09-22 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514105#comment-15514105
 ] 

Zhan Zhang commented on SPARK-17637:


The plan is to introduce a new configuration so that different scheduling 
algorithms can be used for the task scheduling.
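
Purely as a sketch of what such a configuration could look like; the key name and values below are assumptions, not existing Spark settings:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.taskAssigner", "packed")   // hypothetical: or "roundrobin", "balanced"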

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>    Reporter: Zhan Zhang
>Priority: Minor
>
> Currently the Spark scheduler implements round-robin scheduling of tasks to 
> executors, which is great as it distributes the load evenly across the 
> cluster, but this leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-09-22 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-17637:
--

 Summary: Packed scheduling for Spark tasks across executors
 Key: SPARK-17637
 URL: https://issues.apache.org/jira/browse/SPARK-17637
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Zhan Zhang
Priority: Minor


Currently the Spark scheduler implements round-robin scheduling of tasks to 
executors, which is great as it distributes the load evenly across the cluster, 
but this leads to significant resource waste in some cases, especially when 
dynamic allocation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17526) Display the executor log links with the job failure message on Spark UI and Console

2016-09-13 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-17526:
--

 Summary: Display the executor log links with the job failure 
message on Spark UI and Console
 Key: SPARK-17526
 URL: https://issues.apache.org/jira/browse/SPARK-17526
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Zhan Zhang
Priority: Minor


 Display the executor log links with the job failure message on Spark UI and 
Console

"Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 0.0 (TID 3, HostName): 
java.lang.Exception: foo"

To make this failure message more helpful, we should also include the log link of 
the executor on which the task failed in the driver log and on the web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (HBASE-15335) Add composite key support in row key

2016-07-14 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378457#comment-15378457
 ] 

Zhan Zhang commented on HBASE-15335:


[~tedyu] The scaladoc warning seems to be a false positive, as I didn't see the 
map in the comments of the apply method in object HBaseTableCatalog.

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, 
> HBASE-15335-6.patch, HBASE-15335-7.patch, HBASE-15335-8.patch, 
> HBASE-15335-9.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-07-14 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Attachment: HBASE-15335-9.patch

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, 
> HBASE-15335-6.patch, HBASE-15335-7.patch, HBASE-15335-8.patch, 
> HBASE-15335-9.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-07-12 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Attachment: HBASE-15335-8.patch

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, 
> HBASE-15335-6.patch, HBASE-15335-7.patch, HBASE-15335-8.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Zhan Zhang
I saw the pom file has the Hive version as
1.2.1.spark2, but I cannot find the branch in 
https://github.com/pwendell/

Does anyone know where the repo is?

Thanks.

Zhan Zhang




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Anyone-knows-the-hive-repo-for-spark-2-0-tp18234.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-06-14 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Status: Patch Available  (was: Open)

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, 
> HBASE-15335-6.patch, HBASE-15335-7.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-06-14 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Status: Open  (was: Patch Available)

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, 
> HBASE-15335-6.patch, HBASE-15335-7.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-06-14 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Status: Open  (was: Patch Available)

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, 
> HBASE-15335-6.patch, HBASE-15335-7.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-06-14 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Status: Patch Available  (was: Open)

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, 
> HBASE-15335-6.patch, HBASE-15335-7.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-06-14 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Attachment: HBASE-15335-7.patch

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, 
> HBASE-15335-6.patch, HBASE-15335-7.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter

2016-06-13 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-16017:
---
Attachment: HBASE-16017-1.patch

> HBase TableOutputFormat has connection leak in getRecordWriter
> --
>
> Key: HBASE-16017
> URL: https://issues.apache.org/jira/browse/HBASE-16017
> Project: HBase
>  Issue Type: Bug
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-16017-1.patch
>
>
> Currently getRecordWriter will not release the connection until the JVM 
> terminates, which is not a correct assumption given that the function may be 
> invoked many times in the same JVM lifecycle. Inside of MapReduce, the issue 
> has already been fixed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter

2016-06-13 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-16017:
---
Attachment: (was: HBASE-16017-1.patch)

> HBase TableOutputFormat has connection leak in getRecordWriter
> --
>
> Key: HBASE-16017
> URL: https://issues.apache.org/jira/browse/HBASE-16017
> Project: HBase
>  Issue Type: Bug
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-16017-1.patch
>
>
> Currently getRecordWriter will not release the connection until the JVM 
> terminates, which is not a correct assumption given that the function may be 
> invoked many times in the same JVM lifecycle. Inside of MapReduce, the issue 
> has already been fixed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter

2016-06-13 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-16017:
---
Status: Patch Available  (was: Open)

> HBase TableOutputFormat has connection leak in getRecordWriter
> --
>
> Key: HBASE-16017
> URL: https://issues.apache.org/jira/browse/HBASE-16017
> Project: HBase
>  Issue Type: Bug
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-16017-1.patch
>
>
> Currently getRecordWriter will not release the connection until the JVM 
> terminates, which is not a correct assumption given that the function may be 
> invoked many times in the same JVM lifecycle. Inside of MapReduce, the issue 
> has already been fixed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter

2016-06-13 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328354#comment-15328354
 ] 

Zhan Zhang commented on HBASE-16017:


[~te...@apache.org] Can you please take a look? It is a simple fix. To make the 
impact as small as possible, I added a new constructor, so that there is no 
change to any other modules; a rough sketch of the idea follows.
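
A rough sketch of the idea, with hypothetical class and method names rather than the actual patch: the record writer owns the Connection it was created with and releases it in close().

import org.apache.hadoop.hbase.client.{BufferedMutator, Connection, Mutation}

class OwningRecordWriter(connection: Connection, mutator: BufferedMutator) {
  def write(mutation: Mutation): Unit = mutator.mutate(mutation)

  def close(): Unit = {
    // Closing both the mutator and the connection means repeated calls to
    // getRecordWriter in the same JVM no longer accumulate open connections.
    mutator.close()
    connection.close()
  }
}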

> HBase TableOutputFormat has connection leak in getRecordWriter
> --
>
> Key: HBASE-16017
> URL: https://issues.apache.org/jira/browse/HBASE-16017
> Project: HBase
>  Issue Type: Bug
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-16017-1.patch
>
>
> Currently getRecordWriter will not release the connection until the JVM 
> terminates, which is not a correct assumption given that the function may be 
> invoked many times in the same JVM lifecycle. Inside of MapReduce, the issue 
> has already been fixed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter

2016-06-13 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-16017:
---
Attachment: HBASE-16017-1.patch

> HBase TableOutputFormat has connection leak in getRecordWriter
> --
>
> Key: HBASE-16017
> URL: https://issues.apache.org/jira/browse/HBASE-16017
> Project: HBase
>  Issue Type: Bug
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-16017-1.patch
>
>
> Currently getRecordWriter will not release the connection until the JVM 
> terminates, which is not a correct assumption given that the function may be 
> invoked many times in the same JVM lifecycle. Inside of MapReduce, the issue 
> has already been fixed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter

2016-06-13 Thread Zhan Zhang (JIRA)
Zhan Zhang created HBASE-16017:
--

 Summary: HBase TableOutputFormat has connection leak in 
getRecordWriter
 Key: HBASE-16017
 URL: https://issues.apache.org/jira/browse/HBASE-16017
 Project: HBase
  Issue Type: Bug
Reporter: Zhan Zhang
Assignee: Zhan Zhang


Currently getRecordWriter will not release the connection until the JVM terminates, 
which is not a correct assumption given that the function may be invoked many 
times in the same JVM lifecycle. Inside of MapReduce, the issue has already 
been fixed. 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-06-13 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Attachment: HBASE-15335-6.patch

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, 
> HBASE-15335-6.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-06-09 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Attachment: HBASE-15335-5.patch

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case

2016-06-09 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-15848:
---
Affects Version/s: 1.6.1

> Spark unable to read partitioned table in avro format and column name in 
> upper case
> ---
>
> Key: SPARK-15848
> URL: https://issues.apache.org/jira/browse/SPARK-15848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Zhan Zhang
>
> For external partitioned Hive tables created in Avro format, Spark returns 
> "null" values if the column names are in uppercase in the Avro schema.
> The same tables return proper data when queried in the Hive client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case

2016-06-09 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323195#comment-15323195
 ] 

Zhan Zhang commented on SPARK-15848:


cat > file1.csv << ... (content truncated in the archive)
cat > file2.csv << ... (content truncated in the archive)

scala> val tbl = sqlContext.table("default.avro_table_uppercase")

scala> tbl.show

+----------+----------+-----+----+
|student_id|subject_id|marks|year|
+----------+----------+-----+----+
|      null|      null|  100|2000|
|      null|      null|   20|2000|
|      null|      null|  160|2000|
|      null|      null|  963|2000|
|      null|      null|  142|2000|
|      null|      null|  430|2000|
|      null|      null|   91|2002|
|      null|      null|   28|2002|
|      null|      null|   16|2002|
|      null|      null|   96|2002|
|      null|      null|   14|2002|
|      null|      null|   43|2002|
+----------+----------+-----+----+

> Spark unable to read partitioned table in avro format and column name in 
> upper case
> ---
>
> Key: SPARK-15848
> URL: https://issues.apache.org/jira/browse/SPARK-15848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhan Zhang
>
> For external partitioned Hive tables created in Avro format, Spark returns 
> "null" values if the column names are in uppercase in the Avro schema.
> The same tables return proper data when queried in the Hive client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case

2016-06-09 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-15848:
--

 Summary: Spark unable to read partitioned table in avro format and 
column name in upper case
 Key: SPARK-15848
 URL: https://issues.apache.org/jira/browse/SPARK-15848
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhan Zhang


For external partitioned Hive tables created in Avro format, Spark returns 
"null" values if the column names are in uppercase in the Avro schema.
The same tables return proper data when queried in the Hive client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-06-09 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Attachment: HBASE-15335-4.patch

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch, HBASE-15335-4.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15473) Documentation for the usage of hbase dataframe user api (JSON, Avro, etc)

2016-06-01 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15473:
---
Assignee: (was: Zhan Zhang)

> Documentation for the usage of hbase dataframe user api (JSON, Avro, etc)
> -
>
> Key: HBASE-15473
> URL: https://issues.apache.org/jira/browse/HBASE-15473
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation, spark
>    Reporter: Zhan Zhang
>Priority: Blocker
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format

2016-06-01 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang reassigned HBASE-14801:
--

Assignee: Zhan Zhang

> Enhance the Spark-HBase connector catalog with json format
> --
>
> Key: HBASE-14801
> URL: https://issues.apache.org/jira/browse/HBASE-14801
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch, 
> HBASE-14801-3.patch, HBASE-14801-4.patch, HBASE-14801-5.patch, 
> HBASE-14801-6.patch, HBASE-14801-7.patch, HBASE-14801-8.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format

2016-06-01 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-14801:
---
Assignee: (was: Zhan Zhang)

> Enhance the Spark-HBase connector catalog with json format
> --
>
> Key: HBASE-14801
> URL: https://issues.apache.org/jira/browse/HBASE-14801
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
> Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch, 
> HBASE-14801-3.patch, HBASE-14801-4.patch, HBASE-14801-5.patch, 
> HBASE-14801-6.patch, HBASE-14801-7.patch, HBASE-14801-8.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-15441) dataset outer join seems to return incorrect result

2016-05-24 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297881#comment-15297881
 ] 

Zhan Zhang commented on SPARK-15441:


Currently new GenericInternalRow(right.output.length) is used as the null row, but 
it cannot be used to distinguish between the row itself being null and all of its 
columns being null. Probably we can add a special nullRow to represent that the 
InternalRow itself is null, so that the Encoder can identify whether the object 
itself is null or not. 
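
For illustration, a small sketch (assumed, but consistent with the comment above) of why a freshly allocated GenericInternalRow cannot serve as an unambiguous null marker:

import org.apache.spark.sql.catalyst.expressions.GenericInternalRow

val nullRow        = new GenericInternalRow(2)                  // used for non-matching rows
val allNullColumns = new GenericInternalRow(Array[Any](null, null))

// Both rows answer isNullAt(i) == true for every field, so the encoder cannot tell
// "the joined row is absent" apart from "the joined row exists with all-null columns".
assert((0 until 2).forall(nullRow.isNullAt) && (0 until 2).forall(allNullColumns.isNullAt))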

> dataset outer join seems to return incorrect result
> ---
>
> Key: SPARK-15441
> URL: https://issues.apache.org/jira/browse/SPARK-15441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Critical
>
> See notebook
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html
> {code}
> import org.apache.spark.sql.functions
> val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS()
> val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS()
> // The last row _1 should be null, rather than (null, -1)
> left.toDF("k", "v").as[(String, Int)].alias("left")
>   .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"), 
> functions.col("left.k") === functions.col("right.k"), "right_outer")
>   .show()
> {code}
> The returned result currently is
> {code}
> +---------+-----+
> |       _1|   _2|
> +---------+-----+
> |    (a,2)|(a,x)|
> |    (a,1)|(a,x)|
> |    (b,3)|(b,y)|
> |(null,-1)|(d,z)|
> +---------+-----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: right outer joins on Datasets

2016-05-24 Thread Zhan Zhang
The reason for "-1" is that the default value for Integer is -1 if the value
is null

  def defaultValue(jt: String): String = jt match {
...
case JAVA_INT => "-1"
...   
 }
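
A small illustration of the effect, assuming an existing SparkSession named spark; the behavior matches the JIRA discussion above, and declaring the field as Option[Int] (or java.lang.Integer) is the usual way to get None/null instead of -1:

import org.apache.spark.sql.functions
import spark.implicits._

val left  = List(("a", 1), ("b", 3)).toDS()
val right = List(("a", "x"), ("d", "z")).toDS()

val joined = left.toDF("k", "v").as[(String, Int)].alias("left")
  .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"),
    functions.col("left.k") === functions.col("right.k"), "right_outer")

// joined.show() prints (null,-1) for the unmatched "d" row: the Int field of the
// missing left tuple is a primitive, so it is filled with defaultValue(JAVA_INT).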



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/right-outer-joins-on-Datasets-tp17542p17651.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-05-18 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Status: Patch Available  (was: Open)

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-05-18 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Status: Open  (was: Patch Available)

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-05-18 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Attachment: HBASE-15335-3.patch

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, 
> HBASE-15335-3.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15473) Documentation for the usage of hbase dataframe user api (JSON, Avro, etc)

2016-05-18 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289993#comment-15289993
 ] 

Zhan Zhang commented on HBASE-15473:


[~WeiqingYang] is working on this.

> Documentation for the usage of hbase dataframe user api (JSON, Avro, etc)
> -
>
> Key: HBASE-15473
> URL: https://issues.apache.org/jira/browse/HBASE-15473
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation, spark
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
>Priority: Blocker
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-05-18 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Attachment: HBASE-15335-2.patch

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-05-18 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Attachment: HBASE-15335-2.patch

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-05-18 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Attachment: HBASE-15335-2.patch

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-05-18 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Attachment: (was: HBASE-15335-2.patch)

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15335) Add composite key support in row key

2016-05-18 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15335:
---
Attachment: (was: HBASE-15335-2.patch)

> Add composite key support in row key
> 
>
> Key: HBASE-15335
> URL: https://issues.apache.org/jira/browse/HBASE-15335
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15335-1.patch
>
>
> Add composite key filter support in the connector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite

2016-05-16 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284923#comment-15284923
 ] 

Zhan Zhang commented on HBASE-15825:


[~ted_yu] Thanks a lot

> Fix the null pointer in DynamicLogicExpressionSuite
> ---
>
> Key: HBASE-15825
> URL: https://issues.apache.org/jira/browse/HBASE-15825
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Fix For: 2.0.0
>
> Attachments: HBASE-15825-1.patch
>
>
> It only happens in test cases. Not sure why it is not caught. Will submit 
> patch soon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite

2016-05-13 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15825:
---
Status: Patch Available  (was: Open)

> Fix the null pointer in DynamicLogicExpressionSuite
> ---
>
> Key: HBASE-15825
> URL: https://issues.apache.org/jira/browse/HBASE-15825
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
> Fix For: 2.0.0
>
> Attachments: HBASE-15825-1.patch
>
>
> It only happens in test cases. Not sure why it is not caught. Will submit 
> patch soon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite

2016-05-13 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283172#comment-15283172
 ] 

Zhan Zhang commented on HBASE-15825:


[~te...@apache.org] [~jmhsieh] Can you please take a quick look? It is a 
straightforward fix.

> Fix the null pointer in DynamicLogicExpressionSuite
> ---
>
> Key: HBASE-15825
> URL: https://issues.apache.org/jira/browse/HBASE-15825
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
> Fix For: 2.0.0
>
> Attachments: HBASE-15825-1.patch
>
>
> It only happens in test cases. Not sure why it is not caught. Will submit 
> patch soon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite

2016-05-13 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15825:
---
Attachment: HBASE-15825-1.patch

> Fix the null pointer in DynamicLogicExpressionSuite
> ---
>
> Key: HBASE-15825
> URL: https://issues.apache.org/jira/browse/HBASE-15825
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
> Fix For: 2.0.0
>
> Attachments: HBASE-15825-1.patch
>
>
> It only happens in test cases. Not sure why it is not caught. Will submit 
> patch soon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15333) [hbase-spark] Enhance dataframe filters to handle naively encoded short, integer, long, float and double

2016-05-13 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283159#comment-15283159
 ] 

Zhan Zhang commented on HBASE-15333:


Opened JIRA HBASE-15825 to fix the test case.

> [hbase-spark] Enhance dataframe filters to handle naively encoded short, 
> integer, long, float and double
> 
>
> Key: HBASE-15333
> URL: https://issues.apache.org/jira/browse/HBASE-15333
> Project: HBase
>  Issue Type: Sub-task
>  Components: spark
>    Reporter: Zhan Zhang
>Assignee: Zhan Zhang
> Fix For: 2.0.0
>
> Attachments: HBASE-15333-1.patch, HBASE-15333-10.patch, 
> HBASE-15333-2.patch, HBASE-15333-3.patch, HBASE-15333-4.patch, 
> HBASE-15333-5.patch, HBASE-15333-6.patch, HBASE-15333-7.patch, 
> HBASE-15333-8.patch, HBASE-15333-9.patch
>
>
> Currently, the range filter is based on the order of bytes. But for Java 
> primitive types, such as short, int, long, double, and float, their order is 
> not consistent with their byte order, so extra manipulation has to be in place 
> to take care of them correctly.
> For example, for the integer range (-100, 100) with the filter <= 1, the current 
> filter will return 0 and 1, while the right return value should be (-100, 1].
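
To make the byte-order point concrete, a small Scala illustration using the standard org.apache.hadoop.hbase.util.Bytes helper (the values are just examples):

import org.apache.hadoop.hbase.util.Bytes

val minusHundred = Bytes.toBytes(-100)  // two's complement: 0xFF 0xFF 0xFF 0x9C
val one          = Bytes.toBytes(1)     // 0x00 0x00 0x00 0x01

// Unsigned lexicographic comparison says -100 > 1, because the sign bit makes the
// first byte of -100 larger. A naive byte-range filter for "<= 1" therefore misses
// every negative value and returns only 0 and 1.
assert(Bytes.compareTo(minusHundred, one) > 0)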



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite

2016-05-13 Thread Zhan Zhang (JIRA)
Zhan Zhang created HBASE-15825:
--

 Summary: Fix the null pointer in DynamicLogicExpressionSuite
 Key: HBASE-15825
 URL: https://issues.apache.org/jira/browse/HBASE-15825
 Project: HBase
  Issue Type: Sub-task
Reporter: Zhan Zhang


It only happens in test cases. Not sure why it is not caught. Will submit patch 
soon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15333) [hbase-spark] Enhance dataframe filters to handle naively encoded short, integer, long, float and double

2016-05-13 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283153#comment-15283153
 ] 

Zhan Zhang commented on HBASE-15333:


Not sure why it was not caught during the system tests. The failure happens only in 
test cases. Will open a follow-up JIRA to fix the issue.

> [hbase-spark] Enhance dataframe filters to handle naively encoded short, 
> integer, long, float and double
> 
>
> Key: HBASE-15333
> URL: https://issues.apache.org/jira/browse/HBASE-15333
> Project: HBase
>  Issue Type: Sub-task
>  Components: spark
>    Reporter: Zhan Zhang
>Assignee: Zhan Zhang
> Fix For: 2.0.0
>
> Attachments: HBASE-15333-1.patch, HBASE-15333-10.patch, 
> HBASE-15333-2.patch, HBASE-15333-3.patch, HBASE-15333-4.patch, 
> HBASE-15333-5.patch, HBASE-15333-6.patch, HBASE-15333-7.patch, 
> HBASE-15333-8.patch, HBASE-15333-9.patch
>
>
> Currently, the range filter is based on the order of bytes. But for Java 
> primitive types, such as short, int, long, double, and float, their order is 
> not consistent with their byte order, so extra manipulation has to be in place 
> to take care of them correctly.
> For example, for the integer range (-100, 100) with the filter <= 1, the current 
> filter will return 0 and 1, while the right return value should be (-100, 1].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double

2016-05-12 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15333:
---
Attachment: HBASE-15333-10.patch

> Enhance the filter to handle short, integer, long, float and double
> ---
>
> Key: HBASE-15333
> URL: https://issues.apache.org/jira/browse/HBASE-15333
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15333-1.patch, HBASE-15333-10.patch, 
> HBASE-15333-2.patch, HBASE-15333-3.patch, HBASE-15333-4.patch, 
> HBASE-15333-5.patch, HBASE-15333-6.patch, HBASE-15333-7.patch, 
> HBASE-15333-8.patch, HBASE-15333-9.patch
>
>
> Currently, the range filter is based on the order of bytes. But for Java 
> primitive types, such as short, int, long, double, and float, their order is 
> not consistent with their byte order, so extra manipulation has to be in place 
> to take care of them correctly.
> For example, for the integer range (-100, 100) with the filter <= 1, the current 
> filter will return 0 and 1, while the right return value should be (-100, 1].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double

2016-05-11 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280574#comment-15280574
 ] 

Zhan Zhang commented on HBASE-15333:


[~jmhsieh] Would you like to take a final look? Thanks.

> Enhance the filter to handle short, integer, long, float and double
> ---
>
> Key: HBASE-15333
> URL: https://issues.apache.org/jira/browse/HBASE-15333
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, 
> HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, 
> HBASE-15333-6.patch, HBASE-15333-7.patch, HBASE-15333-8.patch, 
> HBASE-15333-9.patch
>
>
> Currently, the range filter is based on the order of bytes. But for Java 
> primitive types, such as short, int, long, double, and float, their order is 
> not consistent with their byte order, so extra manipulation has to be in place 
> to take care of them correctly.
> For example, for the integer range (-100, 100) with the filter <= 1, the current 
> filter will return 0 and 1, while the right return value should be (-100, 1].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double

2016-05-06 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274815#comment-15274815
 ] 

Zhan Zhang commented on HBASE-15333:


I checked the warnings; they are false positives.

> Enhance the filter to handle short, integer, long, float and double
> ---
>
> Key: HBASE-15333
> URL: https://issues.apache.org/jira/browse/HBASE-15333
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, 
> HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, 
> HBASE-15333-6.patch, HBASE-15333-7.patch, HBASE-15333-8.patch, 
> HBASE-15333-9.patch
>
>
> Currently, the range filter is based on the order of bytes. But for Java 
> primitive types, such as short, int, long, double, and float, their order is 
> not consistent with their byte order, so extra manipulation has to be in place 
> to take care of them correctly.
> For example, for the integer range (-100, 100) with the filter <= 1, the current 
> filter will return 0 and 1, while the right return value should be (-100, 1].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double

2016-05-06 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15333:
---
Attachment: HBASE-15333-9.patch

> Enhance the filter to handle short, integer, long, float and double
> ---
>
> Key: HBASE-15333
> URL: https://issues.apache.org/jira/browse/HBASE-15333
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, 
> HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, 
> HBASE-15333-6.patch, HBASE-15333-7.patch, HBASE-15333-8.patch, 
> HBASE-15333-9.patch
>
>
> Currently, the range filter is based on the order of bytes. But for Java 
> primitive types, such as short, int, long, double, and float, their order is 
> not consistent with their byte order, so extra manipulation has to be in place 
> to take care of them correctly.
> For example, for the integer range (-100, 100) with the filter <= 1, the current 
> filter will return 0 and 1, while the right return value should be (-100, 1].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double

2016-05-05 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15333:
---
Attachment: HBASE-15333-8.patch

> Enhance the filter to handle short, integer, long, float and double
> ---
>
> Key: HBASE-15333
> URL: https://issues.apache.org/jira/browse/HBASE-15333
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, 
> HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, 
> HBASE-15333-6.patch, HBASE-15333-7.patch, HBASE-15333-8.patch
>
>
> Currently, the range filter is based on the order of bytes. But for Java 
> primitive types, such as short, int, long, double, and float, their order is 
> not consistent with their byte order, so extra manipulation has to be in place 
> to take care of them correctly.
> For example, for the integer range (-100, 100) with the filter <= 1, the current 
> filter will return 0 and 1, while the right return value should be (-100, 1].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double

2016-05-05 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15333:
---
Attachment: HBASE-15333-7.patch

Resolve warnings.

> Enhance the filter to handle short, integer, long, float and double
> ---
>
> Key: HBASE-15333
> URL: https://issues.apache.org/jira/browse/HBASE-15333
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, 
> HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, 
> HBASE-15333-6.patch, HBASE-15333-7.patch
>
>
> Currently, the range filter is based on the order of bytes. But for Java 
> primitive types, such as short, int, long, double, and float, their order is 
> not consistent with their byte order, so extra manipulation has to be in place 
> to take care of them correctly.
> For example, for the integer range (-100, 100) with the filter <= 1, the current 
> filter will return 0 and 1, while the right return value should be (-100, 1].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double

2016-05-05 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated HBASE-15333:
---
Attachment: HBASE-15333-6.patch

Address review comments. Restructure the encoder as a plugin, so that it can be 
extended to support other formats. 

> Enhance the filter to handle short, integer, long, float and double
> ---
>
> Key: HBASE-15333
> URL: https://issues.apache.org/jira/browse/HBASE-15333
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, 
> HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, 
> HBASE-15333-6.patch
>
>
> Currently, the range filter is based on the order of bytes. But for Java 
> primitive types, such as short, int, long, double, and float, their order is 
> not consistent with their byte order, so extra manipulation has to be in place 
> to take care of them correctly.
> For example, for the integer range (-100, 100) with the filter <= 1, the current 
> filter will return 0 and 1, while the right return value should be (-100, 1].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double

2016-05-02 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267434#comment-15267434
 ] 

Zhan Zhang commented on HBASE-15333:


Thanks for the feedback. I will restructure the code and send it out for 
review this week.

> Enhance the filter to handle short, integer, long, float and double
> ---
>
> Key: HBASE-15333
> URL: https://issues.apache.org/jira/browse/HBASE-15333
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, 
> HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch
>
>
> Currently, the range filter is based on the order of bytes. But for Java 
> primitive types, such as short, int, long, double, and float, their order is 
> not consistent with their byte order, so extra manipulation has to be in place 
> to take care of them correctly.
> For example, for the integer range (-100, 100) with the filter <= 1, the current 
> filter will return 0 and 1, while the right return value should be (-100, 1].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double

2016-04-27 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260928#comment-15260928
 ] 

Zhan Zhang commented on HBASE-15333:


[~jmhsieh] Thanks for reviewing the code. I want to discuss in more detail 
before changing the code.
Following are my comments:

1. DefaultSource.scala: It is not a replacement; instead it fixes the logic in 
the partition pruning and predicate pushdown (here we assume the data is 
naively encoded, the same as the current code base).
2. DefaultSource.scala:618: typo - will fix it.
3. Makes sense. Will do it.
4.
  4.1 Will move it to a separate class.
  4.2 We do not assume any specific encoding/decoding here, as we want to 
support the Java primitive types as already done in the existing codebase. You 
are definitely right that we may want more flexibility and different 
encoders/decoders. I think fixing the current naive byte handling takes 
priority and can be the first step.
5. If we don't change it and there is a PassFilter, it will crash the region 
server.
6. BoundRange.scala:26: Will format the doc in a more formal way. Typically the 
range is inclusive, but the upper-level logic needs to take care of exclusive 
bounds for special cases.
7. Will change FilterOps to JavaBytesEncoder.
8. Will enhance the current test cases.

Overall, I think we can first make the code base support naive encoding 
correctly as the first step; at the framework level this does not prevent 
adding special encodings/decodings, which can be added later by me or other 
contributors. What do you think? Please let me know if you have any concerns.
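To make points 4 and 6 concrete, here is a sketch under the naive big-endian 
two's-complement encoding: a numeric predicate such as x <= 1 on a signed int 
cannot be one contiguous byte range and has to be split, because negative values 
sort after positive ones in unsigned byte order.

import org.apache.hadoop.hbase.util.Bytes

// "x <= 1" splits into two byte ranges under the naive encoding:
val nonNegativePart = (Bytes.toBytes(0), Bytes.toBytes(1))              // covers 0..1
val negativePart    = (Bytes.toBytes(Int.MinValue), Bytes.toBytes(-1))  // covers all negatives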




> Enhance the filter to handle short, integer, long, float and double
> ---
>
> Key: HBASE-15333
> URL: https://issues.apache.org/jira/browse/HBASE-15333
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Zhan Zhang
>    Assignee: Zhan Zhang
> Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, 
> HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch
>
>
> Currently, the range filter is based on the order of bytes. But for Java 
> primitive types such as short, int, long, double, and float, their ordering is 
> not consistent with their byte order, so extra manipulation has to be in place 
> to handle them correctly.
> For example, for the integer range (-100, 100) with the filter <= 1, the current 
> filter will return only 0 and 1, while the right return value should be (-100, 1].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: How this unit test passed on master trunk?

2016-04-23 Thread Zhan Zhang
There are multiple records for the DF

scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).show
+---+-----------------------------+
|  a|min(struct(unresolvedstar()))|
+---+-----------------------------+
|  1|                        [1,1]|
|  3|                        [3,1]|
|  2|                        [2,1]|
+---+-----------------------------+

The meaning of .groupBy($"a").agg(min(struct($"record.*"))) is to get the min 
for all the records with the same $"a".

For example, for TestData2(1,1) :: TestData2(1,2) the result would be 1, (1, 1), 
since struct(1, 1) is less than struct(1, 2). Please check how the ordering is 
implemented in InterpretedOrdering.

The output itself does not have any guaranteed ordering. I am not sure why the 
unit test and the real environment give different results.
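A hedged sketch of how to pin down the row being asserted on, given that the 
grouped output has no guaranteed ordering: either sort explicitly or filter on 
the group key.

import org.apache.spark.sql.functions.{min, struct}

structDF.groupBy($"a").agg(min(struct($"record.*")).as("m")).orderBy($"a").show()
structDF.groupBy($"a").agg(min(struct($"record.*"))).filter($"a" === 3).first()  // Row(3, [3,1])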

Xiao,

I do see a difference between the unit test and a local cluster run. Do you know 
the reason?

Thanks.

Zhan Zhang




On Apr 22, 2016, at 11:23 AM, Yong Zhang <java8...@hotmail.com> wrote:

Hi,

I was trying to find out why this unit test can pass in Spark code.

in
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

for this unit test:

  test("Star Expansion - CreateStruct and CreateArray") {
val structDf = testData2.select("a", "b").as("record")
// CreateStruct and CreateArray in aggregateExpressions
assert(structDf.groupBy($"a").agg(min(struct($"record.*"))).first() == 
Row(3, Row(3, 1)))
assert(structDf.groupBy($"a").agg(min(array($"record.*"))).first() == 
Row(3, Seq(3, 1)))

// CreateStruct and CreateArray in project list (unresolved alias)
assert(structDf.select(struct($"record.*")).first() == Row(Row(1, 1)))
assert(structDf.select(array($"record.*")).first().getAs[Seq[Int]](0) === 
Seq(1, 1))

// CreateStruct and CreateArray in project list (alias)
assert(structDf.select(struct($"record.*").as("a")).first() == Row(Row(1, 
1)))

assert(structDf.select(array($"record.*").as("a")).first().getAs[Seq[Int]](0) 
=== Seq(1, 1))
  }

From my understanding, the data returned in this case should be Row(1, Row(1, 1)), 
as that will be the min of the struct.

In fact, if I run the spark-shell on my laptop, I get the result I expected:


./bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> case class TestData2(a: Int, b: Int)
defined class TestData2

scala> val testData2DF = sqlContext.sparkContext.parallelize(TestData2(1,1) :: 
TestData2(1,2) :: TestData2(2,1) :: TestData2(2,2) :: TestData2(3,1) :: 
TestData2(3,2) :: Nil, 2).toDF()

scala> val structDF = testData2DF.select("a","b").as("record")

scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).first()
res0: org.apache.spark.sql.Row = [1,[1,1]]

scala> structDF.show
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|  2|
|  2|  1|
|  2|  2|
|  3|  1|
|  3|  2|
+---+---+

So from my Spark, which I built from master, I cannot get Row[3,[1,1]] back in 
this case. Why does the unit test assert that Row[3,[1,1]] should be the first 
row, and why does it pass, while I cannot reproduce that in my spark-shell? I am 
trying to understand how to interpret the meaning of "agg(min(struct($"record.*")))".


Thanks

Yong



Re: Save DataFrame to HBase

2016-04-22 Thread Zhan Zhang
You can try this

https://github.com/hortonworks/shc.git

or here
http://spark-packages.org/package/zhzhan/shc

Currently it is in the process of being merged into HBase.
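A hedged write example adapted from the shc README (the catalog fields and 
option names may differ between versions, and df is assumed to be an existing 
DataFrame matching the catalog):

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

def catalog = s"""{
    |"table":{"namespace":"default", "name":"table1"},
    |"rowkey":"key",
    |"columns":{
      |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
      |"col1":{"cf":"cf1", "col":"col1", "type":"int"}
    |}
  |}""".stripMargin

df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
               HBaseTableCatalog.newTable -> "5"))   // "5" = number of regions if the table is created
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()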

Thanks.

Zhan Zhang

On Apr 21, 2016, at 8:44 AM, Benjamin Kim <bbuil...@gmail.com> wrote:

Hi Ted,

Can this module be used with an older version of HBase, such as 1.0 or 1.1? 
Where can I get the module from?

Thanks,
Ben

On Apr 21, 2016, at 6:56 AM, Ted Yu <yuzhih...@gmail.com> wrote:

The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can do 
this.

On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
Has anyone found an easy way to save a DataFrame into HBase?

Thanks,
Ben


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






Re: Spark SQL insert overwrite table not showing all the partition.

2016-04-22 Thread Zhan Zhang
INSERT OVERWRITE will overwrite any existing data in the table or partition

  *   unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0, 
https://issues.apache.org/jira/browse/HIVE-2612).


Thanks.

Zhan Zhang
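A hedged sketch of that clause, assuming a HiveContext and a fully static 
partition spec (as far as I know, IF NOT EXISTS does not combine with dynamic 
partitions); the table name, column names, and values are only illustrative:

sqlContext.sql("""
  INSERT OVERWRITE TABLE base_table PARTITION (date='2016-04-21', date_2='x')
  IF NOT EXISTS
  SELECT col_a, col_b FROM temp_table
""")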

On Apr 21, 2016, at 3:20 PM, Bijay Kumar Pathak <bkpat...@mtu.edu> wrote:

Hi,

I have a job which writes to a Hive table with dynamic partitions. Inside the 
job, I am writing into the table twice, but I am only seeing the partitions from 
the last write, although I can see in the Spark UI that it is processing data 
for both partitions.

Below is the query I am using to write to the table.

hive_c.sql("""INSERT OVERWRITE TABLE base_table PARTITION (date='{0}', date_2)
  SELECT * from temp_table
""".format(date_val)
 )


Thanks,
Bijay



Re: Spark DataFrame sum of multiple columns

2016-04-22 Thread Zhan Zhang
You can define your own UDF; the following is one example:

Thanks

Zhan Zhang


val foo = udf((a: Int, b: String) => a.toString + b)

checkAnswer(
  // SELECT *, foo(key, value) FROM testData
  testData.select($"*", foo('key, 'value)).limit(3),
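
For the original question below (a row-wise sum across many columns), here is a 
hedged sketch that avoids writing out every column by hand, assuming a DataFrame 
df whose columns A..Z are all numeric:

import org.apache.spark.sql.functions.col

val total   = df.columns.map(col).reduce(_ + _)   // A + B + ... + Z as one Column
val withSum = df.withColumn("NewColumn", total)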



On Apr 21, 2016, at 8:51 PM, Naveen Kumar Pokala <npok...@spcapitaliq.com> wrote:

Hi,

Do we have any way to perform row-level operations in Spark DataFrames?

For example,

I have a DataFrame with columns from A, B, C, ... Z. I want to add one more 
column, New Column, with the sum of all the column values.

A   B   C   D   ...   Z    New Column
1   2   4   3   ...   26   351



Can somebody help me on this?


Thanks,
Naveen



Re: Why Spark having OutOfMemory Exception?

2016-04-21 Thread Zhan Zhang
The data may not be large, but the driver needs to do a lot of bookkeeping. In 
your case, it is possible the driver control plane takes too much memory.

I think you can find a Java developer to look at the coredump. Otherwise, it is 
hard to tell exactly which part is using all the memory.

Thanks.

Zhan Zhang


On Apr 20, 2016, at 1:38 AM, 李明伟 <kramer2...@126.com> wrote:

Hi

the input data size is less than 10M. The task result size should be less, I 
think, because I am doing aggregation on the data.





At 2016-04-20 16:18:31, "Jeff Zhang" <zjf...@gmail.com> wrote:
Do you mean the input data size is 10M, or the task result size?

>>> But my way is to setup a forever loop to handle continued income data. Not 
>>> sure if it is the right way to use spark
Not sure what this means. Do you use spark-streaming, or are you running batch 
jobs in the forever loop?



On Wed, Apr 20, 2016 at 3:55 PM, 李明伟 <kramer2...@126.com> wrote:
Hi Jeff

The total size of my data is less than 10M. I already set the driver memory to 
4GB.







On 2016-04-20 13:42:25, "Jeff Zhang" <zjf...@gmail.com> wrote:
It seems to be an OOM on the driver side when fetching task results.

You can try to increase spark.driver.memory and spark.driver.maxResultSize.
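A hedged sketch of those settings (values are examples only). Note that 
spark.driver.memory must be set before the driver JVM starts, e.g. via 
spark-defaults.conf or spark-submit --driver-memory 4g, while 
spark.driver.maxResultSize can go into the SparkConf:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("aggregation-job")             // illustrative name
  .set("spark.driver.maxResultSize", "2g")   // default is 1g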

On Tue, Apr 19, 2016 at 4:06 PM, 李明伟 <kramer2...@126.com> wrote:
Hi Zhan Zhang


Please see the exception trace below. It is reporting a GC overhead limit error.
I am not a Java or Scala developer, so it is hard for me to understand this 
information. Also, reading a coredump is too difficult for me.

I am not sure if the way I am using Spark is correct. I understand that Spark 
can do batch or stream calculation, but my way is to set up a forever loop to 
handle continuously incoming data.
Not sure if it is the right way to use Spark.


16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
at 
scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
at 
org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC 
overhead limit exceeded
at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
at 
scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethod

Re: [GRAPHX] Graph Algorithms and Spark

2016-04-21 Thread Zhan Zhang

You can take a look at this blog post from Databricks about GraphFrames:

https://databricks.com/blog/2016/03/03/introducing-graphframes.html

Thanks.

Zhan Zhang

On Apr 21, 2016, at 12:53 PM, Robin East <robin.e...@xense.co.uk> wrote:

Hi

Aside from LDA, which is implemented in MLLib, GraphX has the following 
built-in algorithms:


  *   PageRank/Personalised PageRank
  *   Connected Components
  *   Strongly Connected Components
  *   Triangle Count
  *   Shortest Paths
  *   Label Propagation

It also implements a version of the Pregel framework, a form of bulk-synchronous 
parallel processing that is the foundation of most of the above algorithms. We 
cover other algorithms in our book, and if you search on Google you will find a 
number of other examples.
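For instance, here is a hedged sketch of calling one of the built-in algorithms 
listed above (landmark-based shortest paths) from the Scala API, using a 
synthetic graph; sc is the usual SparkContext:

import org.apache.spark.graphx.lib.ShortestPaths
import org.apache.spark.graphx.util.GraphGenerators

val graph  = GraphGenerators.logNormalGraph(sc, numVertices = 20)   // example graph
val result = ShortestPaths.run(graph, landmarks = Seq(1L, 4L))      // distances to vertices 1 and 4
result.vertices.collect.foreach { case (id, spMap) => println(s"$id -> $spMap") }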

---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 21 Apr 2016, at 19:47, tgensol <thibaut.gensol...@gmail.com> wrote:

Hi there,

I am working in a group at the University of Michigan, and we are trying to
implement (and first find) some distributed graph algorithms.

I know Spark, and I found GraphX. I read the docs, but I only found the Latent
Dirichlet Allocation algorithm working with GraphX, so I was wondering why.

Basically, the group wants to implement minimum spanning tree, kNN, and
shortest path at first.

So my questions are:
Is GraphX stable enough for developing these kinds of algorithms on it?
Do you know of algorithms like these working on top of GraphX? And if not,
why do you think nobody has tried to do it? Is it too hard? Or just because
nobody needs it?

Maybe it is only that my knowledge of GraphX is weak, and it is not
possible to implement these algorithms with GraphX.

Thanking you in advance,
Best regards,

Thibaut



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/GRAPHX-Graph-Algorithms-and-Spark-tp17301.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org




