[jira] [Created] (SPARK-29934) Dataset support for GraphX

2019-11-17 Thread darion yaphet (Jira)
darion yaphet created SPARK-29934:
-

 Summary: Dataset support for GraphX
 Key: SPARK-29934
 URL: https://issues.apache.org/jira/browse/SPARK-29934
 Project: Spark
  Issue Type: Bug
  Components: Graph, GraphX, Spark Core
Affects Versions: 2.4.4
Reporter: darion yaphet


Is there a plan to support GraphX on top of Dataset?
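
Not an existing API, but a minimal sketch of what a Dataset-backed graph could 
look like (all type and method names below are hypothetical; the GraphFrames 
package explores a similar DataFrame-based design):

{code:scala}
import org.apache.spark.sql.Dataset

// Hypothetical typed vertex and edge records.
case class Vertex(id: Long, attr: String)
case class Edge(src: Long, dst: Long, attr: String)

// A graph held as two Datasets instead of the two RDDs GraphX uses.
case class DatasetGraph(vertices: Dataset[Vertex], edges: Dataset[Edge]) {
  import edges.sparkSession.implicits._

  // Out-degree per vertex id, computed with the Dataset API.
  def outDegrees: Dataset[(Long, Long)] = edges.groupByKey(_.src).count()
}
{code}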






[jira] [Updated] (SPARK-21265) Cache method could specify a name

2017-06-30 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-21265:
--
Component/s: (was: Spark Core)

> Cache method could specify a name
> -
>
> Key: SPARK-21265
> URL: https://issues.apache.org/jira/browse/SPARK-21265
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently, when I cache a dataset in my cluster, the *RDD Name* shown in the 
> Storage tab is the RDD's schema. When the structure is very complex, it is 
> hard to read. So I would like a parameter for specifying the RDD's name, 
> with the schema remaining the default.
> {code}
> Project [ftime#141, id#142L, ei#143, ui#144, kv#145, sts#146L, 
> jsontostruct(StructField(filter,StringType,true), 
> StructField(is_auto,StringType,true), StructField(omgid,StringType,true), 
> StructField(devid,StringType,true), StructField(guid,StringType,true), 
> StructField(ztid,StringType,true), StructField(page_step,StringType,true), 
> StructField(itemValue,StringType,true), 
> StructField(reportParams,StringType,true), 
> StructField(build_type,StringType,true), 
> StructField(ref_page_id,StringType,true), 
> StructField(itemName,StringType,true), StructField(imei,StringType,true), 
> StructField(lid,StringType,true), 
> StructField(app_start_time,StringType,true), 
> StructField(omgbizid,StringType,true), StructField(imsi,StringType,true), 
> StructField(vid,StringType,true), StructField(page_id,StringType,true), 
> StructField(pid,StringType,true), 
> StructField(notification_enable,StringType,true), 
> StructField(call_type,StringType,true), 
> StructField(streamid,StringType,true), StructField(flavor,StringType,true), 
> ... 12 more fields) AS ...
> {code}






[jira] [Created] (SPARK-21265) Cache method could specify a name

2017-06-30 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21265:
-

 Summary: Cache method could specify a name
 Key: SPARK-21265
 URL: https://issues.apache.org/jira/browse/SPARK-21265
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.1.1, 2.0.2
Reporter: darion yaphet


Currently, when I cache a dataset in my cluster, the *RDD Name* shown in the 
Storage tab is the RDD's schema. When the structure is very complex, it is 
hard to read. So I would like a parameter for specifying the RDD's name, with 
the schema remaining the default.

{code}
Project [ftime#141, id#142L, ei#143, ui#144, kv#145, sts#146L, 
jsontostruct(StructField(filter,StringType,true), 
StructField(is_auto,StringType,true), StructField(omgid,StringType,true), 
StructField(devid,StringType,true), StructField(guid,StringType,true), 
StructField(ztid,StringType,true), StructField(page_step,StringType,true), 
StructField(itemValue,StringType,true), 
StructField(reportParams,StringType,true), 
StructField(build_type,StringType,true), 
StructField(ref_page_id,StringType,true), 
StructField(itemName,StringType,true), StructField(imei,StringType,true), 
StructField(lid,StringType,true), StructField(app_start_time,StringType,true), 
StructField(omgbizid,StringType,true), StructField(imsi,StringType,true), 
StructField(vid,StringType,true), StructField(page_id,StringType,true), 
StructField(pid,StringType,true), 
StructField(notification_enable,StringType,true), 
StructField(call_type,StringType,true), StructField(streamid,StringType,true), 
StructField(flavor,StringType,true), ... 12 more fields) AS ...
{code}
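
For reference, one way to get a readable name in the Storage tab with the API 
that already exists is to cache through the catalog, which labels the cached 
entry with the table name (a sketch; ds stands for the Dataset being cached 
and spark for the SparkSession):

{code:scala}
// Register under a human-readable name, then cache by that name; the Storage
// tab then shows "In-memory table events" instead of the schema/plan string.
ds.createOrReplaceTempView("events")
spark.catalog.cacheTable("events")
{code}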






[jira] [Comment Edited] (SPARK-21233) Support pluggable offset storage

2017-06-28 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179
 ] 

darion yaphet edited comment on SPARK-21233 at 6/28/17 9:14 AM:


Hi [Sean|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=srowen], 
in Kafka 0.8 the consumer uses zkClient to commit offsets to the ZooKeeper 
cluster. It seems Kafka 0.10+ can store offsets in a topic. I would like to 
add some configuration items to control the storage backend and other 
parameters.


was (Author: darion):
[Sean|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=srowen]  In 
Kafka 0.8 the consumer uses zkClient to commit offsets to the ZooKeeper 
cluster. It seems Kafka 0.10+ can store offsets in a topic. I would like to 
add some configuration items to control the storage backend and other 
parameters.

> Support pluggable offset storage
> 
>
> Key: SPARK-21233
> URL: https://issues.apache.org/jira/browse/SPARK-21233
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently we use *ZooKeeper* to store the *Kafka commit offsets*. When many 
> streaming programs run in the cluster, the load on the ZooKeeper cluster is 
> very high, and ZooKeeper may not be well suited to storing offsets 
> periodically.
> This issue proposes supporting pluggable offset storage so that offsets do 
> not have to be kept in ZooKeeper.
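
A minimal sketch of the plug-in point being requested (the trait, its methods, 
and the config key are hypothetical, not an existing Spark API):

{code:scala}
import org.apache.kafka.common.TopicPartition

// Hypothetical storage abstraction: implementations could be backed by
// ZooKeeper, a Kafka topic, HBase, a relational database, etc.
trait OffsetStore {
  def commit(groupId: String, offsets: Map[TopicPartition, Long]): Unit
  def fetch(groupId: String, partitions: Seq[TopicPartition]): Map[TopicPartition, Long]
}

// The implementation would be selected by a (hypothetical) config item such as
// spark.streaming.kafka.offsetStore.class=com.example.HBaseOffsetStore
{code}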






[jira] [Comment Edited] (SPARK-21233) Support pluggable offset storage

2017-06-28 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179
 ] 

darion yaphet edited comment on SPARK-21233 at 6/28/17 9:13 AM:


[Sean|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=srowen]  In 
Kafka 0.8 the consumer uses zkClient to commit offsets to the ZooKeeper 
cluster. It seems Kafka 0.10+ can store offsets in a topic. I would like to 
add some configuration items to control the storage backend and other 
parameters.


was (Author: darion):
[Sean|sro...@gmail.com]  In Kafka 0.8 the consumer uses zkClient to commit 
offsets to the ZooKeeper cluster. It seems Kafka 0.10+ can store offsets in a 
topic. I would like to add some configuration items to control the storage 
backend and other parameters.

> Support pluggable offset storage
> 
>
> Key: SPARK-21233
> URL: https://issues.apache.org/jira/browse/SPARK-21233
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently we use *ZooKeeper* to store the *Kafka commit offsets*. When many 
> streaming programs run in the cluster, the load on the ZooKeeper cluster is 
> very high, and ZooKeeper may not be well suited to storing offsets 
> periodically.
> This issue proposes supporting pluggable offset storage so that offsets do 
> not have to be kept in ZooKeeper.






[jira] [Comment Edited] (SPARK-21233) Support pluggable offset storage

2017-06-28 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179
 ] 

darion yaphet edited comment on SPARK-21233 at 6/28/17 9:13 AM:


[Sean|sro...@gmail.com]  In Kafka 0.8 the consumer uses zkClient to commit 
offsets to the ZooKeeper cluster. It seems Kafka 0.10+ can store offsets in a 
topic. I would like to add some configuration items to control the storage 
backend and other parameters.


was (Author: darion):
[Sean Owen|sro...@gmail.com]  In Kafka 0.8 the consumer uses zkClient to 
commit offsets to the ZooKeeper cluster. It seems Kafka 0.10+ can store 
offsets in a topic. I would like to add some configuration items to control 
the storage backend and other parameters.

> Support pluggable offset storage
> 
>
> Key: SPARK-21233
> URL: https://issues.apache.org/jira/browse/SPARK-21233
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently we use *ZooKeeper* to store the *Kafka commit offsets*. When many 
> streaming programs run in the cluster, the load on the ZooKeeper cluster is 
> very high, and ZooKeeper may not be well suited to storing offsets 
> periodically.
> This issue proposes supporting pluggable offset storage so that offsets do 
> not have to be kept in ZooKeeper.






[jira] [Commented] (SPARK-21233) Support pluggable offset storage

2017-06-28 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179
 ] 

darion yaphet commented on SPARK-21233:
---

[Sean Owen|sro...@gmail.com]  In Kafka 0.8 the consumer uses zkClient to 
commit offsets to the ZooKeeper cluster. It seems Kafka 0.10+ can store 
offsets in a topic. I would like to add some configuration items to control 
the storage backend and other parameters.

> Support pluggable offset storage
> 
>
> Key: SPARK-21233
> URL: https://issues.apache.org/jira/browse/SPARK-21233
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently we use *ZooKeeper* to store the *Kafka commit offsets*. When many 
> streaming programs run in the cluster, the load on the ZooKeeper cluster is 
> very high, and ZooKeeper may not be well suited to storing offsets 
> periodically.
> This issue proposes supporting pluggable offset storage so that offsets do 
> not have to be kept in ZooKeeper.






[jira] [Created] (SPARK-21233) Support pluggable offset storage

2017-06-27 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21233:
-

 Summary: Support pluggable offset storage
 Key: SPARK-21233
 URL: https://issues.apache.org/jira/browse/SPARK-21233
 Project: Spark
  Issue Type: New Feature
  Components: DStreams
Affects Versions: 2.1.1, 2.0.2
Reporter: darion yaphet


Currently we use *ZooKeeper* to store the *Kafka commit offsets*. When many 
streaming programs run in the cluster, the load on the ZooKeeper cluster is 
very high, and ZooKeeper may not be well suited to storing offsets 
periodically.

This issue proposes supporting pluggable offset storage so that offsets do not 
have to be kept in ZooKeeper.






[jira] [Created] (SPARK-21191) DataFrame Row StructType check duplicate name

2017-06-23 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21191:
-

 Summary: DataFrame Row StructType check duplicate name
 Key: SPARK-21191
 URL: https://issues.apache.org/jira/browse/SPARK-21191
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 2.0.2, 2.0.0
Reporter: darion yaphet


Currently, when we create a DataFrame with *toDF(columns: String)* or build 
one from a *StructType*, nothing prevents duplicate column names.

{code:scala}
val dataset = Seq(
  (0, 3, 4),
  (0, 4, 3),
  (0, 5, 2),
  (1, 3, 3),
  (1, 5, 6),
  (1, 4, 2),
  (2, 3, 5),
  (2, 5, 4),
  (2, 4, 3)
).toDF("1", "1", "2")
dataset.show()
{code}

{code}
+---+---+---+
|  1|  1|  2|
+---+---+---+
|  0|  3|  4|
|  0|  4|  3|
|  0|  5|  2|
|  1|  3|  3|
|  1|  5|  6|
|  1|  4|  2|
|  2|  3|  5|
|  2|  5|  4|
|  2|  4|  3|
+---+---+---+
{code}
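
A minimal sketch of the kind of check being requested, written as a standalone 
helper (the helper name is illustrative, not an existing Spark API):

{code:scala}
import org.apache.spark.sql.types.StructType

// Fail fast when a schema carries duplicate column names.
def requireUniqueColumnNames(schema: StructType): Unit = {
  val duplicates = schema.fieldNames.groupBy(identity).collect {
    case (name, occurrences) if occurrences.length > 1 => name
  }
  require(duplicates.isEmpty, s"Duplicate column name(s): ${duplicates.mkString(", ")}")
}
{code}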






[jira] [Updated] (SPARK-21116) Support map_contains function

2017-06-16 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-21116:
--
Summary: Support map_contains function  (was: Support MapKeyContains 
function)

> Support map_contains function
> -
>
> Key: SPARK-21116
> URL: https://issues.apache.org/jira/browse/SPARK-21116
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> map_contains(map, key)
> Checks whether the map contains the key and returns true if it does. It is 
> similar to *array_contains*.
> For example, map_contains(map(1, 'a', 2, 'b'), 1) returns true.
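
Until such a built-in exists, the same behaviour can be sketched with a UDF 
(mapContains is illustrative; df is assumed to have a 
MapType(IntegerType, StringType) column named "kv"):

{code:scala}
import org.apache.spark.sql.functions.{col, lit, udf}

// Stand-in for the proposed built-in: true when the map contains the key.
val mapContains = udf((m: Map[Int, String], key: Int) => m != null && m.contains(key))

df.select(mapContains(col("kv"), lit(1)).as("has_key"))
{code}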






[jira] [Updated] (SPARK-21116) Support MapKeyContains function

2017-06-16 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-21116:
--
Description: 
map_contains(map, key)

Checks whether the map contains the key and returns true if it does. It is 
similar to *array_contains*.

For example, map_contains(map(1, 'a', 2, 'b'), 1) returns true.

  was:
map_contains(map, key)

Checks whether the map contains the key and returns true if it does. It is 
similar to *array_contains*.


> Support MapKeyContains function
> ---
>
> Key: SPARK-21116
> URL: https://issues.apache.org/jira/browse/SPARK-21116
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> map_contains(map, key)
> Checks whether the map contains the key and returns true if it does. It is 
> similar to *array_contains*.
> For example, map_contains(map(1, 'a', 2, 'b'), 1) returns true.






[jira] [Updated] (SPARK-21116) Support MapKeyContains function

2017-06-16 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-21116:
--
Description: 
map_contains(map, key)

Checks whether the map contains the key and returns true if it does. It is 
similar to *array_contains*.

  was:Checks whether the map contains the key. Returns true if it does.


> Support MapKeyContains function
> ---
>
> Key: SPARK-21116
> URL: https://issues.apache.org/jira/browse/SPARK-21116
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> map_contains(map, key)
> Checks whether the map contains the key and returns true if it does. It is 
> similar to *array_contains*.






[jira] [Created] (SPARK-21116) Support MapKeyContains function

2017-06-15 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21116:
-

 Summary: Support MapKeyContains function
 Key: SPARK-21116
 URL: https://issues.apache.org/jira/browse/SPARK-21116
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.1, 2.0.2
Reporter: darion yaphet


Checks whether the map contains the key. Returns true if it does.






[jira] [Commented] (SPARK-21104) Support sort with index when parsing LibSVM records

2017-06-15 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051388#comment-16051388
 ] 

darion yaphet commented on SPARK-21104:
---

We should make sure the array is sorted ourselves, instead of checking it and 
throwing an exception when it is not.

> Support sort with index when parsing LibSVM records
> 
>
> Key: SPARK-21104
> URL: https://issues.apache.org/jira/browse/SPARK-21104
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>Priority: Minor
>
> When loading LibSVM data from HDFS, I found that the feature indices must 
> be in ascending order.
> We could sort by *indices* while parsing each input line into (index, 
> value) tuples, and avoid checking afterwards whether the indices are in 
> ascending order.
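
A minimal standalone sketch of the proposal, sorting the (index, value) tuples 
as part of parsing (this is not the actual MLlib parser):

{code:scala}
// Parses "label i1:v1 i2:v2 ..." and sorts the features by index during
// parsing, so no separate ascending-order check is needed afterwards.
def parseLibSVMLine(line: String): (Double, Array[(Int, Double)]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val features = tokens.tail.map { item =>
    val Array(index, value) = item.split(':')
    (index.toInt - 1, value.toDouble) // LibSVM indices are 1-based
  }.sortBy(_._1)
  (label, features)
}
{code}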






[jira] [Commented] (SPARK-21104) Support sort with index when parsing LibSVM records

2017-06-15 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16050267#comment-16050267
 ] 

darion yaphet commented on SPARK-21104:
---

Requiring the features to be pre-sorted by index should not be necessary, so 
I added a sort step to the record parser.

> Support sort with index when parsing LibSVM records
> 
>
> Key: SPARK-21104
> URL: https://issues.apache.org/jira/browse/SPARK-21104
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>Priority: Minor
>
> When loading LibSVM data from HDFS, I found that the feature indices must 
> be in ascending order.
> We could sort by *indices* while parsing each input line into (index, 
> value) tuples, and avoid checking afterwards whether the indices are in 
> ascending order.






[jira] [Updated] (SPARK-21104) Support sort with index when parsing LibSVM records

2017-06-15 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-21104:
--
Description: 
When loading LibSVM data from HDFS, I found that the feature indices must be 
in ascending order.
We could sort by *indices* while parsing each input line into (index, value) 
tuples, and avoid checking afterwards whether the indices are in ascending 
order.

  was:When loading LibSVM data from HDFS, I found that the feature indices 
must be in ascending order. We could sort by *indices* while parsing each 
input line into (index, value) tuples, and avoid checking afterwards whether 
the indices are in ascending order.


> Support sort with index when parsing LibSVM records
> 
>
> Key: SPARK-21104
> URL: https://issues.apache.org/jira/browse/SPARK-21104
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>Priority: Minor
>
> When loading LibSVM data from HDFS, I found that the feature indices must 
> be in ascending order.
> We could sort by *indices* while parsing each input line into (index, 
> value) tuples, and avoid checking afterwards whether the indices are in 
> ascending order.






[jira] [Created] (SPARK-21104) Support sort with index when parsing LibSVM records

2017-06-15 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21104:
-

 Summary: Support sort with index when parsing LibSVM records
 Key: SPARK-21104
 URL: https://issues.apache.org/jira/browse/SPARK-21104
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.1.1
Reporter: darion yaphet
Priority: Minor


When loading LibSVM data from HDFS, I found that the feature indices must be 
in ascending order. We could sort by *indices* while parsing each input line 
into (index, value) tuples, and avoid checking afterwards whether the indices 
are in ascending order.






[jira] [Updated] (SPARK-21073) Support map_keys and map_values functions in DataSet

2017-06-13 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-21073:
--
Summary: Support map_keys and map_values functions in DataSet  (was: 
Support map_keys and map_values in DataSet)

> Support map_keys and map_values functions in DataSet
> 
>
> Key: SPARK-21073
> URL: https://issues.apache.org/jira/browse/SPARK-21073
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Add *map_keys* to get the keys from a MapType column and *map_values* to 
> get the values from a MapType column.






[jira] [Updated] (SPARK-21073) Support map_keys and map_values in DataSet

2017-06-13 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-21073:
--
Summary: Support map_keys and map_values in DataSet  (was: Support map_keys 
and map_values in DataSet and DataFrame)

> Support map_keys and map_values in DataSet
> --
>
> Key: SPARK-21073
> URL: https://issues.apache.org/jira/browse/SPARK-21073
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Add *map_keys* to get the keys from a MapType column and *map_values* to 
> get the values from a MapType column.






[jira] [Created] (SPARK-21073) Support map_keys and map_values in DataSet and DataFrame

2017-06-13 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21073:
-

 Summary: Support map_keys and map_values in DataSet and DataFrame
 Key: SPARK-21073
 URL: https://issues.apache.org/jira/browse/SPARK-21073
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.1, 2.0.2
Reporter: darion yaphet


Add *map_keys* to get the keys from a MapType column and *map_values* to get 
the values from a MapType column.
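
For illustration, this is how the two functions would be used once available 
(functions with exactly these names did land in later Spark releases; df is 
assumed to have a MapType column named "kv"):

{code:scala}
import org.apache.spark.sql.functions.{col, map_keys, map_values}

// Extract the keys and values of a map column as two array columns.
df.select(map_keys(col("kv")).as("keys"), map_values(col("kv")).as("values"))
{code}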






[jira] [Updated] (SPARK-21066) LibSVM load just one input file

2017-06-12 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-21066:
--
Description: 
Currently, when we use the LibSVM data source to train on a dataset, only one 
input file is accepted.

A file stored on a distributed file system such as HDFS is split into multiple 
pieces, and I think this limit is unnecessary.

We can join the input paths into a comma-separated string.

  was:
Currently, when we use the LibSVM data source to train on a dataset, only one 
input file is accepted.

The source code is as follows:
{code:scala}
val path = if (dataFiles.length == 1) {
  dataFiles.head.getPath.toUri.toString
} else if (dataFiles.isEmpty) {
  throw new IOException("No input path specified for libsvm data")
} else {
  throw new IOException("Multiple input paths are not supported for libsvm data.")
}
{code}

A file stored on a distributed file system such as HDFS is split into multiple 
pieces, and I think this limit is unnecessary. We can join the input paths 
into a comma-separated string.


> LibSVM load just one input file
> ---
>
> Key: SPARK-21066
> URL: https://issues.apache.org/jira/browse/SPARK-21066
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>
> Currently, when we use the LibSVM data source to train on a dataset, only 
> one input file is accepted.
> A file stored on a distributed file system such as HDFS is split into 
> multiple pieces, and I think this limit is unnecessary.
> We can join the input paths into a comma-separated string.






[jira] [Created] (SPARK-21066) LibSVM load just one input file

2017-06-12 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21066:
-

 Summary: LibSVM load just one input file
 Key: SPARK-21066
 URL: https://issues.apache.org/jira/browse/SPARK-21066
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.1
Reporter: darion yaphet


Currently, when we use the LibSVM data source to train on a dataset, only one 
input file is accepted.

The source code is as follows:
{code:scala}
val path = if (dataFiles.length == 1) {
  dataFiles.head.getPath.toUri.toString
} else if (dataFiles.isEmpty) {
  throw new IOException("No input path specified for libsvm data")
} else {
  throw new IOException("Multiple input paths are not supported for libsvm data.")
}
{code}

A file stored on a distributed file system such as HDFS is split into multiple 
pieces, and I think this limit is unnecessary. We can join the input paths 
into a comma-separated string.
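
A minimal sketch of the suggested change, with the single-file branch replaced 
by a comma-joined path list (dataFiles is the Seq of FileStatus from the 
surrounding reader code; Hadoop's FileInputFormat accepts comma-separated 
paths):

{code:scala}
import java.io.IOException

// Accept any number of input files by joining their paths with commas.
val path =
  if (dataFiles.nonEmpty) dataFiles.map(_.getPath.toUri.toString).mkString(",")
  else throw new IOException("No input path specified for libsvm data")
{code}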






[jira] [Commented] (SPARK-21032) Support add_years and add_days functions

2017-06-11 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046187#comment-16046187
 ] 

darion yaphet commented on SPARK-21032:
---

Yes, it seems these functions are effectively implemented already. Should we 
add aliases for the two functions to make this clearer?

> Support add_years and add_days functions 
> -
>
> Key: SPARK-21032
> URL: https://issues.apache.org/jira/browse/SPARK-21032
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>
> Currently Spark SQL has an add_months function, and this issue proposes 
> supporting add_years and add_days. add_years and add_days would be similar 
> to add_months.
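
A minimal sketch of both functions in terms of built-ins that Spark SQL 
already ships (add_months and date_add), which may be what the comment above 
is referring to:

{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{add_months, date_add}

// add_years as a thin alias over the existing add_months built-in.
def add_years(date: Column, years: Int): Column = add_months(date, years * 12)

// add_days is already covered by the existing date_add built-in.
def add_days(date: Column, days: Int): Column = date_add(date, days)
{code}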






[jira] [Created] (SPARK-21032) Support add_years and add_days functions

2017-06-09 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21032:
-

 Summary: Support add_years and add_days functions 
 Key: SPARK-21032
 URL: https://issues.apache.org/jira/browse/SPARK-21032
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.1
Reporter: darion yaphet


Currently Spark SQL has an add_months function, and this issue proposes 
supporting add_years and add_days. add_years and add_days would be similar to 
add_months.






[jira] [Created] (SPARK-21015) Check field name is not null and empty in GenericRowWithSchema

2017-06-07 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21015:
-

 Summary: Check field name is not null and empty in 
GenericRowWithSchema
 Key: SPARK-21015
 URL: https://issues.apache.org/jira/browse/SPARK-21015
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.1
Reporter: darion yaphet
Priority: Minor


When we look up a field index on a row with a schema, we should make sure the 
field name is neither null nor empty.
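
A minimal sketch of the proposed guard, written against the public StructType 
API (the standalone helper is illustrative):

{code:scala}
import org.apache.spark.sql.types.StructType

// Reject null/empty field names before consulting the schema.
def fieldIndex(schema: StructType, name: String): Int = {
  require(name != null && name.nonEmpty, "field name must not be null or empty")
  schema.fieldIndex(name)
}
{code}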






[jira] [Created] (SPARK-21014) Support get fields with schema name

2017-06-07 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21014:
-

 Summary: Support get fields with schema name 
 Key: SPARK-21014
 URL: https://issues.apache.org/jira/browse/SPARK-21014
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.1
Reporter: darion yaphet
Priority: Minor


Currently, when we want to read a field from a Row, we have to fetch it by 
index. For a Row that carries a schema, being able to read a field by its 
name would be very useful.
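
For reference, Row already offers name-based access when the row carries a 
schema, which may cover this request:

{code:scala}
import org.apache.spark.sql.Row

// Works for rows that carry a schema (e.g. rows collected from a DataFrame).
def readName(row: Row): String = row.getAs[String]("name")

// Equivalent two-step form via the schema's field index.
def readNameByIndex(row: Row): String = row.getString(row.fieldIndex("name"))
{code}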






[jira] [Closed] (SPARK-20968) Support separator in Tokenizer

2017-06-04 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet closed SPARK-20968.
-
Resolution: Fixed

> Support separator in Tokenizer
> --
>
> Key: SPARK-20968
> URL: https://issues.apache.org/jira/browse/SPARK-20968
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.2, 2.1.1
>Reporter: darion yaphet
>Priority: Minor
>







[jira] [Created] (SPARK-20968) Support separator in Tokenizer

2017-06-02 Thread darion yaphet (JIRA)
darion yaphet created SPARK-20968:
-

 Summary: Support separator in Tokenizer
 Key: SPARK-20968
 URL: https://issues.apache.org/jira/browse/SPARK-20968
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 2.1.1, 2.0.2, 2.0.0
Reporter: darion yaphet
Priority: Minor









[jira] [Created] (SPARK-20740) Expose UserDefinedType so users can extend it

2017-05-15 Thread darion yaphet (JIRA)
darion yaphet created SPARK-20740:
-

 Summary: Expose UserDefinedType so users can extend it
 Key: SPARK-20740
 URL: https://issues.apache.org/jira/browse/SPARK-20740
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: darion yaphet


Users may want to extend UserDefinedType to create their own data types. We 
should make UserDefinedType a public class.
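
A minimal sketch of the kind of subclass users would write if the class were 
public, modelled on the example UDTs in Spark's own tests (Point and PointUDT 
are illustrative; UserDefinedType is private[spark] in 2.x, which is the 
restriction this issue asks to lift):

{code:scala}
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

case class Point(x: Double, y: Double)

// Store a Point as an array of two doubles in Catalyst's internal format.
class PointUDT extends UserDefinedType[Point] {
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
  override def serialize(p: Point): GenericArrayData =
    new GenericArrayData(Array(p.x, p.y))
  override def deserialize(datum: Any): Point = datum match {
    case a: ArrayData => Point(a.getDouble(0), a.getDouble(1))
  }
  override def userClass: Class[Point] = classOf[Point]
}
{code}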






[jira] [Updated] (SPARK-20675) Support an index to skip scans of the disk structure in CoGroupedRDD

2017-05-09 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-20675:
--
Description: 
CoGroupedRDD's compute() reads through each StreamBuffer (a disk structure 
that maintains key-value pairs sorted by key) and merges the pairs that share 
a key into one.

So I think adding a sequence index file, or an index section at the head of 
the temporary shuffle file, would let the reader seek to the appropriate 
position and skip a lot of unnecessary scanning.

> Support an index to skip scans of the disk structure in CoGroupedRDD 
> 
>
> Key: SPARK-20675
> URL: https://issues.apache.org/jira/browse/SPARK-20675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>
> CoGroupedRDD's compute() reads through each StreamBuffer (a disk structure 
> that maintains key-value pairs sorted by key) and merges the pairs that 
> share a key into one.
> So I think adding a sequence index file, or an index section at the head of 
> the temporary shuffle file, would let the reader seek to the appropriate 
> position and skip a lot of unnecessary scanning.
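
A minimal sketch of the index idea in isolation: a sorted sparse (key, offset) 
index that lets the reader seek instead of scanning from the start (the file 
layout is illustrative, not Spark's shuffle format):

{code:scala}
import java.io.RandomAccessFile

// Sparse index entry: every Nth key in the sorted file with its byte offset.
final case class IndexEntry(key: Long, offset: Long)

// Binary-search the index for the last entry with key <= target, then seek;
// only the gap up to the next index entry still needs to be scanned.
def seekTo(file: RandomAccessFile, index: Array[IndexEntry], target: Long): Unit = {
  var lo = 0
  var hi = index.length - 1
  while (lo < hi) {
    val mid = (lo + hi + 1) / 2
    if (index(mid).key <= target) lo = mid else hi = mid - 1
  }
  file.seek(index(lo).offset)
}
{code}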






[jira] [Created] (SPARK-20675) Support an index to skip scans of the disk structure in CoGroupedRDD

2017-05-08 Thread darion yaphet (JIRA)
darion yaphet created SPARK-20675:
-

 Summary: Support an index to skip scans of the disk structure in 
CoGroupedRDD 
 Key: SPARK-20675
 URL: https://issues.apache.org/jira/browse/SPARK-20675
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: darion yaphet









[jira] [Closed] (SPARK-20610) Support a function to get DataFrame/DataSet from Transformer

2017-05-05 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet closed SPARK-20610.
-
Resolution: Won't Fix

> Support a function to get DataFrame/DataSet from Transformer
> -
>
> Key: SPARK-20610
> URL: https://issues.apache.org/jira/browse/SPARK-20610
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2, 2.1.0
>Reporter: darion yaphet
>
> We use stages to build our machine learning pipeline. A Transformer 
> transforms an input dataset into an output DataFrame. 
> Sometimes we want to check a DataFrame's contents while developing the 
> pipeline, but that is difficult to do in a test. If Spark ML stages exposed 
> an interface to inspect the DataFrame produced by a stage, we could use it 
> to run such tests.






[jira] [Created] (SPARK-20610) Support a function to get DataFrame/DataSet from Transformer

2017-05-05 Thread darion yaphet (JIRA)
darion yaphet created SPARK-20610:
-

 Summary: Support a function to get DataFrame/DataSet from Transformer
 Key: SPARK-20610
 URL: https://issues.apache.org/jira/browse/SPARK-20610
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.1.0, 2.0.2
Reporter: darion yaphet


We use stages to build our machine learning pipeline. A Transformer 
transforms an input dataset into an output DataFrame. Sometimes we want to 
check a DataFrame's contents while developing the pipeline, but that is 
difficult to do in a test. If Spark ML stages exposed an interface to inspect 
the DataFrame produced by a stage, we could use it to run such tests.
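
For reference, each Transformer can already be applied on its own, which 
allows this kind of inspection in a test today; a minimal sketch:

{code:scala}
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.DataFrame

// Run the pipeline stage by stage, peeking at the intermediate DataFrame
// after each Transformer instead of only seeing the final output.
def debugStages(stages: Seq[Transformer], input: DataFrame): DataFrame =
  stages.foldLeft(input) { (df, stage) =>
    val out = stage.transform(df)
    out.show(5) // inspect the intermediate result
    out
  }
{code}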






[jira] [Created] (SPARK-20469) Add a method to display DataFrame schema in PipelineStage

2017-04-26 Thread darion yaphet (JIRA)
darion yaphet created SPARK-20469:
-

 Summary: Add a method to display DataFrame schema in PipelineStage
 Key: SPARK-20469
 URL: https://issues.apache.org/jira/browse/SPARK-20469
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 2.1.0, 2.0.2, 1.6.3
Reporter: darion yaphet
Priority: Minor


We sometimes apply Transformers and Estimators in a pipeline. A PipelineStage 
that could display the DataFrame schema would be a big help for understanding 
and checking the dataset.
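
PipelineStage already exposes transformSchema, which can be combined with 
StructType's tree printer to get much of this without running any data through 
the stage; a minimal sketch:

{code:scala}
import org.apache.spark.ml.PipelineStage
import org.apache.spark.sql.types.StructType

// Print the schema a stage would produce for a given input schema.
def printStageSchema(stage: PipelineStage, inputSchema: StructType): Unit =
  println(stage.transformSchema(inputSchema).treeString)
{code}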






[jira] [Closed] (SPARK-20276) ScheduledExecutorService control sleep interval

2017-04-10 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet closed SPARK-20276.
-
Resolution: Won't Fix

> ScheduledExecutorService control sleep interval 
> ---
>
> Key: SPARK-20276
> URL: https://issues.apache.org/jira/browse/SPARK-20276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: darion yaphet
>Priority: Minor
>
> The SessionManager runs its startup timeout checker periodically. We could 
> use a ScheduledExecutorService to control the time interval and replace 
> Thread.sleep. That seems easy and elegant.
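
A minimal sketch of the suggested replacement (checkSessionTimeouts is a 
hypothetical stand-in for the real checker body):

{code:scala}
import java.util.concurrent.{Executors, TimeUnit}

// Run the timeout check on a fixed schedule instead of a Thread.sleep loop.
val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleWithFixedDelay(
  new Runnable { def run(): Unit = checkSessionTimeouts() }, // hypothetical checker
  0L,   // initial delay
  60L,  // delay between runs
  TimeUnit.SECONDS)
{code}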






[jira] [Commented] (SPARK-20276) ScheduledExecutorService control sleep interval

2017-04-10 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962630#comment-15962630
 ] 

darion yaphet commented on SPARK-20276:
---

Thanks for your response :) 
*org.apache.hive.service.cli.session.SessionManager* seems to be copied from 
the Hive source code. I will close this issue.

> ScheduledExecutorService control sleep interval 
> ---
>
> Key: SPARK-20276
> URL: https://issues.apache.org/jira/browse/SPARK-20276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: darion yaphet
>Priority: Minor
>
> The SessionManager runs its startup timeout checker periodically. We could 
> use a ScheduledExecutorService to control the time interval and replace 
> Thread.sleep. That seems easy and elegant.






[jira] [Created] (SPARK-20276) ScheduledExecutorService control sleep interval

2017-04-10 Thread darion yaphet (JIRA)
darion yaphet created SPARK-20276:
-

 Summary: ScheduledExecutorService control sleep interval 
 Key: SPARK-20276
 URL: https://issues.apache.org/jira/browse/SPARK-20276
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: darion yaphet
Priority: Minor


The SessionManager runs its startup timeout checker periodically. We could 
use a ScheduledExecutorService to control the time interval and replace 
Thread.sleep. That seems easy and elegant.






[jira] [Created] (SPARK-19794) Release HDFS Client after read/write checkpoint

2017-03-02 Thread darion yaphet (JIRA)
darion yaphet created SPARK-19794:
-

 Summary: Release HDFS Client after read/write checkpoint
 Key: SPARK-19794
 URL: https://issues.apache.org/jira/browse/SPARK-19794
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 2.0.2
Reporter: darion yaphet


RDD checkpointing writes each partition to HDFS and reads it back when the 
RDD needs recomputation. After the HDFS work finishes, the HDFS client and 
streams should be closed.
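
A minimal sketch of the close-after-use pattern being asked for, using the 
standard Hadoop client API:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Read a checkpoint file, closing the stream even if the read fails.
def readCheckpointFile(path: Path, conf: Configuration): Array[Byte] = {
  val fs = FileSystem.get(path.toUri, conf)
  val in = fs.open(path)
  try {
    val bytes = new Array[Byte](fs.getFileStatus(path).getLen.toInt)
    in.readFully(bytes)
    bytes
  } finally {
    in.close() // release the HDFS stream promptly
  }
}
{code}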






[jira] [Created] (SPARK-18064) Spark SQL can't load default config file

2016-10-23 Thread darion yaphet (JIRA)
darion yaphet created SPARK-18064:
-

 Summary: Spark SQL can't load default config file 
 Key: SPARK-18064
 URL: https://issues.apache.org/jira/browse/SPARK-18064
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: darion yaphet









[jira] [Updated] (SPARK-18047) Spark worker port should be greater than 1023

2016-10-21 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-18047:
--
Description: 
The port numbers in the range 0 to 1023 are the well-known ports (system 
ports).

They are widely used by system network services, such as Telnet (23), the 
Simple Mail Transfer Protocol (25), and the Domain Name System (53).

The worker port should avoid using these ports.

  was:The port numbers in the range 0 to 1023 are the well-known ports 
(system ports). They are widely used by system network services, such as 
Telnet (23), the Simple Mail Transfer Protocol (25), and the Domain Name 
System (53). The worker port should avoid using these ports.


> Spark worker port should be greater than 1023
> -
>
> Key: SPARK-18047
> URL: https://issues.apache.org/jira/browse/SPARK-18047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: darion yaphet
>
> The port numbers in the range 0 to 1023 are the well-known ports (system 
> ports).
> They are widely used by system network services, such as Telnet (23), the 
> Simple Mail Transfer Protocol (25), and the Domain Name System (53).
> The worker port should avoid using these ports.






[jira] [Created] (SPARK-18047) Spark worker port should be greater than 1023

2016-10-21 Thread darion yaphet (JIRA)
darion yaphet created SPARK-18047:
-

 Summary: Spark worker port should be greater than 1023
 Key: SPARK-18047
 URL: https://issues.apache.org/jira/browse/SPARK-18047
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1, 2.0.0
Reporter: darion yaphet


The port numbers in the range 0 to 1023 are the well-known ports (system 
ports). They are widely used by system network services, such as Telnet (23), 
the Simple Mail Transfer Protocol (25), and the Domain Name System (53). The 
worker port should avoid using these ports.
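
A minimal sketch of the proposed validation (where it would hook into the 
Worker's argument parsing is left out):

{code:scala}
// Reject the well-known/system port range 0-1023 for the worker port.
def validateWorkerPort(port: Int): Unit =
  require(port > 1023 && port <= 65535,
    s"Invalid worker port $port: must be in 1024-65535, outside the system port range")
{code}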


