[jira] [Created] (SPARK-29934) Dataset support for GraphX
darion yaphet created SPARK-29934: - Summary: Dataset support for GraphX Key: SPARK-29934 URL: https://issues.apache.org/jira/browse/SPARK-29934 Project: Spark Issue Type: Bug Components: Graph, GraphX, Spark Core Affects Versions: 2.4.4 Reporter: darion yaphet Is there a plan to support GraphX on top of Dataset?
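For context on the request: GraphX currently exposes only RDD-based APIs, so a Dataset of edges has to be converted back to RDDs before a graph can be built. A minimal sketch of the current workaround, assuming a hypothetical Dataset of (src, dst, weight) edge tuples:

{code:scala}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("graphx-from-dataset").getOrCreate()
import spark.implicits._

// Hypothetical Dataset of edges; GraphX only accepts RDDs, so we drop down with .rdd
val edges = Seq((1L, 2L, 1.0), (2L, 3L, 2.0)).toDS()
val edgeRDD = edges.rdd.map { case (src, dst, weight) => Edge(src, dst, weight) }

// Build the graph from the RDD; a Dataset-native API is what the ticket asks about.
val graph = Graph.fromEdges(edgeRDD, defaultValue = 0.0)
println(graph.numEdges)
{code}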
[jira] [Updated] (SPARK-21265) Cache method could specify a name
[ https://issues.apache.org/jira/browse/SPARK-21265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet updated SPARK-21265: -- Component/s: (was: Spark Core) > Cache method could specify a name > - > > Key: SPARK-21265 > URL: https://issues.apache.org/jira/browse/SPARK-21265 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.0.2, 2.1.1 > Reporter: darion yaphet > > Currently, when I cache a Dataset in my cluster, the *RDD Name* shown in the Storage tab is the RDD's schema. When the structure is very complex, this is hard to read. I would like a parameter to specify the RDD's name, with the schema remaining the default.
{code}
Project [ftime#141, id#142L, ei#143, ui#144, kv#145, sts#146L, jsontostruct(StructField(filter,StringType,true), StructField(is_auto,StringType,true), StructField(omgid,StringType,true), StructField(devid,StringType,true), StructField(guid,StringType,true), StructField(ztid,StringType,true), StructField(page_step,StringType,true), StructField(itemValue,StringType,true), StructField(reportParams,StringType,true), StructField(build_type,StringType,true), StructField(ref_page_id,StringType,true), StructField(itemName,StringType,true), StructField(imei,StringType,true), StructField(lid,StringType,true), StructField(app_start_time,StringType,true), StructField(omgbizid,StringType,true), StructField(imsi,StringType,true), StructField(vid,StringType,true), StructField(page_id,StringType,true), StructField(pid,StringType,true), StructField(notification_enable,StringType,true), StructField(call_type,StringType,true), StructField(streamid,StringType,true), StructField(flavor,StringType,true), ... 12 more fields) AS ...
{code}
[jira] [Created] (SPARK-21265) Cache method could specify a name
darion yaphet created SPARK-21265: - Summary: Cache method could specify a name Key: SPARK-21265 URL: https://issues.apache.org/jira/browse/SPARK-21265 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.1.1, 2.0.2 Reporter: darion yaphet Currently, when I cache a Dataset in my cluster, the *RDD Name* shown in the Storage tab is the RDD's schema. When the structure is very complex, this is hard to read. I would like a parameter to specify the RDD's name, with the schema remaining the default.
{code}
Project [ftime#141, id#142L, ei#143, ui#144, kv#145, sts#146L, jsontostruct(StructField(filter,StringType,true), StructField(is_auto,StringType,true), StructField(omgid,StringType,true), StructField(devid,StringType,true), StructField(guid,StringType,true), StructField(ztid,StringType,true), StructField(page_step,StringType,true), StructField(itemValue,StringType,true), StructField(reportParams,StringType,true), StructField(build_type,StringType,true), StructField(ref_page_id,StringType,true), StructField(itemName,StringType,true), StructField(imei,StringType,true), StructField(lid,StringType,true), StructField(app_start_time,StringType,true), StructField(omgbizid,StringType,true), StructField(imsi,StringType,true), StructField(vid,StringType,true), StructField(page_id,StringType,true), StructField(pid,StringType,true), StructField(notification_enable,StringType,true), StructField(call_type,StringType,true), StructField(streamid,StringType,true), StructField(flavor,StringType,true), ... 12 more fields) AS ...
{code}
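For the RDD API such a name already exists via RDD.setName, which is what the Storage tab displays once the RDD is persisted; the ticket asks for an equivalent parameter on Dataset. A rough sketch of the existing RDD-level workaround, with the Dataset overload shown only as the hypothetical API the ticket proposes:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-name-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Existing workaround: name the underlying RDD before persisting it.
// The Storage tab then shows "events" instead of the full schema string.
val named = df.rdd.setName("events").persist(StorageLevel.MEMORY_ONLY)
named.count()

// Proposed API from the ticket (hypothetical, not part of Spark):
// df.cache("events")
{code}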
[jira] [Comment Edited] (SPARK-21233) Support pluggable offset storage
[ https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179 ] darion yaphet edited comment on SPARK-21233 at 6/28/17 9:14 AM: Hi [Sean|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=srowen], in Kafka 0.8 the client uses zkClient to commit offsets to the ZooKeeper cluster. It seems that Kafka 0.10+ can store offsets in a topic. I would like to add some configuration items to control the storage implementation and related parameters. was (Author: darion): [Sean|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=srowen] In Kafka 0.8 the client uses zkClient to commit offsets to the ZooKeeper cluster. It seems that Kafka 0.10+ can store offsets in a topic. I would like to add some configuration items to control the storage implementation and related parameters. > Support pluggable offset storage > > > Key: SPARK-21233 > URL: https://issues.apache.org/jira/browse/SPARK-21233 > Project: Spark > Issue Type: New Feature > Components: DStreams > Affects Versions: 2.0.2, 2.1.1 > Reporter: darion yaphet > > Currently we use *ZooKeeper* to store the *Kafka commit offsets*. When many streaming programs are running in the cluster, the load on the ZooKeeper cluster becomes very high, and ZooKeeper may not be well suited to storing offsets periodically. > This issue proposes a pluggable offset storage so that offsets do not have to be kept in ZooKeeper.
[jira] [Commented] (SPARK-21233) Support pluggable offset storage
[ https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179 ] darion yaphet commented on SPARK-21233: --- [Sean Owen|sro...@gmail.com] In Kafka 0.8 the client uses zkClient to commit offsets to the ZooKeeper cluster. It seems that Kafka 0.10+ can store offsets in a topic. I would like to add some configuration items to control the storage implementation and related parameters. > Support pluggable offset storage > > > Key: SPARK-21233 > URL: https://issues.apache.org/jira/browse/SPARK-21233 > Project: Spark > Issue Type: New Feature > Components: DStreams > Affects Versions: 2.0.2, 2.1.1 > Reporter: darion yaphet > > Currently we use *ZooKeeper* to store the *Kafka commit offsets*. When many streaming programs are running in the cluster, the load on the ZooKeeper cluster becomes very high, and ZooKeeper may not be well suited to storing offsets periodically. > This issue proposes a pluggable offset storage so that offsets do not have to be kept in ZooKeeper.
[jira] [Created] (SPARK-21233) Support pluggable offset storage
darion yaphet created SPARK-21233: - Summary: Support pluggable offset storage Key: SPARK-21233 URL: https://issues.apache.org/jira/browse/SPARK-21233 Project: Spark Issue Type: New Feature Components: DStreams Affects Versions: 2.1.1, 2.0.2 Reporter: darion yaphet Currently we use *ZooKeeper* to store the *Kafka commit offsets*. When many streaming programs are running in the cluster, the load on the ZooKeeper cluster becomes very high, and ZooKeeper may not be well suited to storing offsets periodically. This issue proposes a pluggable offset storage so that offsets do not have to be kept in ZooKeeper.
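A rough sketch of what a pluggable offset store could look like. The trait name, the method signatures, and the in-memory implementation below are hypothetical illustrations of the proposal, not an existing Spark API:

{code:scala}
import org.apache.kafka.common.TopicPartition

// Hypothetical plug-in point: implementations decide where offsets live
// (ZooKeeper, a Kafka topic, HBase, a relational database, ...).
trait OffsetStore {
  def commit(groupId: String, offsets: Map[TopicPartition, Long]): Unit
  def fetch(groupId: String, partitions: Seq[TopicPartition]): Map[TopicPartition, Long]
}

// Example implementation keeping offsets in memory, standing in for a real backend.
class InMemoryOffsetStore extends OffsetStore {
  private val state = scala.collection.mutable.Map.empty[(String, TopicPartition), Long]

  override def commit(groupId: String, offsets: Map[TopicPartition, Long]): Unit =
    offsets.foreach { case (tp, off) => state((groupId, tp)) = off }

  override def fetch(groupId: String, partitions: Seq[TopicPartition]): Map[TopicPartition, Long] =
    partitions.flatMap(tp => state.get((groupId, tp)).map(tp -> _)).toMap
}
{code}

A configuration item such as the one the comment mentions would then select which OffsetStore implementation to instantiate.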
[jira] [Created] (SPARK-21191) DataFrame Row StructType should check duplicate names
darion yaphet created SPARK-21191: - Summary: DataFrame Row StructType should check duplicate names Key: SPARK-21191 URL: https://issues.apache.org/jira/browse/SPARK-21191 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 2.0.2, 2.0.0 Reporter: darion yaphet Currently, when we create a DataFrame with *toDF(columns: String)* or build it from a *StructType*, nothing prevents duplicated column names.
{code:scala}
val dataset = Seq(
  (0, 3, 4), (0, 4, 3), (0, 5, 2),
  (1, 3, 3), (1, 5, 6), (1, 4, 2),
  (2, 3, 5), (2, 5, 4), (2, 4, 3)
).toDF("1", "1", "2").show
{code}
{code}
+---+---+---+
|  1|  1|  2|
+---+---+---+
|  0|  3|  4|
|  0|  4|  3|
|  0|  5|  2|
|  1|  3|  3|
|  1|  5|  6|
|  1|  4|  2|
|  2|  3|  5|
|  2|  5|  4|
|  2|  4|  3|
+---+---+---+
{code}
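A minimal sketch of the kind of validation the ticket asks for; the helper below is hypothetical and simply rejects a schema whose field names are not unique:

{code:scala}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: fail fast when a schema contains duplicate field names.
def requireUniqueFieldNames(schema: StructType): Unit = {
  val duplicates = schema.fieldNames
    .groupBy(identity)
    .collect { case (name, occurrences) if occurrences.length > 1 => name }
  require(duplicates.isEmpty, s"Duplicate column names found: ${duplicates.mkString(", ")}")
}
{code}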
[jira] [Updated] (SPARK-21116) Support map_contains function
[ https://issues.apache.org/jira/browse/SPARK-21116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet updated SPARK-21116: -- Summary: Support map_contains function (was: Support MapKeyContains function) > Support map_contains function > - > > Key: SPARK-21116 > URL: https://issues.apache.org/jira/browse/SPARK-21116 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.0.2, 2.1.1 > Reporter: darion yaphet > > map_contains(map, key) checks whether the map contains the key and returns true if it does. It is similar to *array_contains*. > For example, map_contains(map(1, 'a', 2, 'b'), 1) returns true.
[jira] [Updated] (SPARK-21116) Support MapKeyContains function
[ https://issues.apache.org/jira/browse/SPARK-21116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet updated SPARK-21116: -- Description: map_contains(map, key) checks whether the map contains the key and returns true if it does. It is similar to *array_contains*. For example, map_contains(map(1, 'a', 2, 'b'), 1) returns true. was: map_contains(map, key) checks whether the map contains the key and returns true if it does. It is similar to *array_contains*. > Support MapKeyContains function > --- > > Key: SPARK-21116 > URL: https://issues.apache.org/jira/browse/SPARK-21116 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.0.2, 2.1.1 > Reporter: darion yaphet > > map_contains(map, key) checks whether the map contains the key and returns true if it does. It is similar to *array_contains*. > For example, map_contains(map(1, 'a', 2, 'b'), 1) returns true.
[jira] [Updated] (SPARK-21116) Support MapKeyContains function
[ https://issues.apache.org/jira/browse/SPARK-21116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet updated SPARK-21116: -- Description: map_contains(map, key) checks whether the map contains the key and returns true if it does. It is similar to *array_contains*. was: Check whether the map contains the key; returns true if it does. > Support MapKeyContains function > --- > > Key: SPARK-21116 > URL: https://issues.apache.org/jira/browse/SPARK-21116 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.0.2, 2.1.1 > Reporter: darion yaphet > > map_contains(map, key) checks whether the map contains the key and returns true if it does. It is similar to *array_contains*.
[jira] [Created] (SPARK-21116) Support MapKeyContains function
darion yaphet created SPARK-21116: - Summary: Support MapKeyContains function Key: SPARK-21116 URL: https://issues.apache.org/jira/browse/SPARK-21116 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.1, 2.0.2 Reporter: darion yaphet Check whether the map contains the key; returns true if it does.
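For reference, the same check can already be expressed by combining existing functions, so the proposal is mainly a convenience. A small sketch, assuming a Spark version where map_keys is available in the DataFrame functions API (2.3+) and with a made-up column name:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, map_keys}

val spark = SparkSession.builder().appName("map-contains-sketch").getOrCreate()
import spark.implicits._

val df = Seq(Map(1 -> "a", 2 -> "b"), Map(3 -> "c")).toDF("m")

// Equivalent of the proposed map_contains(m, 1): look the key up among map_keys(m).
df.select(array_contains(map_keys($"m"), 1).as("has_key_1")).show()

// Or directly in SQL: SELECT array_contains(map_keys(m), 1) FROM ...
{code}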
[jira] [Commented] (SPARK-21104) Support sorting by index when parsing LibSVM records
[ https://issues.apache.org/jira/browse/SPARK-21104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051388#comment-16051388 ] darion yaphet commented on SPARK-21104: --- We should make sure the array is ordered, instead of checking the order and reporting an exception. > Support sorting by index when parsing LibSVM records > > > Key: SPARK-21104 > URL: https://issues.apache.org/jira/browse/SPARK-21104 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.1 > Reporter: darion yaphet > Priority: Minor > > When loading LibSVM data from HDFS, I found that the feature indices must be in ascending order. > We could sort by *indices* when parsing each input line into (index, value) tuples, and avoid checking afterwards whether the indices are in ascending order.
[jira] [Commented] (SPARK-21104) Support sorting by index when parsing LibSVM records
[ https://issues.apache.org/jira/browse/SPARK-21104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16050267#comment-16050267 ] darion yaphet commented on SPARK-21104: --- Requiring the features to arrive pre-sorted by index is not necessary, so I added a sort step to the record parser. > Support sorting by index when parsing LibSVM records > > > Key: SPARK-21104 > URL: https://issues.apache.org/jira/browse/SPARK-21104 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.1 > Reporter: darion yaphet > Priority: Minor > > When loading LibSVM data from HDFS, I found that the feature indices must be in ascending order. > We could sort by *indices* when parsing each input line into (index, value) tuples, and avoid checking afterwards whether the indices are in ascending order.
[jira] [Updated] (SPARK-21104) Support sorting by index when parsing LibSVM records
[ https://issues.apache.org/jira/browse/SPARK-21104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet updated SPARK-21104: -- Description: When loading LibSVM data from HDFS, I found that the feature indices must be in ascending order. We could sort by *indices* when parsing each input line into (index, value) tuples, and avoid checking afterwards whether the indices are in ascending order. was: When loading LibSVM data from HDFS, I found that the feature indices must be in ascending order. We could sort by *indices* when parsing each input line into (index, value) tuples, and avoid checking afterwards whether the indices are in ascending order. > Support sorting by index when parsing LibSVM records > > > Key: SPARK-21104 > URL: https://issues.apache.org/jira/browse/SPARK-21104 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.1 > Reporter: darion yaphet > Priority: Minor > > When loading LibSVM data from HDFS, I found that the feature indices must be in ascending order. > We could sort by *indices* when parsing each input line into (index, value) tuples, and avoid checking afterwards whether the indices are in ascending order.
[jira] [Created] (SPARK-21104) Support sorting by index when parsing LibSVM records
darion yaphet created SPARK-21104: - Summary: Support sorting by index when parsing LibSVM records Key: SPARK-21104 URL: https://issues.apache.org/jira/browse/SPARK-21104 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.1.1 Reporter: darion yaphet Priority: Minor When loading LibSVM data from HDFS, I found that the feature indices must be in ascending order. We could sort by *indices* when parsing each input line into (index, value) tuples, and avoid checking afterwards whether the indices are in ascending order.
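A small sketch of the proposed parsing step: sort the (index, value) pairs while parsing instead of validating the order afterwards. The helper below is a simplified stand-in for Spark's actual LibSVM parser, not the real implementation:

{code:scala}
// Parse one LibSVM line such as "1.0 3:0.5 1:0.2" into a label and
// features sorted by index, instead of assuming the input is pre-sorted.
def parseLibSVMLine(line: String): (Double, Array[(Int, Double)]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val features = tokens.tail.map { item =>
    val Array(index, value) = item.split(":")
    (index.toInt - 1, value.toDouble)   // LibSVM indices are 1-based
  }.sortBy(_._1)                        // sort here so no later order check is needed
  (label, features)
}

// Example: indices 3 and 1 arrive out of order but come back sorted.
println(parseLibSVMLine("1.0 3:0.5 1:0.2")._2.mkString(", "))
{code}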
[jira] [Updated] (SPARK-21073) Support map_keys and map_values functions in DataSet
[ https://issues.apache.org/jira/browse/SPARK-21073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet updated SPARK-21073: -- Summary: Support map_keys and map_values functions in DataSet (was: Support map_keys and map_values in DataSet) > Support map_keys and map_values functions in DataSet > > > Key: SPARK-21073 > URL: https://issues.apache.org/jira/browse/SPARK-21073 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.0.2, 2.1.1 > Reporter: darion yaphet > > Add *map_keys* to get the keys from a MapType column and *map_values* to get the values from a MapType column.
[jira] [Updated] (SPARK-21073) Support map_keys and map_values in DataSet
[ https://issues.apache.org/jira/browse/SPARK-21073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet updated SPARK-21073: -- Summary: Support map_keys and map_values in DataSet (was: Support map_keys and map_values in DataSet and DataFrame) > Support map_keys and map_values in DataSet > -- > > Key: SPARK-21073 > URL: https://issues.apache.org/jira/browse/SPARK-21073 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.0.2, 2.1.1 > Reporter: darion yaphet > > Add *map_keys* to get the keys from a MapType column and *map_values* to get the values from a MapType column.
[jira] [Created] (SPARK-21073) Support map_keys and map_values in DataSet and DataFrame
darion yaphet created SPARK-21073: - Summary: Support map_keys and map_values in DataSet and DataFrame Key: SPARK-21073 URL: https://issues.apache.org/jira/browse/SPARK-21073 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.1, 2.0.2 Reporter: darion yaphet Add *map_keys* to get the keys from a MapType column and *map_values* to get the values from a MapType column.
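map_keys and map_values exist as SQL expressions, so in the meantime they can be reached from a Dataset through selectExpr; wrappers in org.apache.spark.sql.functions were only added in later releases. A brief sketch with a made-up column name, assuming a Spark version where the SQL functions are available:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-keys-values-sketch").getOrCreate()
import spark.implicits._

val df = Seq(Map("a" -> 1, "b" -> 2)).toDF("m")

// Go through the SQL expression layer when the Scala wrappers are not present.
df.selectExpr("map_keys(m) AS keys", "map_values(m) AS values").show(truncate = false)
{code}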
[jira] [Updated] (SPARK-21066) LibSVM load just one input file
[ https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet updated SPARK-21066: -- Description: Currently, when we use the LibSVM data source to load training data, only a single input file is accepted. Files stored on a distributed file system such as HDFS are split into multiple pieces, so I think this limit is not necessary. We could accept multiple input paths joined into a comma-separated string. was: Currently, when we use the LibSVM data source to load training data, only a single input file is accepted. The source code is as follows:
{code:scala}
val path = if (dataFiles.length == 1) {
  dataFiles.head.getPath.toUri.toString
} else if (dataFiles.isEmpty) {
  throw new IOException("No input path specified for libsvm data")
} else {
  throw new IOException("Multiple input paths are not supported for libsvm data.")
}
{code}
Files stored on a distributed file system such as HDFS are split into multiple pieces, so I think this limit is not necessary. We could accept multiple input paths joined into a comma-separated string. > LibSVM load just one input file > --- > > Key: SPARK-21066 > URL: https://issues.apache.org/jira/browse/SPARK-21066 > Project: Spark > Issue Type: Bug > Components: ML > Affects Versions: 2.1.1 > Reporter: darion yaphet > > Currently, when we use the LibSVM data source to load training data, only a single input file is accepted. > Files stored on a distributed file system such as HDFS are split into multiple pieces, so I think this limit is not necessary. > We could accept multiple input paths joined into a comma-separated string.
[jira] [Created] (SPARK-21066) LibSVM load just one input file
darion yaphet created SPARK-21066: - Summary: LibSVM load just one input file Key: SPARK-21066 URL: https://issues.apache.org/jira/browse/SPARK-21066 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.1.1 Reporter: darion yaphet Currently, when we use the LibSVM data source to load training data, only a single input file is accepted. The source code is as follows:
{code:scala}
val path = if (dataFiles.length == 1) {
  dataFiles.head.getPath.toUri.toString
} else if (dataFiles.isEmpty) {
  throw new IOException("No input path specified for libsvm data")
} else {
  throw new IOException("Multiple input paths are not supported for libsvm data.")
}
{code}
Files stored on a distributed file system such as HDFS are split into multiple pieces, so I think this limit is not necessary. We could accept multiple input paths joined into a comma-separated string.
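One possible workaround, assuming the single-file restriction exists because the data source otherwise has to scan one file to infer the feature dimensionality: pass numFeatures explicitly and union per-path DataFrames. A sketch with made-up paths and feature count:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("libsvm-multi-path-sketch").getOrCreate()

// Hypothetical HDFS paths; with numFeatures given explicitly the data source
// does not need a single file from which to infer the dimensionality.
val paths = Seq("hdfs:///data/libsvm/part-00000", "hdfs:///data/libsvm/part-00001")

val combined = paths
  .map(p => spark.read.format("libsvm").option("numFeatures", "784").load(p))
  .reduce(_ union _)

println(combined.count())
{code}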
[jira] [Commented] (SPARK-21032) Support add_years and add_days functions
[ https://issues.apache.org/jira/browse/SPARK-21032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046187#comment-16046187 ] darion yaphet commented on SPARK-21032: --- Yes, it seems these functions are effectively implemented already. Should we add aliases for the two functions to make this clearer? > Support add_years and add_days functions > - > > Key: SPARK-21032 > URL: https://issues.apache.org/jira/browse/SPARK-21032 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.1.1 > Reporter: darion yaphet > > Currently Spark SQL has an add_months function, and we would like to support add_years and add_days. add_years and add_days would presumably be similar to add_months.
[jira] [Created] (SPARK-21032) Support add_years and add_days functions
darion yaphet created SPARK-21032: - Summary: Support add_years and add_days functions Key: SPARK-21032 URL: https://issues.apache.org/jira/browse/SPARK-21032 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.1 Reporter: darion yaphet Currently Spark SQL has an add_months function, and we would like to support add_years and add_days. add_years and add_days would presumably be similar to add_months.
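As the follow-up comment suggests, the existing date functions already cover these cases. A short sketch of expressing the proposed add_days and add_years with date_add and add_months (the column name is made up):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{add_months, date_add, to_date}

val spark = SparkSession.builder().appName("date-arithmetic-sketch").getOrCreate()
import spark.implicits._

val df = Seq("2017-06-09").toDF("d").select(to_date($"d").as("d"))

df.select(
  date_add($"d", 10).as("plus_10_days"),      // the proposed add_days(d, 10)
  add_months($"d", 12 * 2).as("plus_2_years") // the proposed add_years(d, 2)
).show()
{code}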
[jira] [Created] (SPARK-21015) Check field name is not null or empty in GenericRowWithSchema
darion yaphet created SPARK-21015: - Summary: Check field name is not null or empty in GenericRowWithSchema Key: SPARK-21015 URL: https://issues.apache.org/jira/browse/SPARK-21015 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.1 Reporter: darion yaphet Priority: Minor When we get a field index from a row with a schema, we should make sure the field name is not null or empty.
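A minimal sketch of the proposed validation around fieldIndex; the wrapper is hypothetical, while Row.fieldIndex itself is existing API:

{code:scala}
import org.apache.spark.sql.Row

// Hypothetical guard: reject null/empty names before delegating to Row.fieldIndex,
// which otherwise surfaces a less obvious error.
def safeFieldIndex(row: Row, fieldName: String): Int = {
  require(fieldName != null && fieldName.nonEmpty, "field name must not be null or empty")
  row.fieldIndex(fieldName)
}
{code}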
[jira] [Created] (SPARK-21014) Support getting fields by name from Row with schema
darion yaphet created SPARK-21014: - Summary: Support getting fields by name from Row with schema Key: SPARK-21014 URL: https://issues.apache.org/jira/browse/SPARK-21014 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.1 Reporter: darion yaphet Priority: Minor Currently, when we want to get a field from a row, we have to fetch it by index. For a row that has a schema, being able to read a field by name would be very useful.
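For rows that carry a schema, name-based access is in fact already available through Row.getAs and Row.fieldIndex; a quick sketch (the field names are made up):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("row-field-by-name-sketch").getOrCreate()
import spark.implicits._

val row = Seq(("u1", 42)).toDF("user", "score").head()

// Rows produced from a DataFrame keep their schema, so fields can be read by name.
val user  = row.getAs[String]("user")
val score = row.getAs[Int]("score")
println(s"$user -> $score")
{code}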
[jira] [Closed] (SPARK-20968) Support separator in Tokenizer
[ https://issues.apache.org/jira/browse/SPARK-20968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet closed SPARK-20968. - Resolution: Fixed > Support separator in Tokenizer > -- > > Key: SPARK-20968 > URL: https://issues.apache.org/jira/browse/SPARK-20968 > Project: Spark > Issue Type: New Feature > Components: MLlib > Affects Versions: 2.0.0, 2.0.2, 2.1.1 > Reporter: darion yaphet > Priority: Minor
[jira] [Created] (SPARK-20968) Support separator in Tokenizer
darion yaphet created SPARK-20968: - Summary: Support separator in Tokenizer Key: SPARK-20968 URL: https://issues.apache.org/jira/browse/SPARK-20968 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 2.1.1, 2.0.2, 2.0.0 Reporter: darion yaphet Priority: Minor
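The request was closed as fixed, presumably because RegexTokenizer already supports a configurable split pattern, which covers the separator use case. A brief sketch using an assumed '|' separator:

{code:scala}
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tokenizer-separator-sketch").getOrCreate()
import spark.implicits._

val df = Seq("a|b|c", "x|y").toDF("text")

// RegexTokenizer splits on a configurable pattern, here the '|' separator.
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")
  .setPattern("\\|")

tokenizer.transform(df).show(truncate = false)
{code}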
[jira] [Created] (SPARK-20740) Expose UserDefinedType so that it can be extended
darion yaphet created SPARK-20740: - Summary: Expose UserDefinedType so that it can be extended Key: SPARK-20740 URL: https://issues.apache.org/jira/browse/SPARK-20740 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Reporter: darion yaphet Users may want to extend UserDefinedType to create their own data types. We should make UserDefinedType a public class.
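For illustration, the rough shape a user-defined type would take if UserDefinedType were public. The Point class and its encoding below are hypothetical, and in Spark 2.x the base class is private[spark], which is exactly what the ticket objects to, so this would not compile in user code today:

{code:scala}
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

case class Point(x: Double, y: Double)

// Hypothetical UDT: store a Point as an array of two doubles.
class PointUDT extends UserDefinedType[Point] {
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  override def serialize(p: Point): Any =
    new GenericArrayData(Array[Any](p.x, p.y))

  override def deserialize(datum: Any): Point = datum match {
    case data: ArrayData => Point(data.getDouble(0), data.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}
{code}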
[jira] [Updated] (SPARK-20675) Support an index to skip scans of the disk structure in CoGroupedRDD
[ https://issues.apache.org/jira/browse/SPARK-20675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet updated SPARK-20675: -- Description: CoGroupedRDD's compute() reads each StreamBuffer (a disk structure that maintains key-value pairs sorted by key) and merges entries with the same key into one. So I think adding a sequence index file, or appending an index section at the head of the temporary shuffle file, would let us seek to the appropriate position and skip a lot of unnecessary scanning. > Support an index to skip scans of the disk structure in CoGroupedRDD > > > Key: SPARK-20675 > URL: https://issues.apache.org/jira/browse/SPARK-20675 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.1.1 > Reporter: darion yaphet > > CoGroupedRDD's compute() reads each StreamBuffer (a disk structure that maintains key-value pairs sorted by key) and merges entries with the same key into one. > So I think adding a sequence index file, or appending an index section at the head of the temporary shuffle file, would let us seek to the appropriate position and skip a lot of unnecessary scanning.
[jira] [Created] (SPARK-20675) Support an index to skip scans of the disk structure in CoGroupedRDD
darion yaphet created SPARK-20675: - Summary: Support an index to skip scans of the disk structure in CoGroupedRDD Key: SPARK-20675 URL: https://issues.apache.org/jira/browse/SPARK-20675 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.1 Reporter: darion yaphet
[jira] [Closed] (SPARK-20610) Support a function to get the DataFrame/DataSet from a Transformer
[ https://issues.apache.org/jira/browse/SPARK-20610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet closed SPARK-20610. - Resolution: Won't Fix > Support a function to get the DataFrame/DataSet from a Transformer > - > > Key: SPARK-20610 > URL: https://issues.apache.org/jira/browse/SPARK-20610 > Project: Spark > Issue Type: New Feature > Components: ML > Affects Versions: 2.0.2, 2.1.0 > Reporter: darion yaphet > > We use stages to build our machine learning pipeline. A Transformer transforms an input dataset into another output dataset (our DataFrame). > Sometimes we want to test the resulting DataFrame while developing the pipeline, but it looks difficult to run such a test. If Spark ML stages supported an interface to inspect the DataFrame produced by each stage, we could use it for testing.
[jira] [Created] (SPARK-20610) Support a function to get the DataFrame/DataSet from a Transformer
darion yaphet created SPARK-20610: - Summary: Support a function to get the DataFrame/DataSet from a Transformer Key: SPARK-20610 URL: https://issues.apache.org/jira/browse/SPARK-20610 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.1.0, 2.0.2 Reporter: darion yaphet We use stages to build our machine learning pipeline. A Transformer transforms an input dataset into another output dataset (our DataFrame). Sometimes we want to test the resulting DataFrame while developing the pipeline, but it looks difficult to run such a test. If Spark ML stages supported an interface to inspect the DataFrame produced by each stage, we could use it for testing.
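The usual way to inspect a stage's output in a test is to call transform on that stage directly with a small input DataFrame, which may be why the ticket was closed. A short sketch using a Tokenizer as the stage under test:

{code:scala}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stage-output-test-sketch").getOrCreate()
import spark.implicits._

val input = Seq("spark ml pipeline").toDF("text")
val stage = new Tokenizer().setInputCol("text").setOutputCol("words")

// Calling transform on the single stage exposes its output DataFrame for assertions.
val output = stage.transform(input)
output.show(truncate = false)
assert(output.columns.contains("words"))
{code}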
[jira] [Created] (SPARK-20469) Add a method to display DataFrame schema in PipelineStage
darion yaphet created SPARK-20469: - Summary: Add a method to display DataFrame schema in PipelineStage Key: SPARK-20469 URL: https://issues.apache.org/jira/browse/SPARK-20469 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 2.1.0, 2.0.2, 1.6.3 Reporter: darion yaphet Priority: Minor Sometimes we apply Transformers and Estimators in a pipeline. If a PipelineStage could display its schema, that would be a big help for understanding and checking the dataset.
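PipelineStage already exposes transformSchema, which can be used to print the schema a stage will produce without running it. A small sketch (the stage and columns are made up):

{code:scala}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stage-schema-sketch").getOrCreate()
import spark.implicits._

val df = Seq("hello world").toDF("text")
val stage = new Tokenizer().setInputCol("text").setOutputCol("words")

// transformSchema computes the output schema from the input schema alone,
// so the shape of the data after the stage can be checked up front.
stage.transformSchema(df.schema).printTreeString()
{code}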
[jira] [Closed] (SPARK-20276) Use ScheduledExecutorService to control the sleep interval
[ https://issues.apache.org/jira/browse/SPARK-20276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet closed SPARK-20276. - Resolution: Won't Fix > Use ScheduledExecutorService to control the sleep interval > --- > > Key: SPARK-20276 > URL: https://issues.apache.org/jira/browse/SPARK-20276 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.0 > Reporter: darion yaphet > Priority: Minor > > The SessionManager starts a timeout checker that runs periodically. We could use a ScheduledExecutorService to control the time interval and replace Thread.sleep; that seems easier and more elegant.
[jira] [Commented] (SPARK-20276) Use ScheduledExecutorService to control the sleep interval
[ https://issues.apache.org/jira/browse/SPARK-20276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962630#comment-15962630 ] darion yaphet commented on SPARK-20276: --- Thanks for your response :) *org.apache.hive.service.cli.session.SessionManager* seems to be copied from the Hive source code. I will close this issue. > Use ScheduledExecutorService to control the sleep interval > --- > > Key: SPARK-20276 > URL: https://issues.apache.org/jira/browse/SPARK-20276 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.0 > Reporter: darion yaphet > Priority: Minor > > The SessionManager starts a timeout checker that runs periodically. We could use a ScheduledExecutorService to control the time interval and replace Thread.sleep; that seems easier and more elegant.
[jira] [Created] (SPARK-20276) Use ScheduledExecutorService to control the sleep interval
darion yaphet created SPARK-20276: - Summary: Use ScheduledExecutorService to control the sleep interval Key: SPARK-20276 URL: https://issues.apache.org/jira/browse/SPARK-20276 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: darion yaphet Priority: Minor The SessionManager starts a timeout checker that runs periodically. We could use a ScheduledExecutorService to control the time interval and replace Thread.sleep; that seems easier and more elegant.
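A brief sketch of the replacement the ticket suggests: a fixed-delay scheduled task instead of a loop around Thread.sleep. The checker body and the 60-second interval are placeholders, not the real SessionManager logic:

{code:scala}
import java.util.concurrent.{Executors, TimeUnit}

// Instead of: while (running) { checkTimedOutSessions(); Thread.sleep(intervalMs) }
val checker = Executors.newSingleThreadScheduledExecutor()

checker.scheduleWithFixedDelay(new Runnable {
  override def run(): Unit = {
    // Placeholder for the real timeout check over open sessions.
    println("checking for timed-out sessions")
  }
}, 0L, 60L, TimeUnit.SECONDS)

// checker.shutdown() when the service stops.
{code}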
[jira] [Created] (SPARK-19794) Release HDFS Client after read/write checkpoint
darion yaphet created SPARK-19794: - Summary: Release HDFS Client after read/write checkpoint Key: SPARK-19794 URL: https://issues.apache.org/jira/browse/SPARK-19794 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0, 2.0.2 Reporter: darion yaphet RDD checkpointing writes each partition to HDFS and reads it back when the RDD needs recomputation. After these HDFS operations, the HDFS client and streams should be closed.
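A minimal sketch of the pattern the ticket asks for when writing a checkpoint partition: always releasing the HDFS stream in a finally block. The path and the serialized bytes are simplified placeholders, not Spark's actual checkpoint code:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val path = new Path("hdfs:///checkpoints/rdd-42/part-00000")
val fs = FileSystem.get(path.toUri, conf)

// Write the partition bytes and make sure the stream is released even on failure.
val out = fs.create(path)
try {
  out.write(Array[Byte](1, 2, 3)) // placeholder for the serialized partition
} finally {
  out.close()
}
{code}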
[jira] [Created] (SPARK-18064) Spark SQL can't load default config file
darion yaphet created SPARK-18064: - Summary: Spark SQL can't load default config file Key: SPARK-18064 URL: https://issues.apache.org/jira/browse/SPARK-18064 Project: Spark Issue Type: Bug Components: SQL Reporter: darion yaphet
[jira] [Updated] (SPARK-18047) Spark worker port should be greater than 1023
[ https://issues.apache.org/jira/browse/SPARK-18047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] darion yaphet updated SPARK-18047: -- Description: The port numbers in the range from 0 to 1023 are the well-known ports (system ports). They are widely used by system network services, such as Telnet (23), Simple Mail Transfer Protocol (25), and Domain Name System (53). The worker port should avoid using these ports. was: The port numbers in the range from 0 to 1023 are the well-known ports (system ports). They are widely used by system network services, such as Telnet (23), Simple Mail Transfer Protocol (25), and Domain Name System (53). The worker port should avoid using these ports. > Spark worker port should be greater than 1023 > - > > Key: SPARK-18047 > URL: https://issues.apache.org/jira/browse/SPARK-18047 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.0.0, 2.0.1 > Reporter: darion yaphet > > The port numbers in the range from 0 to 1023 are the well-known ports (system ports). > They are widely used by system network services, such as Telnet (23), Simple Mail Transfer Protocol (25), and Domain Name System (53). > The worker port should avoid using these ports.
[jira] [Created] (SPARK-18047) Spark worker port should be greater than 1023
darion yaphet created SPARK-18047: - Summary: Spark worker port should be greater than 1023 Key: SPARK-18047 URL: https://issues.apache.org/jira/browse/SPARK-18047 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.1, 2.0.0 Reporter: darion yaphet The port numbers in the range from 0 to 1023 are the well-known ports (system ports). They are widely used by system network services, such as Telnet (23), Simple Mail Transfer Protocol (25), and Domain Name System (53). The worker port should avoid using these ports.
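A minimal sketch of the validation this implies when the worker port is configured; the helper and the treatment of 0 as "pick a random port" are illustrative only:

{code:scala}
// Reject well-known/system ports when the worker port is configured explicitly.
def validateWorkerPort(port: Int): Int = {
  require(port == 0 || port > 1023,
    s"Spark worker port must be 0 (random) or greater than 1023, got $port")
  port
}
{code}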