[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729261#comment-16729261 ]

Jacky Li commented on SPARK-24630:
----------------------------------

Actually, I encountered this scenario earlier, so we have implemented some commands for using Spark Streaming (or Structured Streaming) on Apache CarbonData; you can refer to the StreamSQL section in this [doc|http://carbondata.apache.org/streaming-guide.html] for more detail. It would be good if the Spark community can use similar or the same syntax; if possible, future versions of CarbonData can then migrate to Spark's syntax.

> SPIP: Support SQLStreaming in Spark
> -----------------------------------
>
>                 Key: SPARK-24630
>                 URL: https://issues.apache.org/jira/browse/SPARK-24630
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Jackey Lee
>            Priority: Minor
>              Labels: SQLStreaming
>         Attachments: SQLStreaming SPIP V2.pdf
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), SQLStream, and StormSQL all provide a streaming SQL interface, with which users with little knowledge about streaming can easily develop a stream processing model. In Spark, we can also support a SQL API based on Structured Streaming.
> To support SQL Streaming, there are two key points:
> 1. The parser should be able to parse streaming SQL.
> 2. The Analyzer should be able to map metadata information to the corresponding Relation.
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701444#comment-16701444 ]

Jacky Li commented on SPARK-24630:
----------------------------------

Besides CREATE TABLE/STREAM to create the source and the sink, is there any syntax to manipulate the streaming job, like starting and stopping it? If I understand correctly, the INSERT statement is currently proposed to kick off the Structured Streaming job. Since this streaming job is continuous, I am wondering whether there is a way to show/describe/stop it.

> SPIP: Support SQLStreaming in Spark
> -----------------------------------
>
>                 Key: SPARK-24630
>                 URL: https://issues.apache.org/jira/browse/SPARK-24630
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Jackey Lee
>            Priority: Minor
>              Labels: SQLStreaming
>         Attachments: SQLStreaming SPIP.pdf
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), SQLStream, and StormSQL all provide a streaming SQL interface, with which users with little knowledge about streaming can easily develop a stream processing model. In Spark, we can also support a SQL API based on Structured Streaming.
> To support SQL Streaming, there are two key points:
> 1. The parser should be able to parse streaming SQL.
> 2. The Analyzer should be able to map metadata information to the corresponding Relation.
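For context, here is a minimal Scala sketch of how the lifecycle operations being asked about (start, show, stop) look in the existing programmatic Structured Streaming API; the Kafka source options and the console sink are made up for illustration, and the SQL-level equivalents are exactly what the SPIP would have to define:

{code:scala}
import org.apache.spark.sql.SparkSession

object StreamingJobLifecycle {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stream-lifecycle-sketch").getOrCreate()

    // Source: a hypothetical Kafka topic; the options are illustrative only.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // "INSERT INTO sink SELECT ..." in the proposal roughly corresponds to
    // writeStream.start(), which kicks off the continuous job.
    val query = events.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoint-sketch")
      .start()

    // A "SHOW/DESC" command would correspond to inspecting the active queries.
    spark.streams.active.foreach(q => println(s"${q.id} -> ${q.status}"))

    // A "STOP" command would correspond to stopping a running query.
    query.stop()
    spark.stop()
  }
}
{code}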
[jira] [Updated] (SPARK-16316) dataframe except API returning wrong result in spark 1.5.0
[ https://issues.apache.org/jira/browse/SPARK-16316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacky Li updated SPARK-16316:
-----------------------------
    Description: 
Version: Spark 1.5.0
Use case: use the except API to do a subtract between two DataFrames

scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> dfa.except(dfb).count
res13: Long = 0

It should return 90 instead of 0.

While the following statement works fine:

scala> dfa.except(dfb).rdd.count
res13: Long = 90

I guess the bug may be somewhere in DataFrame.count.

  was:
Version: Spark 1.5.0
Use case: use the except API to do a subtract between two DataFrames

scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> dfa.except(dfb).count
res13: Long = 0

It should return 90 instead of 0.

While the following statement works fine:

scala> dfa.except(dfb).rdd.count
res13: Long = 0

I guess the bug may be somewhere in DataFrame.count.


> dataframe except API returning wrong result in spark 1.5.0
> -----------------------------------------------------------
>
>                 Key: SPARK-16316
>                 URL: https://issues.apache.org/jira/browse/SPARK-16316
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Jacky Li
>
> Version: Spark 1.5.0
> Use case: use the except API to do a subtract between two DataFrames
>
> scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
> dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]
>
> scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
> dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]
>
> scala> dfa.except(dfb).count
> res13: Long = 0
>
> It should return 90 instead of 0.
>
> While the following statement works fine:
>
> scala> dfa.except(dfb).rdd.count
> res13: Long = 90
>
> I guess the bug may be somewhere in DataFrame.count.
[jira] [Updated] (SPARK-16316) dataframe except API returning wrong result in spark 1.5.0
[ https://issues.apache.org/jira/browse/SPARK-16316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacky Li updated SPARK-16316:
-----------------------------
    Description: 
Version: Spark 1.5.0
Use case: use the except API to do a subtract between two DataFrames

scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> dfa.except(dfb).count
res13: Long = 0

It should return 90 instead of 0.

While the following statement works fine:

scala> dfa.except(dfb).rdd.count
res13: Long = 0

I guess the bug may be somewhere in DataFrame.count.

  was:
Version: Spark 1.5.0
Use case: use the except API to do a subtract between two DataFrames

scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> dfa.except(dfb).count
res13: Long = 0

It should return 90 instead of 0.


> dataframe except API returning wrong result in spark 1.5.0
> -----------------------------------------------------------
>
>                 Key: SPARK-16316
>                 URL: https://issues.apache.org/jira/browse/SPARK-16316
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Jacky Li
>
> Version: Spark 1.5.0
> Use case: use the except API to do a subtract between two DataFrames
>
> scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
> dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]
>
> scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
> dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]
>
> scala> dfa.except(dfb).count
> res13: Long = 0
>
> It should return 90 instead of 0.
>
> While the following statement works fine:
>
> scala> dfa.except(dfb).rdd.count
> res13: Long = 0
>
> I guess the bug may be somewhere in DataFrame.count.
[jira] [Created] (SPARK-16316) dataframe except API returning wrong result in spark 1.5.0
Jacky Li created SPARK-16316:
---------------------------------

             Summary: dataframe except API returning wrong result in spark 1.5.0
                 Key: SPARK-16316
                 URL: https://issues.apache.org/jira/browse/SPARK-16316
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.0
            Reporter: Jacky Li


Version: Spark 1.5.0
Use case: use the except API to do a subtract between two DataFrames

scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> dfa.except(dfb).count
res13: Long = 0

It should return 90 instead of 0.
[jira] [Created] (SPARK-6007) Add numRows param in DataFrame.show
Jacky Li created SPARK-6007:
---------------------------------

             Summary: Add numRows param in DataFrame.show
                 Key: SPARK-6007
                 URL: https://issues.apache.org/jira/browse/SPARK-6007
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.2.1
            Reporter: Jacky Li
            Priority: Minor
             Fix For: 1.3.0


Currently, DataFrame.show only shows 20 rows. It would be useful if the user could decide how many rows to show by passing the number as a parameter to show().
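A minimal sketch of the proposed usage from the spark-shell (the sample data is made up; the point is only the optional row-count argument):

{code:scala}
// Build a small DataFrame; in the spark-shell, the sqlContext implicits provide toDF.
val df = sc.parallelize(1 to 100).map(x => (x, x * x)).toDF("n", "square")

df.show()    // current behavior: prints the first 20 rows
df.show(5)   // proposed: the caller chooses how many rows to print
{code}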
[jira] [Created] (SPARK-5939) Make FPGrowth example app take parameters
Jacky Li created SPARK-5939:
---------------------------------

             Summary: Make FPGrowth example app take parameters
                 Key: SPARK-5939
                 URL: https://issues.apache.org/jira/browse/SPARK-5939
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.2.1
            Reporter: Jacky Li
            Priority: Minor
             Fix For: 1.3.0
[jira] [Created] (SPARK-4787) Resource unreleased during failure in SparkContext initialization
Jacky Li created SPARK-4787:
---------------------------------

             Summary: Resource unreleased during failure in SparkContext initialization
                 Key: SPARK-4787
                 URL: https://issues.apache.org/jira/browse/SPARK-4787
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.1.0
            Reporter: Jacky Li
             Fix For: 1.3.0


When a client creates a SparkContext, there are currently many vals to initialize during object construction. When initialization of one of these vals fails, for example by throwing an exception, the resources already acquired by this SparkContext are not released properly.

For example, the SparkUI object is created and bound to the HTTP server during initialization using {{ui.foreach(_.bind())}}, but if anything goes wrong after this code (say, an exception is thrown while creating the DAGScheduler), the SparkUI server is not stopped, so the port bind will fail again when the client creates another SparkContext. Basically this leads to a situation where the client cannot create another SparkContext in the same process, which I think is not reasonable.

So I suggest refactoring the SparkContext code to release resources when there is a failure during initialization.
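A minimal sketch of the cleanup pattern the report is asking for; the stub classes below stand in for SparkUI, DAGScheduler, and the other resources acquired during construction, and are illustrative rather than the actual SparkContext internals:

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Stand-ins for SparkUI, DAGScheduler, etc.
class HttpServerStub { def bind(): Unit = (); def stop(): Unit = println("ui stopped") }
class SchedulerStub  { def stop(): Unit = println("scheduler stopped") }

class SafeContext {
  // Remember a release action for everything successfully acquired so far.
  private val cleanups = ArrayBuffer.empty[() => Unit]

  try {
    val ui = new HttpServerStub
    ui.bind()
    cleanups += (() => ui.stop())

    val scheduler = new SchedulerStub     // suppose constructing this throws...
    cleanups += (() => scheduler.stop())
  } catch {
    case e: Throwable =>
      cleanups.reverse.foreach(_.apply()) // ...then undo whatever was already bound
      throw e
  }
}
{code}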
[jira] [Updated] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacky Li updated SPARK-4001:
----------------------------
    Attachment: Distributed frequent item mining algorithm based on Spark.pptx

[~mengxr] please check the attached file. We have tested it using an open data set from http://fimi.ua.ac.be/data/. Currently our test cluster is small (4 nodes); we will test it on a larger cluster later, if required.

> Add Apriori algorithm to Spark MLlib
> ------------------------------------
>
>                 Key: SPARK-4001
>                 URL: https://issues.apache.org/jira/browse/SPARK-4001
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Jacky Li
>            Assignee: Jacky Li
>         Attachments: Distributed frequent item mining algorithm based on Spark.pptx
>
> Apriori is the classic algorithm for frequent itemset mining in a transactional data set. It would be useful if the Apriori algorithm were added to MLlib in Spark.
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236501#comment-14236501 ]

Jacky Li commented on SPARK-4001:
---------------------------------

Sure, Xiangrui. I will update it next Monday when I am in the office.

> Add Apriori algorithm to Spark MLlib
> ------------------------------------
>
>                 Key: SPARK-4001
>                 URL: https://issues.apache.org/jira/browse/SPARK-4001
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Jacky Li
>            Assignee: Jacky Li
>
> Apriori is the classic algorithm for frequent itemset mining in a transactional data set. It would be useful if the Apriori algorithm were added to MLlib in Spark.
[jira] [Created] (SPARK-4699) Make caseSensitive configurable in Analyzer.scala
Jacky Li created SPARK-4699:
---------------------------------

             Summary: Make caseSensitive configurable in Analyzer.scala
                 Key: SPARK-4699
                 URL: https://issues.apache.org/jira/browse/SPARK-4699
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: Jacky Li
             Fix For: 1.2.0


Currently, case sensitivity is enabled (true) by default in the Analyzer. It should be configurable through SQLConf in the client application.
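A sketch of what the requested knob could look like from the client side, written against the later 1.3-style DataFrame API for brevity; the conf key name is hypothetical for the time of this issue, and the sample table is made up:

{code:scala}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

sc.parallelize(Seq(("alice", 30), ("bob", 25)))
  .toDF("name", "age")
  .registerTempTable("people")

// Hypothetical SQLConf setting to disable case-sensitive analysis.
sqlContext.setConf("spark.sql.caseSensitive", "false")

// With case sensitivity disabled, NAME and name should resolve to the same column.
sqlContext.sql("SELECT NAME FROM people").show()
{code}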
[jira] [Created] (SPARK-4639) Pass maxIterations in as a parameter in Analyzer
Jacky Li created SPARK-4639:
---------------------------------

             Summary: Pass maxIterations in as a parameter in Analyzer
                 Key: SPARK-4639
                 URL: https://issues.apache.org/jira/browse/SPARK-4639
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: Jacky Li
            Priority: Minor
             Fix For: 1.3.0


Fix a TODO in Analyzer:

// TODO: pass this in as a parameter
val fixedPoint = FixedPoint(100)
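A minimal sketch of the change the TODO suggests, using simplified stand-ins for Catalyst's Analyzer and FixedPoint rather than the actual classes:

{code:scala}
// Simplified stand-in illustrating the fix: the fixed-point iteration limit
// becomes a constructor parameter instead of the hard-coded literal 100.
case class FixedPoint(maxIterations: Int)

class Analyzer(caseSensitive: Boolean, maxIterations: Int = 100) {
  // Previously: val fixedPoint = FixedPoint(100)  // TODO: pass this in as a parameter
  val fixedPoint = FixedPoint(maxIterations)
}

// A caller (e.g. SQLContext) can now tune how many resolution passes are allowed.
val analyzer = new Analyzer(caseSensitive = true, maxIterations = 50)
println(analyzer.fixedPoint)
{code}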
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226562#comment-14226562 ]

Jacky Li commented on SPARK-4001:
---------------------------------

Thanks for your suggestion, Daniel. Here is the current status.

1. Currently I have implemented Apriori and FP-Growth by referring to YAFIM (http://pasa-bigdata.nju.edu.cn/people/ronggu/pub/YAFIM_ParLearning.pdf) and PFP (http://dl.acm.org/citation.cfm?id=1454027). For Apriori, there are currently two versions implemented, one using a broadcast variable and another using a Cartesian join of two RDDs. I am testing them with the mushroom and webdoc open data sets (http://fimi.ua.ac.be/data/) to check their performance before deciding which one to contribute to MLlib. I have updated the code in the PR (https://github.com/apache/spark/pull/2847); you are welcome to check it and try it in your use case.

2. For the input part, the Apriori algorithm currently takes {{RDD[Array[String]]}} as the input data set, without a basket_id or user_id. I am not sure whether it can easily fit into your use case. Can you give more detail on how you want to use it in collaborative filtering contexts?

> Add Apriori algorithm to Spark MLlib
> ------------------------------------
>
>                 Key: SPARK-4001
>                 URL: https://issues.apache.org/jira/browse/SPARK-4001
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Jacky Li
>            Assignee: Jacky Li
>
> Apriori is the classic algorithm for frequent itemset mining in a transactional data set. It would be useful if the Apriori algorithm were added to MLlib in Spark.
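For illustration, a rough Scala sketch of the input contract described above; the object and method names, and the single counting pass shown, are assumptions for this comment rather than the code in the PR:

{code:scala}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Each record is one transaction: an Array[String] of item names,
// with no basket_id or user_id column.
object AprioriSketch {
  def frequentSingletons(transactions: RDD[Array[String]],
                         minSupport: Double): Array[(String, Long)] = {
    val minCount = math.ceil(minSupport * transactions.count()).toLong

    // First Apriori pass only: frequent 1-itemsets. Longer itemsets would be
    // generated iteratively from these survivors.
    transactions
      .flatMap(_.distinct.map(item => (item, 1L)))
      .reduceByKey(_ + _)
      .filter { case (_, count) => count >= minCount }
      .collect()
  }
}
{code}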
[jira] [Created] (SPARK-4269) Make wait time in BroadcastHashJoin configurable
Jacky Li created SPARK-4269:
---------------------------------

             Summary: Make wait time in BroadcastHashJoin configurable
                 Key: SPARK-4269
                 URL: https://issues.apache.org/jira/browse/SPARK-4269
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: Jacky Li
             Fix For: 1.2.0


In BroadcastHashJoin, a hard-coded value (5 minutes) is currently used to wait for the execution and broadcast of the small table. In my opinion, it should be a configurable value, since the broadcast may take longer than 5 minutes in some cases, for example in a busy/congested network environment.
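A sketch of the requested change, assuming the spark-shell's sqlContext is in scope; the conf key name is hypothetical, and the future below is only a stand-in for the small-table broadcast that BroadcastHashJoin awaits:

{code:scala}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Read the wait time from SQLConf (hypothetical key) instead of hard-coding 5 minutes.
val timeoutSeconds = sqlContext.getConf("spark.sql.broadcastTimeout", "300").toInt

// Stand-in for the broadcast of the small table.
val broadcastFuture: Future[String] = Future { "broadcasted small table" }
val smallTable = Await.result(broadcastFuture, timeoutSeconds.seconds)
{code}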
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178200#comment-14178200 ]

Jacky Li commented on SPARK-4001:
---------------------------------

I accidentally used wangfei (my colleague)'s account to send the last comment, sorry for the inconvenience.

Thanks @Sean Owen for explaining! Frequent itemset algorithms work by scanning the input data set; there is no probabilistic model involved.

To answer @Xiangrui Meng's earlier questions:

1. These algorithms are used for finding major patterns / association rules in a data set. As a real use case, some analytic applications in the telecom domain use them to find subscriber behavior from a data set combining service records, network traffic records, and demographic data. Please refer to this Chinese article for an example: http://www.ss-lw.com/wxxw-361.html. Sometimes we also use frequent itemset algorithms to prepare feature input for other algorithms that select features and perform other ML tasks such as training a classifier, as in this paper: http://dl.acm.org/citation.cfm?id=1401922.

2. Since Apriori is a basic algorithm for frequent itemset mining, I am not aware of any parallel implementation of it. But I think the algorithm fits Spark's data-parallel model since it only needs to scan the input data set. For FP-Growth, I do know there is a Parallel FP-Growth from Haoyuan Li: http://dl.acm.org/citation.cfm?id=1454027. I will probably refer to this paper to implement FP-Growth in Spark.

3. The Apriori computation complexity is about O(N*k), where N is the number of items in the input data and k is the depth of the frequent item tree to search. FP-Growth complexity is about O(N), so it is more efficient than Apriori. For space efficiency, FP-Growth is also more efficient than Apriori. However, for smaller data with many frequent itemsets, Apriori can be more efficient, because FP-Growth needs to construct an FP-Tree out of the input data set, which takes some time. Another advantage of Apriori is that it can output association rules, while FP-Growth cannot.

Although these two algorithms are basic ones (FP-Growth is more complex), I think it would be handy if MLlib could include them, since there is no frequent itemset mining algorithm in Spark yet, especially for a distributed environment. Please suggest how to handle this issue. Thanks a lot.

> Add Apriori algorithm to Spark MLlib
> ------------------------------------
>
>                 Key: SPARK-4001
>                 URL: https://issues.apache.org/jira/browse/SPARK-4001
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Jacky Li
>            Assignee: Jacky Li
>
> Apriori is the classic algorithm for frequent itemset mining in a transactional data set. It would be useful if the Apriori algorithm were added to MLlib in Spark.
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177643#comment-14177643 ]

Jacky Li commented on SPARK-4001:
---------------------------------

Maybe there is a misunderstanding: by "transactional data set" I do not mean using it in a database. Apriori is the classic algorithm for frequent itemset mining, and the frequent itemsets determined by Apriori can be used to derive association rules, which highlight general trends in the data set. This has applications in domains such as market basket analysis for cross-selling and up-selling.

I am not sure whether there is a frequent itemset algorithm in MLlib, so I am actually implementing both Apriori and FP-Growth (the latter is not included in this JIRA thread). Would it be useful to contribute them to MLlib?

> Add Apriori algorithm to Spark MLlib
> ------------------------------------
>
>                 Key: SPARK-4001
>                 URL: https://issues.apache.org/jira/browse/SPARK-4001
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Jacky Li
>            Assignee: Jacky Li
>
> Apriori is the classic algorithm for frequent itemset mining in a transactional data set. It would be useful if the Apriori algorithm were added to MLlib in Spark.
[jira] [Created] (SPARK-4001) Add Apriori algorithm to Spark MLlib
Jacky Li created SPARK-4001:
---------------------------------

             Summary: Add Apriori algorithm to Spark MLlib
                 Key: SPARK-4001
                 URL: https://issues.apache.org/jira/browse/SPARK-4001
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
    Affects Versions: 1.1.0
            Reporter: Jacky Li


Apriori is the classic algorithm for frequent itemset mining in a transactional data set. It would be useful if the Apriori algorithm were added to MLlib in Spark.