[jira] [Issue Comment Deleted] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-5682: Comment: was deleted (was: Steps were added to encode and decode the data, so the performance will not be faster than before; at the same time, the code also has a security issue, for example saving the plain text in a configuration file that is finally used as part of the key. If you use a better cipher solution, the performance downgrade will be minimized. I think AES is a bit heavy. At the same time, the feature is based on hadoop 2.6, which is a limitation; that is why I said it relies on hadoop. Though the API is public and stable, you cannot ensure that the API will not change, since it is not commercial software.)
> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle
> Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx
>
> Encrypted shuffle was enabled in hadoop 2.6, which makes the shuffle data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data; there are five common modes of operation, and CTR is one of them. We use the two codecs JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, which are also used in hadoop encrypted shuffle, to enable spark encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides.
> Because ugi credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
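The codecs above implement AES in CTR mode, whose defining property is that encryption and decryption are the same operation: XOR the data with a keystream derived from an incrementing counter. A minimal stdlib Python sketch of that idea follows; SHA-256 stands in for the AES block function here, and none of these names are the actual Hadoop/Spark codec API.

```python
import hashlib

def ctr_keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Generate a keystream by hashing key || nonce || counter blocks.
    (SHA-256 is a stand-in for the AES block cipher used by the real codecs.)"""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        out.extend(block)
        counter += 1
    return bytes(out[:length])

def ctr_xcrypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: CTR mode is symmetric (XOR with the keystream)."""
    ks = ctr_keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

shuffle_block = b"partition-0 records..."
ct = ctr_xcrypt(b"secret-key", b"nonce123", shuffle_block)
assert ct != shuffle_block
assert ctr_xcrypt(b"secret-key", b"nonce123", ct) == shuffle_block
```

Because the same function both encrypts and decrypts, CTR adds only keystream generation plus one XOR per byte; this is why the choice of block-cipher backend (JDK vs. hardware-accelerated OpenSSL) largely determines the overhead being debated in these comments.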
[jira] [Issue Comment Deleted] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-5682: Comment: was deleted (was: The solution relies on the hadoop API and may downgrade performance. The AES algorithm is used for block data encryption in many cases. I think RC4 could be used to encode the stream, or a simple solution with an authentication header could be used.)
[jira] [Issue Comment Deleted] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-5682: Comment: was deleted (was: Since the encrypted shuffle in spark focuses on the common module, it may not be good to use the hadoop API. On the other hand, the AES solution is a bit heavy for encoding/decoding live streaming data.)
[jira] [Commented] (SPARK-15174) DataFrame does not have correct number of rows after dropDuplicates
[ https://issues.apache.org/jira/browse/SPARK-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281296#comment-15281296 ] hujiayin commented on SPARK-15174: [~ihalsema] I just tested this case in my environment with the latest Spark. It gives the error message "Unable to infer schema for JSON at hdfs:///xxx/xx" if your directory is empty; if your file is empty, read.json also throws that error. If your file just contains "{}", then df.count = 1 and df.rdd.isEmpty = false in both cases. So I think read.json handles the exceptional cases and the bug is not there.
> DataFrame does not have correct number of rows after dropDuplicates
> --
>
> Key: SPARK-15174
> URL: https://issues.apache.org/jira/browse/SPARK-15174
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.1
> Reporter: Ian Hellstrom
>
> If you read an empty file/folder with the {{SQLContext.read()}} function and call {{DataFrame.dropDuplicates()}}, the number of rows is incorrect.
> {code}
> val input = "hdfs:///some/empty/directory"
> val df1 = sqlContext.read.json(input)
> val df2 = sqlContext.read.json(input).dropDuplicates
> df1.count == 0 // true
> df1.rdd.isEmpty // true
> df2.count == 0 // false: it's actually reported as 1
> df2.rdd.isEmpty // false
> {code}
[jira] [Issue Comment Deleted] (SPARK-15174) DataFrame does not have correct number of rows after dropDuplicates
[ https://issues.apache.org/jira/browse/SPARK-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-15174: - Comment: was deleted (was: I could fix it.)
[jira] [Comment Edited] (SPARK-15174) DataFrame does not have correct number of rows after dropDuplicates
[ https://issues.apache.org/jira/browse/SPARK-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273805#comment-15273805 ] hujiayin edited comment on SPARK-15174 at 5/12/16 4:09 AM: I could fix it. was (Author: hujiayin): @Ian Hellstrom, I think the issue is caused by "if (groupingExpressions.isEmpty) { Statistics(sizeInBytes = 1) }" in basicLogicOperators.scala. I could fix it for you.
[jira] [Updated] (SPARK-14623) add label binarizer
[ https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-14623: - Description:
It relates to https://issues.apache.org/jira/browse/SPARK-7445
Map the labels to 0/1. For example,
Input: "yellow,green,red,green,0"
The labels: "0, green, red, yellow"
Output:
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
Refer to http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
was:
It relates to https://issues.apache.org/jira/browse/SPARK-7445
Map the labels to 0/1. For example,
Input: "yellow,green,red,green,0"
The labels: "0, green, red, yellow"
Output:
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
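The mapping above can be sketched in a few lines of plain Python (an illustration of the behavior, not the proposed Spark ML API):

```python
def label_binarize(values):
    """One-hot encode each input value against the sorted set of distinct labels."""
    labels = sorted(set(values))             # e.g. ['0', 'green', 'red', 'yellow']
    index = {lab: i for i, lab in enumerate(labels)}
    rows = []
    for v in values:
        row = [0] * len(labels)              # one column per distinct label
        row[index[v]] = 1                    # mark the column for this value
        rows.append(row)                     # rows preserve input positions
    return labels, rows

labels, rows = label_binarize("yellow,green,red,green,0".split(","))
# labels == ['0', 'green', 'red', 'yellow']
# rows[0] == [0, 0, 0, 1] because 'yellow' is the last label
```

Each output row corresponds to one input position, so the two occurrences of "green" (positions 1 and 3) produce identical rows, matching the example in the description.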
[jira] [Commented] (SPARK-15174) DataFrame does not have correct number of rows after dropDuplicates
[ https://issues.apache.org/jira/browse/SPARK-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273805#comment-15273805 ] hujiayin commented on SPARK-15174: @Ian Hellstrom, I think the issue is caused by "if (groupingExpressions.isEmpty) { Statistics(sizeInBytes = 1) }" in basicLogicOperators.scala. I could fix it for you.
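For context on why that branch exists: in SQL semantics a global aggregate with no grouping keys always returns exactly one row, even over empty input, which is what `Statistics(sizeInBytes = 1)` models; `dropDuplicates`, however, behaves like a grouped aggregate (group by all columns) and must return zero rows on empty input. A toy Python model of the distinction (an illustration of the semantics, not Spark internals):

```python
def aggregate(rows, grouping_key=None, agg=len):
    """Toy model of SQL aggregation semantics."""
    if grouping_key is None:
        # A global aggregate always yields exactly one row, even over empty
        # input -- e.g. SELECT COUNT(*) on an empty table returns one row.
        return [agg(rows)]
    groups = {}
    for r in rows:
        groups.setdefault(grouping_key(r), []).append(r)
    return [agg(g) for g in groups.values()]  # zero rows for empty input

assert aggregate([]) == [0]                           # global: one row
assert aggregate([], grouping_key=lambda r: r) == []  # grouped: no rows

# dropDuplicates groups by all columns and keeps one row per group,
# so an empty input must produce an empty result:
assert aggregate([], grouping_key=lambda r: r, agg=lambda g: g[0]) == []
```

Treating the distinct/deduplicate case like the no-grouping-keys case is exactly the confusion the quoted condition would introduce.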
[jira] [Comment Edited] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala
[ https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270161#comment-15270161 ] hujiayin edited comment on SPARK-14772 at 5/4/16 6:04 AM: @holdenk, I had code for this issue but was busy with another project over the past few days. I have just started to look into pyspark and look forward to your comments. was (Author: hujiayin): @holdenk, I have code for this issue but was busy with another project over the past few days. I have just started to look into pyspark and look forward to your comments.
> Python ML Params.copy treats uid, paramMaps differently than Scala
> --
>
> Key: SPARK-14772
> URL: https://issues.apache.org/jira/browse/SPARK-14772
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Reporter: Joseph K. Bradley
>
> In PySpark, {{ml.param.Params.copy}} does not quite match the Scala implementation:
> * It does not copy the UID
> * It does not respect the difference between defaultParamMap and paramMap. This is an issue with {{_copyValues}}.
[jira] [Commented] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala
[ https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270161#comment-15270161 ] hujiayin commented on SPARK-14772: @holdenk, I have code for this issue but was busy with another project over the past few days. I have just started to look into pyspark and look forward to your comments.
[jira] [Issue Comment Deleted] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala
[ https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-14772: - Comment: was deleted (was: I can submit a fix for this issue and I'm testing it.)
[jira] [Commented] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala
[ https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253035#comment-15253035 ] hujiayin commented on SPARK-14772: I can submit a fix for this issue and I'm testing it.
[jira] [Comment Edited] (SPARK-14712) spark.ml LogisticRegressionModel.toString should summarize model
[ https://issues.apache.org/jira/browse/SPARK-14712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247141#comment-15247141 ] hujiayin edited comment on SPARK-14712 at 4/19/16 3:59 AM: Hi Gayathri, I think self has numFeatures and numClasses defined, and I can submit a patch for this issue. was (Author: hujiayin): Hi Murali, I think self has numFeatures and numClasses defined, and I can submit a patch for this issue.
> spark.ml LogisticRegressionModel.toString should summarize model
> --
>
> Key: SPARK-14712
> URL: https://issues.apache.org/jira/browse/SPARK-14712
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Priority: Trivial
> Labels: starter
>
> spark.mllib LogisticRegressionModel overrides toString to print a little model info. We should do the same in spark.ml. I'd recommend:
> * super.toString
> * numClasses
> * numFeatures
> We should also override {{__repr__}} in pyspark to do the same.
[jira] [Commented] (SPARK-14712) spark.ml LogisticRegressionModel.toString should summarize model
[ https://issues.apache.org/jira/browse/SPARK-14712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247141#comment-15247141 ] hujiayin commented on SPARK-14712: Hi Murali, I think self has numFeatures and numClasses defined, and I can submit a patch for this issue.
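The proposal amounts to overriding the string representation to include the class/feature counts. A hypothetical Python sketch of the `__repr__` side (the class name, uid format, and attributes here are illustrative, not the actual PySpark implementation):

```python
class ModelSummaryRepr:
    """Hypothetical model object whose repr summarizes it, in the spirit of
    the proposed super.toString + numClasses + numFeatures summary."""

    def __init__(self, uid, num_classes, num_features):
        self.uid = uid
        self.numClasses = num_classes
        self.numFeatures = num_features

    def __repr__(self):
        # One-line summary instead of the default <object at 0x...> form.
        return (f"LogisticRegressionModel: uid={self.uid}, "
                f"numClasses={self.numClasses}, numFeatures={self.numFeatures}")

m = ModelSummaryRepr("logreg_4a5b", 2, 30)
print(repr(m))
```

Since `self` already carries `numClasses` and `numFeatures` (as the comment notes), the override only formats existing state and needs no new computation.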
[jira] [Comment Edited] (SPARK-14623) add label binarizer
[ https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244061#comment-15244061 ] hujiayin edited comment on SPARK-14623 at 4/16/16 8:24 AM: Hi Joseph, I think it is similar to combining StringIndexer + OneHotEncoder into one class, but the difference is that LabelBinarizer will collect the same element into one vector and remember the position of the element in the input. For example,
Input is "yellow,green,red,green,0"
LabelBinarizer retrieves the labels from the input, and the labels are "0, green, red, yellow"
Output is
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
The second column reflects that the element "green" appears at positions 1 and 3 in the input. The four columns reflect the four labels: column 0 represents label "0", column 1 label "green", and so on. If I understand correctly, StringIndexer returns the category number of a label and OneHotEncoder returns the single-high-1 binary representation of the category number.
was (Author: hujiayin): Hi Joseph, I think it is similar to combining StringIndexer + OneHotEncoder into one class, but the difference is that LabelBinarizer will collect the same element into one vector and remember the position of the element in the input. For example,
Input is "yellow,green,red,green,0"
LabelBinarizer retrieves the labels from the input, and the labels are "0, green, red, yellow"
Output is
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
The second column reflects that the element "green" appears at positions 1 and 3 in the input. The four columns reflect the four labels: column 0 represents label "0", column 1 label "green", and so on. If I understand correctly, StringIndexer returns the category number of a label and OneHotEncoder returns the binary representation of the category number.
[jira] [Commented] (SPARK-14623) add label binarizer
[ https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244061#comment-15244061 ] hujiayin commented on SPARK-14623: Hi Joseph, I think it is similar to combining StringIndexer + OneHotEncoder into one class, but the difference is that LabelBinarizer will collect the same element into one vector and remember the position of the element in the input. For example,
Input is "yellow,green,red,green,0"
LabelBinarizer retrieves the labels from the input, and the labels are "0, green, red, yellow"
Output is
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
The second column reflects that the element "green" appears at positions 1 and 3 in the input. The four columns reflect the four labels: column 0 represents label "0", column 1 label "green", and so on. If I understand correctly, StringIndexer returns the category number of a label and OneHotEncoder returns the binary representation of the category number.
[jira] [Commented] (SPARK-14523) Feature parity for Statistics ML with MLlib
[ https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242661#comment-15242661 ] hujiayin commented on SPARK-14523: We can add ARIMA to Spark.
> Feature parity for Statistics ML with MLlib
> --
>
> Key: SPARK-14523
> URL: https://issues.apache.org/jira/browse/SPARK-14523
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: yuhao yang
>
> Some statistics functions have been supported by DataFrame directly. Use this jira to discuss/design the statistics package in Spark.ML and its function scope. Hypothesis test and correlation computation may still need to expose independent interfaces.
[jira] [Updated] (SPARK-14623) add label binarizer
[ https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-14623: - Description:
It relates to https://issues.apache.org/jira/browse/SPARK-7445
Map the labels to 0/1. For example,
Input: "yellow,green,red,green,0"
The labels: "0, green, red, yellow"
Output:
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
was:
It relates to https://issues.apache.org/jira/browse/SPARK-7445
Map the labels to 0/1. For example,
Input: "yellow,green,red,green,0"
The labels: "0, green, red, yellow"
Output: 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0
[jira] [Commented] (SPARK-14622) Retain lost executors status
[ https://issues.apache.org/jira/browse/SPARK-14622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240550#comment-15240550 ] hujiayin commented on SPARK-14622: I think it is also better to show the number of lost executors; clicking the number would then reveal the detailed information.
> Retain lost executors status
> --
>
> Key: SPARK-14622
> URL: https://issues.apache.org/jira/browse/SPARK-14622
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 1.6.0
> Reporter: Qingyang Hong
> Priority: Minor
> Fix For: 1.6.0
>
> In the 'executors' dashboard, it is necessary to maintain a list of those executors that have been lost.
[jira] [Created] (SPARK-14623) add label binarizer
hujiayin created SPARK-14623: Summary: add label binarizer Key: SPARK-14623 URL: https://issues.apache.org/jira/browse/SPARK-14623 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.6.1 Reporter: hujiayin Priority: Minor Fix For: 2.0.0 It relates to https://issues.apache.org/jira/browse/SPARK-7445 Map the labels to 0/1. For example, Input: "yellow,green,red,green,0" The labels: "0, green, red, yellow" Output: 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0
[jira] [Issue Comment Deleted] (SPARK-7445) StringIndexer should handle binary labels properly
[ https://issues.apache.org/jira/browse/SPARK-7445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-7445: Comment: was deleted (was: If no one works on it, I'd like to submit a patch for this issue.)
> StringIndexer should handle binary labels properly
> --
>
> Key: SPARK-7445
> URL: https://issues.apache.org/jira/browse/SPARK-7445
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.4.0
> Reporter: Xiangrui Meng
> Priority: Minor
>
> StringIndexer orders labels by their counts. However, for binary labels, we should really map negatives to 0 and positives to 1. So we can put in special rules for binary labels:
> 1. "+1"/"-1", "1"/"-1", "1"/"0"
> 2. "yes"/"no"
> 3. "true"/"false"
> Another option is to allow users to provide a list of labels, and we use that ordering.
[jira] [Commented] (SPARK-7445) StringIndexer should handle binary labels properly
[ https://issues.apache.org/jira/browse/SPARK-7445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238799#comment-15238799 ] hujiayin commented on SPARK-7445: If no one works on it, I'd like to submit a patch for this issue.
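The special-case rules proposed in the issue can be sketched as follows (plain Python with a frequency-ordering fallback mimicking StringIndexer's default behavior; the function and pair names are illustrative, not the actual API):

```python
from collections import Counter

# Conventional (negative, positive) pairs from the issue description.
BINARY_PAIRS = [("-1", "+1"), ("-1", "1"), ("0", "1"),
                ("no", "yes"), ("false", "true")]

def index_binary_labels(values):
    """Return a label -> index mapping. When the distinct labels form a
    known binary pair, map the negative to 0 and the positive to 1;
    otherwise fall back to frequency ordering (most frequent -> 0)."""
    distinct = set(v.lower() for v in values)
    for neg, pos in BINARY_PAIRS:
        if distinct == {neg, pos}:
            return {neg: 0, pos: 1}
    # Default StringIndexer-like ordering by descending frequency.
    freq = Counter(v.lower() for v in values)
    return {lab: i for i, (lab, _) in enumerate(freq.most_common())}

assert index_binary_labels(["yes", "no", "yes"]) == {"no": 0, "yes": 1}
assert index_binary_labels(["1", "-1", "-1"]) == {"-1": 0, "1": 1}
```

With frequency ordering alone, `["yes", "no", "yes"]` would map "yes" to 0, which is exactly the surprise for binary labels the issue wants to avoid.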
[jira] [Issue Comment Deleted] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-4036: Comment: was deleted (was: latest CRF codes)
> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> --
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Guoqiang Li
> Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf, crf-spark.zip, dig-hair-eye-train.model, features.hair-eye, sample-input, sample-output
>
> Conditional random fields (CRFs) are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction.
> The paper: http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf
[jira] [Updated] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-4036: Attachment: crf-spark.zip latest CRF codes
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059266#comment-15059266 ] hujiayin commented on SPARK-4036: - Hi Andrew, With your latest template file sent to me, I think your case can run with the CRF. However, what are the labels in your sample-input? Since CRF does structured prediction, one needs to label the features to train the model. Thanks
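To make the labeling requirement above concrete: CRF training input pairs each token's observed features with a gold label, and the model learns to predict the whole label sequence. A minimal Scala sketch; the feature strings and label set are hypothetical, not the format of the sample-input attachment:

```scala
// Hypothetical labeled input for structured prediction: every token must
// carry a gold label, otherwise the CRF has nothing to train against.
case class Token(features: Seq[String], label: String)

object CrfInputSketch {
  def main(args: Array[String]): Unit = {
    val sentence = Seq(
      Token(Seq("word=John", "cap=true"), "PERSON"),
      Token(Seq("word=visits", "cap=false"), "O"),
      Token(Seq("word=Paris", "cap=true"), "LOCATION")
    )
    // Training requires that no token is left unlabeled.
    require(sentence.forall(_.label.nonEmpty))
    println(sentence.map(_.label).mkString(" ")) // PERSON O LOCATION
  }
}
```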
[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010353#comment-15010353 ] hujiayin edited comment on SPARK-4036 at 12/7/15 8:10 AM: -- Hi Sasaki, I'm not sure whether you are still working on this, as the JIRA is still open. If you have a PR, you could close mine: https://github.com/apache/spark/pull/9794 Besides Sasaki's design, I referenced these documents for the implementation: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers http://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf was (Author: hujiayin): Hi Sasaki, I'm not sure if you worked on it as the jira is still open. If you have a PR, you could close my PR https://github.com/apache/spark/pull/9794 I referenced the other document besides Sasaki's design for the implementation. http://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044223#comment-15044223 ] hujiayin commented on SPARK-4036: - Hi Andrew, The code is implemented in Scala and integrated with Spark. I tested it after I implemented it, and also verified it against some papers listed in the code and in this JIRA. Could you send me your features and models so that I can do further testing?
[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044223#comment-15044223 ] hujiayin edited comment on SPARK-4036 at 12/7/15 1:00 AM: -- Hi Andrew, The code is implemented in Scala and integrated with Spark. I tested it after I implemented it, and also verified it against some papers listed in the code and in this JIRA. Could you send me your features and models so that I can do further testing? was (Author: hujiayin): Hi Andrew, The code is implemented by Scala and integrated with Spark. I tested it after I implemented it. I also verify it with some papers listed inside the code and this jira. Could you send me your features and models that I can have further testing?
[jira] [Issue Comment Deleted] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-4036: Comment: was deleted (was: ok)
[jira] [Issue Comment Deleted] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-4036: Comment: was deleted (was: ok)
[jira] [Issue Comment Deleted] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-4036: Comment: was deleted (was: ok)
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013095#comment-15013095 ] hujiayin commented on SPARK-4036: - ok
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013097#comment-15013097 ] hujiayin commented on SPARK-4036: - ok
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013098#comment-15013098 ] hujiayin commented on SPARK-4036: - ok
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013096#comment-15013096 ] hujiayin commented on SPARK-4036: - ok
[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010353#comment-15010353 ] hujiayin edited comment on SPARK-4036 at 11/19/15 7:09 AM: --- Hi Sasaki, I'm not sure whether you are still working on this, as the JIRA is still open. If you have a PR, you could close mine: https://github.com/apache/spark/pull/9794 Besides Sasaki's design, I referenced this document for the implementation: http://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf was (Author: hujiayin): Hi Sasaki, I'm not sure if you worked on it as the jira is still open. If you have a PR, you could close my PR https://github.com/apache/spark/pull/9794
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010353#comment-15010353 ] hujiayin commented on SPARK-4036: - Hi Sasaki, I'm not sure whether you are still working on this, as the JIRA is still open. If you have a PR, you could close mine: https://github.com/apache/spark/pull/9794
[jira] [Issue Comment Deleted] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
[ https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-11200: - Comment: was deleted (was: sparkscore found it had been happening since commit cf2e0ae7 and it was resolved today.) > NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed" > - > > Key: SPARK-11200 > URL: https://issues.apache.org/jira/browse/SPARK-11200 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: hujiayin > > The endless message "cannot send ${message} because RpcEnv is closed" pops up after starting any of the MLlib workloads until a person stops it manually. The environment is hadoop-cdh-5.3.2 with the Spark master branch run in yarn-client mode. The error comes from NettyRpcEnv.scala. I don't have enough time to look into this issue right now, but I can verify a fix in this environment with you if you have one.
[jira] [Commented] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
[ https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975993#comment-14975993 ] hujiayin commented on SPARK-11200: -- sparkscore found it had been happening since commit cf2e0ae7 and it was resolved today.
[jira] [Comment Edited] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
[ https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966119#comment-14966119 ] hujiayin edited comment on SPARK-11200 at 10/22/15 2:37 AM: Hi Sean, I don't think it relates to the environment, since version 1.5.1 works in the same environment. The error "cannot send ${message} because RpcEnv is closed" looped when a task was launched in yarn-client mode; ${message} contained specific IP address and port information. I guess the client cannot be created and the process is blocked. I'll look into it when I have some time. was (Author: hujiayin): Hi Sean, I think it doesn't relate to environment since version 1.4 works in the same environment. The error "cannot send ${message} because RpcEnv is closed" looped when launched a task from yarn client mode. ${message} was specific IP address and port related information. I guess the client cannot be created and the process was blocked. I'll look into it after I have sometime.
[jira] [Updated] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
[ https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-11200: - Affects Version/s: (was: 1.5.1) 1.6.0
[jira] [Commented] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
[ https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966391#comment-14966391 ] hujiayin commented on SPARK-11200: -- Hi Sean, I didn't close/reopen it. The issue has been there since last Friday. The error texts were different but all related to Netty RPC communication failures.
[jira] [Comment Edited] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
[ https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966119#comment-14966119 ] hujiayin edited comment on SPARK-11200 at 10/21/15 2:13 AM: Hi Sean, I don't think it relates to the environment, since version 1.4 works in the same environment. The error "cannot send ${message} because RpcEnv is closed" looped when a task was launched in yarn-client mode; ${message} contained specific IP address and port information. I guess the client cannot be created and the process is blocked. I'll look into it when I have some time. was (Author: hujiayin): Hi Sean, I think it doesn't relate to environment since version 1.4 works in the same enviroment. The error "cannot send ${message} because RpcEnv is closed" looped when launched a task from yarn client mode. ${message} was specific ip address and port related information. I guess the client cannot be created and the process was blocked. I'll look into it after I have sometime.
[jira] [Commented] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
[ https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966119#comment-14966119 ] hujiayin commented on SPARK-11200: -- Hi Sean, I don't think it relates to the environment, since version 1.4 works in the same environment. The error "cannot send ${message} because RpcEnv is closed" looped when a task was launched in yarn-client mode; ${message} contained specific IP address and port information. I guess the client cannot be created and the process is blocked. I'll look into it when I have some time.
[jira] [Updated] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
[ https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-11200: - Description: The endless messages "cannot send ${message}because RpcEnv is closed" are pop up after start any of workloads in MLlib until a manual stop from person. The environment is hadoop-cdh-5.3.2 Spark master version run in yarn client mode. The error is from NettyRpcEnv.scala. I don't have enough time to look into this issue at this time, but I can verify issue in environment with you if you have fix. was: The endless messages "cannot send ${message}because RpcEnv is closed" are pop up after start any of workloads in MLlib until a manually stop from person. The environment is hadoop-cdh-5.3.2 Spark master version run in yarn client mode. The error is from NettyRpcEnv.scala. I don't have enough time to look into this issue at this time, but I can verify issue in environment with you if you have fix.
[jira] [Updated] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
[ https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-11200: - Description: The endless messages "cannot send ${message}because RpcEnv is closed" are pop up after start any of workloads in MLlib until a manually stop from person. The environment is hadoop-cdh-5.3.2 Spark master version run in yarn client mode. The error is from NettyRpcEnv.scala. I don't have enough time to look into this issue at this time, but I can verify issue in environment with you if you have fix. was: The endless messages "cannot send ${message} because RpcEnv is closed" are pop up after start any of workloads in MLlib until a manually stop from person. The environment is hadoop-cdh-5.3.2 Spark master version run in yarn client mode. The error is from NettyRpcEnv.scala. I don't have enough time to look into this issue at this time, but I can verify issue in environment with you if you have fix.
[jira] [Created] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
hujiayin created SPARK-11200: Summary: NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed" Key: SPARK-11200 URL: https://issues.apache.org/jira/browse/SPARK-11200 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.1 Reporter: hujiayin The endless message "cannot send ${message} because RpcEnv is closed" pops up after starting any of the MLlib workloads until a person stops it manually. The environment is hadoop-cdh-5.3.2 with the Spark master branch run in yarn-client mode. The error comes from NettyRpcEnv.scala. I don't have enough time to look into this issue right now, but I can verify a fix in this environment with you if you have one.
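The failure mode in this report can be illustrated with a toy model (this is not Spark's actual NettyRpcEnv code; the class name and `stopped` flag are invented for illustration): once the RPC environment is stopped, every send fails, and a caller that retries unconditionally will emit the warning forever.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Toy stand-in for an RPC environment: after shutdown, every send fails and
// a warning is printed each time -- an unbounded retry loop around send()
// reproduces the "endless message" symptom from the report.
class ToyRpcEnv {
  private val stopped = new AtomicBoolean(false)
  def shutdown(): Unit = stopped.set(true)
  def send(message: String): Boolean = {
    if (stopped.get()) {
      println(s"cannot send $message because RpcEnv is closed")
      false
    } else true
  }
}

object ToyRpcEnvDemo {
  def main(args: Array[String]): Unit = {
    val env = new ToyRpcEnv
    println(env.send("heartbeat")) // true
    env.shutdown()
    // A caller should bound its retries (or check for shutdown) instead of
    // looping forever against a closed environment:
    var attempts = 0
    while (!env.send("heartbeat") && attempts < 3) attempts += 1
    println(attempts) // 3
  }
}
```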
[jira] [Issue Comment Deleted] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-6724: Comment: was deleted (was: ok, : )) > Model import/export for FPGrowth > > > Key: SPARK-6724 > URL: https://issues.apache.org/jira/browse/SPARK-6724 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Note: experimental model API
[jira] [Issue Comment Deleted] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient
[ https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-10329: - Comment: was deleted (was: ok, I will try to fix it today) > Cost RDD in k-means|| initialization is not storage-efficient > - > > Key: SPARK-10329 > URL: https://issues.apache.org/jira/browse/SPARK-10329 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Xiangrui Meng >Assignee: hujiayin > Labels: clustering > > Currently we use `RDD[Vector]` to store point cost during k-means|| > initialization, where each `Vector` has size `runs`. This is not > storage-efficient because `runs` is usually 1 and then each record is a > Vector of size 1. What we need is just the 8 bytes to store the cost, but we > introduce two objects (DenseVector and its values array), which could cost 16 > bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel > for reporting this issue! > There are several solutions: > 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per > record. > 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each > `Array[Double]` object covers 1024 instances, which could remove most of the > overhead. > Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs > kicking out the training dataset from memory.
[jira] [Commented] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient
[ https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721799#comment-14721799 ] hujiayin commented on SPARK-10329: -- ok, I will try to fix it today > Cost RDD in k-means|| initialization is not storage-efficient > - > > Key: SPARK-10329 > URL: https://issues.apache.org/jira/browse/SPARK-10329 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Xiangrui Meng >Assignee: hujiayin > Labels: clustering > > Currently we use `RDD[Vector]` to store point cost during k-means|| > initialization, where each `Vector` has size `runs`. This is not > storage-efficient because `runs` is usually 1 and then each record is a > Vector of size 1. What we need is just the 8 bytes to store the cost, but we > introduce two objects (DenseVector and its values array), which could cost 16 > bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel > for reporting this issue! > There are several solutions: > 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per > record. > 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each > `Array[Double]` object covers 1024 instances, which could remove most of the > overhead. > Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs > kicking out the training dataset from memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient
[ https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718039#comment-14718039 ] hujiayin commented on SPARK-10329: -- Hi Xiangrui, I'll try to fix it in 1.6. > Cost RDD in k-means|| initialization is not storage-efficient > - > > Key: SPARK-10329 > URL: https://issues.apache.org/jira/browse/SPARK-10329 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Xiangrui Meng > Labels: clustering > > Currently we use `RDD[Vector]` to store point cost during k-means|| > initialization, where each `Vector` has size `runs`. This is not > storage-efficient because `runs` is usually 1 and then each record is a > Vector of size 1. What we need is just the 8 bytes to store the cost, but we > introduce two objects (DenseVector and its values array), which could cost 16 > bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel > for reporting this issue! > There are several solutions: > 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per > record. > 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each > `Array[Double]` object covers 1024 instances, which could remove most of the > overhead. > Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs > kicking out the training dataset from memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
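The storage problem and the batching fix (option 2) described in SPARK-10329 above can be sketched in Scala. This is a minimal illustration, not the actual MLlib patch: the `batchCosts` helper name is an assumption, the `batchSize` default of 1024 is taken from the issue's example, and each partition is modeled as the `Iterator[Double]` that Spark's `mapPartitions` would hand to such a function.

```scala
// Option 2 from the issue: instead of one DenseVector per point (an object
// header plus a values array, roughly 16 bytes of overhead to hold a single
// 8-byte cost), pack the costs of up to `batchSize` points into one
// Array[Double], amortizing the per-object overhead across the whole batch.
def batchCosts(partition: Iterator[Double],
               batchSize: Int = 1024): Iterator[Array[Double]] =
  partition.grouped(batchSize).map(_.toArray)

// 2500 costs become three arrays of 1024, 1024, and 452 elements.
val batches = batchCosts((1 to 2500).map(_.toDouble).iterator).toList
```

In Spark this would run as something like `costs.mapPartitions(batchCosts(_)).persist(StorageLevel.MEMORY_AND_DISK)` (a hypothetical call, assuming `costs: RDD[Double]`), where choosing `MEMORY_AND_DISK` over `MEMORY_ONLY` keeps the cost RDD from evicting the cached training dataset, as the issue suggests.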
[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679818#comment-14679818 ] hujiayin commented on SPARK-5837: - The master doesn't have this problem. The other parts work well; only the web UI in the YARN view is broken in the 1.2~1.3 releases. I found a PR https://github.com/apache/spark/pull/2858 that tries to fix this issue and should be included since 1.2, but the problem still happens with 1.2 on my side. I think some other patches in 1.4 or 1.5 fixed this issue; do you know which patch? > HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode > -- > > Key: SPARK-5837 > URL: https://issues.apache.org/jira/browse/SPARK-5837 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0, 1.2.1 >Reporter: Marco Capuccini > > Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the > Spark UI if I run over yarn (version 2.4.0): > HTTP ERROR 500 > Problem accessing /proxy/application_1423564210894_0017/. 
Reason: > Connection refused > Caused by: > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:579) > at java.net.Socket.connect(Socket.java:528) > at java.net.Socket.<init>(Socket.java:425) > at java.net.Socket.<init>(Socket.java:280) > at > org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) > at > org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) > at > org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) > at > org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) > at > 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.
[jira] [Issue Comment Deleted] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-5837: Comment: was deleted (was: The master doesn't have this problem. The other parts works well except the webUI in yarn view in 1.2~1.3 releases. I found a PR https://github.com/apache/spark/pull/2858 try to fix this issue and it should be included from 1.2, but the problem happens with 1.2 from my side. I think there are some other patches in 1.4 or 1.5 fixed this issue, do you know this patch) > HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode > -- > > Key: SPARK-5837 > URL: https://issues.apache.org/jira/browse/SPARK-5837 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0, 1.2.1 >Reporter: Marco Capuccini > > Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the > Spark UI if I run over yarn (version 2.4.0): > HTTP ERROR 500 > Problem accessing /proxy/application_1423564210894_0017/. 
Reason: > Connection refused > Caused by: > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:579) > at java.net.Socket.connect(Socket.java:528) > at java.net.Socket.<init>(Socket.java:425) > at java.net.Socket.<init>(Socket.java:280) > at > org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) > at > org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) > at > org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) > at > org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) > at > 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.j
[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679817#comment-14679817 ] hujiayin commented on SPARK-5837: - The master doesn't have this problem. The other parts work well; only the web UI in the YARN view is broken in the 1.2~1.3 releases. I found a PR https://github.com/apache/spark/pull/2858 that tries to fix this issue and should be included since 1.2, but the problem still happens with 1.2 on my side. I think some other patches in 1.4 or 1.5 fixed this issue; do you know which patch? > HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode > -- > > Key: SPARK-5837 > URL: https://issues.apache.org/jira/browse/SPARK-5837 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0, 1.2.1 >Reporter: Marco Capuccini > > Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the > Spark UI if I run over yarn (version 2.4.0): > HTTP ERROR 500 > Problem accessing /proxy/application_1423564210894_0017/. 
Reason: > Connection refused > Caused by: > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:579) > at java.net.Socket.connect(Socket.java:528) > at java.net.Socket.<init>(Socket.java:425) > at java.net.Socket.<init>(Socket.java:280) > at > org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) > at > org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) > at > org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) > at > org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) > at > 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.
[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679782#comment-14679782 ] hujiayin commented on SPARK-5837: - Yes, I've tried all the settings above. The issue only happens with versions 1.2~1.3. My Hadoop is hadoop-2.5.0-cdh5.3.2. > HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode > -- > > Key: SPARK-5837 > URL: https://issues.apache.org/jira/browse/SPARK-5837 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0, 1.2.1 >Reporter: Marco Capuccini > > Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the > Spark UI if I run over yarn (version 2.4.0): > HTTP ERROR 500 > Problem accessing /proxy/application_1423564210894_0017/. Reason: > Connection refused > Caused by: > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:579) > at java.net.Socket.connect(Socket.java:528) > at java.net.Socket.<init>(Socket.java:425) > at java.net.Socket.<init>(Socket.java:280) > at > org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) > at > org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) > at > org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) > at > org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) > at > 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) > at > 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servl
[jira] [Updated] (SPARK-9779) HTTP500 revisit when open web-UI in yarn-cluster yarn-client mode (1.2-1.3)
[ https://issues.apache.org/jira/browse/SPARK-9779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-9779: Description: When I try to open the web-UI from AM in yarn-cluster mode, I meet the HTTP500 error as described in https://issues.apache.org/jira/browse/SPARK-5837 I want to collect stage logs of a feature in Spark1.2 for a performance problem. The error doesn't happen in the latest spark build and only happens in Spark1.2-1.3. The hadoop yarn related configuration are the same between the different Spark versions. was: When I try to open the web-UI from AM in yarn-cluster mode, I meet the HTTP500 error as described in https://issues.apache.org/jira/browse/SPARK-5837 I want to collect stage logs of a feature in Spark1.2 to find a performance problem. The error doesn't happen in the latest spark build and only happens in Spark1.2-1.3. The hadoop yarn related configuration are the same between the different Spark versions. > HTTP500 revisit when open web-UI in yarn-cluster yarn-client mode (1.2-1.3) > > > Key: SPARK-9779 > URL: https://issues.apache.org/jira/browse/SPARK-9779 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0 > Environment: hadoop2.5.0-cdh5.3.2 >Reporter: hujiayin > Fix For: 1.4.2 > > > When I try to open the web-UI from AM in yarn-cluster mode, I meet the > HTTP500 error as described in > https://issues.apache.org/jira/browse/SPARK-5837 I want to collect stage logs > of a feature in Spark1.2 for a performance problem. > The error doesn't happen in the latest spark build and only happens in > Spark1.2-1.3. The hadoop yarn related configuration are the same between the > different Spark versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9779) HTTP500 revisit when open web-UI in yarn-cluster yarn-client mode (1.2-1.3)
[ https://issues.apache.org/jira/browse/SPARK-9779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-9779: Description: When I try to open the web-UI from AM in yarn-cluster mode, I meet the HTTP500 error as described in https://issues.apache.org/jira/browse/SPARK-5837 I want to collect stage logs of a feature in Spark1.2 to find a performance problem. The error doesn't happen in the latest spark build and only happens in Spark1.2-1.3. The hadoop yarn related configuration are the same between the different Spark versions. was: When I try to open the web-UI from AM in yarn-cluster mode, I meet the HTTP500 error as described in https://issues.apache.org/jira/browse/SPARK-5837 I meet this issue is because I want to collect stage logs of a feature in Spark1.2 to find a performance problem. The error doesn't happen in the latest spark build and only happens in Spark1.2-1.3. The hadoop yarn related configuration are the same between the different Spark versions. > HTTP500 revisit when open web-UI in yarn-cluster yarn-client mode (1.2-1.3) > > > Key: SPARK-9779 > URL: https://issues.apache.org/jira/browse/SPARK-9779 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0 > Environment: hadoop2.5.0-cdh5.3.2 >Reporter: hujiayin > Fix For: 1.4.2 > > > When I try to open the web-UI from AM in yarn-cluster mode, I meet the > HTTP500 error as described in > https://issues.apache.org/jira/browse/SPARK-5837 I want to collect stage logs > of a feature in Spark1.2 to find a performance problem. > The error doesn't happen in the latest spark build and only happens in > Spark1.2-1.3. The hadoop yarn related configuration are the same between the > different Spark versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9779) HTTP500 revisit when open web-UI in yarn-cluster yarn-client mode (1.2-1.3)
hujiayin created SPARK-9779: --- Summary: HTTP500 revisit when open web-UI in yarn-cluster yarn-client mode (1.2-1.3) Key: SPARK-9779 URL: https://issues.apache.org/jira/browse/SPARK-9779 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0, 1.2.2, 1.2.1, 1.2.0 Environment: hadoop2.5.0-cdh5.3.2 Reporter: hujiayin Fix For: 1.4.2 When I try to open the web UI from the AM in yarn-cluster mode, I hit the HTTP 500 error described in https://issues.apache.org/jira/browse/SPARK-5837. I hit this issue because I want to collect stage logs of a feature in Spark 1.2 to find a performance problem. The error doesn't happen in the latest Spark build and only happens in Spark 1.2-1.3. The Hadoop/YARN-related configuration is the same across the different Spark versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679730#comment-14679730 ] hujiayin commented on SPARK-5837: - Hi Sean, I meet this problem on Spark 1.2, 1.2.1, 1.2.2, and 1.3 again, but it doesn't happen on the latest code with the same Hadoop configuration. I started the 1.2 history server. The Hadoop version is hadoop-2.5.0-cdh5.3.2. I'm hitting this problem again because I need to get the log of a feature in 1.2 for a performance issue, but I cannot get the stage log now. > HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode > -- > > Key: SPARK-5837 > URL: https://issues.apache.org/jira/browse/SPARK-5837 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0, 1.2.1 >Reporter: Marco Capuccini > > Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the > Spark UI if I run over yarn (version 2.4.0): > HTTP ERROR 500 > Problem accessing /proxy/application_1423564210894_0017/. 
Reason: > Connection refused > Caused by: > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:579) > at java.net.Socket.connect(Socket.java:528) > at java.net.Socket.<init>(Socket.java:425) > at java.net.Socket.<init>(Socket.java:280) > at > org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) > at > org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) > at > org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) > at > org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) > at > 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.do
[jira] [Commented] (SPARK-9129) Integrate convolutional deep belief networks for visual recognition tasks
[ https://issues.apache.org/jira/browse/SPARK-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642144#comment-14642144 ] hujiayin commented on SPARK-9129: - ok : ) > Integrate convolutional deep belief networks for visual recognition tasks > --- > > Key: SPARK-9129 > URL: https://issues.apache.org/jira/browse/SPARK-9129 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.4.0 >Reporter: hujiayin > Labels: patch > Original Estimate: 1,008h > Remaining Estimate: 1,008h > > There has been much interest in unsupervised learning of hierarchical > generative models such as deep belief networks. Scaling such models to > full-sized, high-dimensional images remains a difficult problem. Some users > complain about the performance and convergence speed of a model. Integrate > this to create a new usage for Spark in visual related tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9129) Integrate convolutional deep belief networks for visual recognition tasks
hujiayin created SPARK-9129:
---
Summary: Integrate convolutional deep belief networks for visual recognition tasks
Key: SPARK-9129
URL: https://issues.apache.org/jira/browse/SPARK-9129
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.4.0
Reporter: hujiayin

There has been much interest in unsupervised learning of hierarchical generative models such as deep belief networks. Scaling such models to full-sized, high-dimensional images remains a difficult problem. Some users complain about the performance and convergence speed of such models. Integrating this would create a new use of Spark in visual recognition tasks.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616275#comment-14616275 ] hujiayin commented on SPARK-6724: - ok, : ) > Model import/export for FPGrowth > > > Key: SPARK-6724 > URL: https://issues.apache.org/jira/browse/SPARK-6724 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-6724: Comment: was deleted (was: Can I take a look at this issue?) > Model import/export for FPGrowth > > > Key: SPARK-6724 > URL: https://issues.apache.org/jira/browse/SPARK-6724 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616264#comment-14616264 ] hujiayin commented on SPARK-6724: - Can I take a look at this issue? > Model import/export for FPGrowth > > > Key: SPARK-6724 > URL: https://issues.apache.org/jira/browse/SPARK-6724 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611553#comment-14611553 ] hujiayin commented on SPARK-5682:

Since the encrypted shuffle in spark focuses on the common module, it may not be good to use the hadoop API. On the other hand, the AES solution is a bit heavy for encoding/decoding live streaming data.

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle
> Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx
>
> Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle; these are also used in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms openssl provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
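For reference, JceAesCtrCryptoCodec is essentially a wrapper over the JDK's JCE cipher API. A minimal sketch of what AES/CTR encryption and decryption look like with plain JCE (the key, IV, and class name here are hypothetical demo values, not the codec's actual implementation; a real deployment would derive the key from credentials rather than hard-code it):

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.util.Arrays;

public class AesCtrDemo {
    // AES/CTR is a stream mode: encrypt and decrypt apply the same keystream XOR,
    // so no padding is needed and data length is preserved.
    static byte[] crypt(int mode, byte[] key, byte[] iv, byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return c.doFinal(data);
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16]; // 128-bit demo key (all zeros -- never do this in production)
        byte[] iv = new byte[16];  // initial counter block
        byte[] plain = "shuffle block".getBytes("UTF-8");

        byte[] cipherText = crypt(Cipher.ENCRYPT_MODE, key, iv, plain);
        byte[] roundTrip = crypt(Cipher.DECRYPT_MODE, key, iv, cipherText);

        System.out.println(Arrays.equals(plain, roundTrip)); // prints "true"
    }
}
```

This also illustrates the performance concern raised above: every shuffle byte passes through an extra cipher transform on both the write and the read path.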
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611527#comment-14611527 ] hujiayin edited comment on SPARK-5682 at 7/2/15 6:10 AM:

Steps were added to encode and decode the data, so the performance will not be faster than before. At the same time, the code also has a security issue: for example, the plain text is saved in a configuration file and is finally used as part of the key. If you use a better cipher solution, the performance downgrade will be minimized; I think AES is a bit heavy. At the same time, the feature is based on hadoop 2.6, which is a limitation; that is why I said it relies on hadoop. Though the API is public and stable, you cannot ensure that it will not be changed, since it is not commercial software.

was (Author: hujiayin): Steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key In the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said rely on hadoop Though the API is public stable, however, you cannot ensure if the API will not be changed since it is not the comercial software.

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle
> Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx
>
> Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle; these are also used in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms openssl provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611527#comment-14611527 ] hujiayin edited comment on SPARK-5682 at 7/2/15 6:03 AM:

Steps were added to encode and decode the data, so the performance will not be faster than before. At the same time, the code also has a security issue: for example, the plain text is saved in a configuration file and is finally used as part of the key. At the same time, the feature is based on hadoop 2.6, which is a limitation; that is why I said it relies on hadoop. Though the API is public and stable, you cannot ensure that it will not be changed, since it is not commercial software.

was (Author: hujiayin): steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key in the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said rely on hadoop though it is public stable, however, you cannot ensure if the api will not be changed since it was not the comercial software.

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle
> Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx
>
> Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle; these are also used in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms openssl provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611527#comment-14611527 ] hujiayin edited comment on SPARK-5682 at 7/2/15 6:02 AM:

Steps were added to encode and decode the data, so the performance will not be faster than before. At the same time, the code also has a security issue: for example, the plain text is saved in a configuration file and is finally used as part of the key. At the same time, the feature is based on hadoop 2.6, which is a limitation; that is why I said it relies on hadoop. Though the API is public and stable, you cannot ensure that it will not be changed, since it is not commercial software.

was (Author: hujiayin): steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key in the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said reply on hadoop though it is public stable, however, you cannot ensure if the api will not be changed since it was not the comercial software.

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle
> Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx
>
> Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle; these are also used in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms openssl provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611527#comment-14611527 ] hujiayin commented on SPARK-5682:

Steps were added to encode and decode the data, so the performance will not be faster than before. At the same time, the code also has a security issue: for example, the plain text is saved in a configuration file and is finally used as part of the key. At the same time, the feature is based on hadoop 2.6, which is a limitation; that is why I said it relies on hadoop. Though it is public and stable, you cannot ensure that the API will not be changed, since it is not commercial software.

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle
> Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx
>
> Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle; these are also used in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms openssl provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611328#comment-14611328 ] hujiayin commented on SPARK-5682:

The solution relies on the hadoop API and may downgrade the performance. The AES algorithm is used for block data encryption in many cases. I think RC4 could be used to encode the stream, or a simple solution with an authentication header could be used. : )

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle
> Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx
>
> Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle; these are also used in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms openssl provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
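The "simple solution with an authentication header" suggested above could be a keyed MAC prepended to each transferred block. A minimal sketch with the JDK's HmacSHA256 (the key, payload, and class name are hypothetical demo values; note this authenticates integrity only and does not encrypt the data):

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.security.MessageDigest;

public class HmacHeaderDemo {
    // Compute a 32-byte HMAC-SHA256 tag that the sender prepends to a payload as its header.
    static byte[] tag(byte[] key, byte[] payload) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(payload);
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "shared-secret-demo".getBytes("UTF-8"); // demo key only
        byte[] payload = "shuffle block payload".getBytes("UTF-8");

        byte[] header = tag(key, payload); // sender prepends this to the block
        // Receiver recomputes the tag; MessageDigest.isEqual is a constant-time compare.
        boolean ok = MessageDigest.isEqual(header, tag(key, payload));
        System.out.println(ok); // prints "true"
    }
}
```

Compared with running every byte through a cipher, this costs one hash pass per block, which is why a MAC-only header is the lighter option when confidentiality is not required. (RC4, the other suggestion, has since been shown to have serious biases and is generally considered unsafe for new designs.)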
[jira] [Issue Comment Deleted] (SPARK-8370) Add API for data sources to register databases
[ https://issues.apache.org/jira/browse/SPARK-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-8370: Comment: was deleted (was: Does the datasource has size limitation)

> Add API for data sources to register databases
> --
>
> Key: SPARK-8370
> URL: https://issues.apache.org/jira/browse/SPARK-8370
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Santiago M. Mola
>
> This API would allow registering a database with a data source instead of just a table. Registering a data source database would register all its tables and keep the catalog updated. The catalog could delegate lookups of tables to the data source for a database registered with this API.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8370) Add API for data sources to register databases
[ https://issues.apache.org/jira/browse/SPARK-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589242#comment-14589242 ] hujiayin commented on SPARK-8370:

Does the data source have a size limitation?

> Add API for data sources to register databases
> --
>
> Key: SPARK-8370
> URL: https://issues.apache.org/jira/browse/SPARK-8370
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Santiago M. Mola
>
> This API would allow registering a database with a data source instead of just a table. Registering a data source database would register all its tables and keep the catalog updated. The catalog could delegate lookups of tables to the data source for a database registered with this API.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org