[jira] [Issue Comment Deleted] (SPARK-5682) Add encrypted shuffle in spark

2016-05-25 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-5682:

Comment: was deleted

(was: Extra steps are added to encrypt and decrypt the data, so the performance 
will not be faster than before. At the same time, the code also has a security 
issue: for example, the plain text is saved in the configuration file and is 
finally used as part of the key.

If you use a better cipher solution, the performance downgrade will be 
minimized. I think AES is a bit heavy.

At the same time, the feature is based on Hadoop 2.6, which is a limitation; 
that is why I said it relies on Hadoop.

Though the API is public and stable, you cannot guarantee that it will never 
change, since it is not commercial software.
)

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data 
> process safer. This feature is also necessary in Spark. AES is a 
> specification for the encryption of electronic data. There are five common 
> modes in AES, and CTR is one of them. We use the two codecs 
> JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, which are also used in 
> Hadoop encrypted shuffle, to enable Spark encrypted shuffle. 
> JceAesCtrCryptoCodec uses the encryption algorithms provided by the JDK, 
> while OpensslAesCtrCryptoCodec uses the encryption algorithms provided by 
> OpenSSL. 
> Because UGI credential info is used in the encrypted shuffle process, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.
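As a rough illustration of what an AES/CTR stream codec does to shuffle bytes, 
here is a minimal, self-contained Scala sketch using the JDK's javax.crypto 
(the JCE path). It is not the Hadoop JceAesCtrCryptoCodec itself, and the 
in-memory key/IV handling is for demonstration only; in the real feature the 
key material would come from the distributed credentials.

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.security.SecureRandom
import javax.crypto.{Cipher, CipherInputStream, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

object AesCtrStreamSketch {
  def main(args: Array[String]): Unit = {
    // Generate a random 128-bit key and IV; a real shuffle service would take
    // these from the job's credentials rather than creating them ad hoc.
    val random = new SecureRandom()
    val keyBytes = new Array[Byte](16); random.nextBytes(keyBytes)
    val ivBytes  = new Array[Byte](16); random.nextBytes(ivBytes)
    val key = new SecretKeySpec(keyBytes, "AES")
    val iv  = new IvParameterSpec(ivBytes)

    // Write path: wrap the output stream so shuffle bytes are encrypted.
    val enc = Cipher.getInstance("AES/CTR/NoPadding")
    enc.init(Cipher.ENCRYPT_MODE, key, iv)
    val sink = new ByteArrayOutputStream()
    val encrypted = new CipherOutputStream(sink, enc)
    encrypted.write("shuffle block bytes".getBytes("UTF-8"))
    encrypted.close()

    // Read path: wrap the input stream with the same key/IV to decrypt.
    val dec = Cipher.getInstance("AES/CTR/NoPadding")
    dec.init(Cipher.DECRYPT_MODE, key, iv)
    val decrypted =
      new CipherInputStream(new ByteArrayInputStream(sink.toByteArray), dec)
    println(scala.io.Source.fromInputStream(decrypted, "UTF-8").mkString)
  }
}
{code}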



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5682) Add encrypted shuffle in spark

2016-05-25 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-5682:

Comment: was deleted

(was: The solution relies on the Hadoop API and may downgrade the performance. 
The AES algorithm is used for block data encryption in many cases. I think RC4 
could be used to encrypt the stream, or a simpler solution with an 
authentication header could be used.   : ))

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data 
> process safer. This feature is also necessary in Spark. AES is a 
> specification for the encryption of electronic data. There are five common 
> modes in AES, and CTR is one of them. We use the two codecs 
> JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, which are also used in 
> Hadoop encrypted shuffle, to enable Spark encrypted shuffle. 
> JceAesCtrCryptoCodec uses the encryption algorithms provided by the JDK, 
> while OpensslAesCtrCryptoCodec uses the encryption algorithms provided by 
> OpenSSL. 
> Because UGI credential info is used in the encrypted shuffle process, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5682) Add encrypted shuffle in spark

2016-05-25 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-5682:

Comment: was deleted

(was: Since the encrypted shuffle in Spark focuses on the common module, it may 
not be good to use the Hadoop API. On the other side, the AES solution is a bit 
heavy for encoding/decoding live streaming data.)

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data 
> process safer. This feature is also necessary in Spark. AES is a 
> specification for the encryption of electronic data. There are five common 
> modes in AES, and CTR is one of them. We use the two codecs 
> JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, which are also used in 
> Hadoop encrypted shuffle, to enable Spark encrypted shuffle. 
> JceAesCtrCryptoCodec uses the encryption algorithms provided by the JDK, 
> while OpensslAesCtrCryptoCodec uses the encryption algorithms provided by 
> OpenSSL. 
> Because UGI credential info is used in the encrypted shuffle process, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15174) DataFrame does not have correct number of rows after dropDuplicates

2016-05-12 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281296#comment-15281296
 ] 

hujiayin commented on SPARK-15174:
--

[~ihalsema] I just tested this case in my environment with the latest Spark. It 
gives the error message "Unable to infer schema for JSON at hdfs:///xxx/xx" if 
your directory is empty. If your file is empty, read.json also throws that 
error. If your file contains just "{}", then df.count = 1 and df.rdd.isEmpty = 
false in both cases. So I think read.json handles the exceptional cases and the 
bug is not there. 

> DataFrame does not have correct number of rows after dropDuplicates
> ---
>
> Key: SPARK-15174
> URL: https://issues.apache.org/jira/browse/SPARK-15174
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Ian Hellstrom
>
> If you read an empty file/folder with the {{SQLContext.read()}} function and 
> call {{DataFrame.dropDuplicates()}}, the number of rows is incorrect.
> {code}
> val input = "hdfs:///some/empty/directory"
> val df1 = sqlContext.read.json(input)
> val df2 = sqlContext.read.json(input).dropDuplicates
> df1.count == 0 // true
> df1.rdd.isEmpty // true
> df2.count == 0 // false: it's actually reported as 1
> df2.rdd.isEmpty // false
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-15174) DataFrame does not have correct number of rows after dropDuplicates

2016-05-12 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-15174:
-
Comment: was deleted

(was: I could fix it.)

> DataFrame does not have correct number of rows after dropDuplicates
> ---
>
> Key: SPARK-15174
> URL: https://issues.apache.org/jira/browse/SPARK-15174
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Ian Hellstrom
>
> If you read an empty file/folder with the {{SQLContext.read()}} function and 
> call {{DataFrame.dropDuplicates()}}, the number of rows is incorrect.
> {code}
> val input = "hdfs:///some/empty/directory"
> val df1 = sqlContext.read.json(input)
> val df2 = sqlContext.read.json(input).dropDuplicates
> df1.count == 0 // true
> df1.rdd.isEmpty // true
> df2.count == 0 // false: it's actually reported as 1
> df2.rdd.isEmpty // false
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15174) DataFrame does not have correct number of rows after dropDuplicates

2016-05-11 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273805#comment-15273805
 ] 

hujiayin edited comment on SPARK-15174 at 5/12/16 4:09 AM:
---

I could fix it.


was (Author: hujiayin):
@Ian Hellstrom, I think the issue is caused by "if 
(groupingExpressions.isEmpty) {  Statistics(sizeInBytes = 1) }" in 
basicLogicOperators.scala. I could fix it for you. 

> DataFrame does not have correct number of rows after dropDuplicates
> ---
>
> Key: SPARK-15174
> URL: https://issues.apache.org/jira/browse/SPARK-15174
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Ian Hellstrom
>
> If you read an empty file/folder with the {{SQLContext.read()}} function and 
> call {{DataFrame.dropDuplicates()}}, the number of rows is incorrect.
> {code}
> val input = "hdfs:///some/empty/directory"
> val df1 = sqlContext.read.json(input)
> val df2 = sqlContext.read.json(input).dropDuplicates
> df1.count == 0 // true
> df1.rdd.isEmpty // true
> df2.count == 0 // false: it's actually reported as 1
> df2.rdd.isEmpty // false
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14623) add label binarizer

2016-05-09 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-14623:
-
Description: 
It relates to https://issues.apache.org/jira/browse/SPARK-7445

Map the labels to 0/1. 
For example,
Input:
"yellow,green,red,green,0"
The labels: "0, green, red, yellow"
Output:
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
Refer to 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html



  was:
It relates to https://issues.apache.org/jira/browse/SPARK-7445

Map the labels to 0/1. 
For example,
Input:
"yellow,green,red,green,0"
The labels: "0, green, red, yellow"
Output:
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0 ,0, 0




> add label binarizer 
> 
>
> Key: SPARK-14623
> URL: https://issues.apache.org/jira/browse/SPARK-14623
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: hujiayin
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It relates to https://issues.apache.org/jira/browse/SPARK-7445
> Map the labels to 0/1. 
> For example,
> Input:
> "yellow,green,red,green,0"
> The labels: "0, green, red, yellow"
> Output:
> 0, 0, 0, 1
> 0, 1, 0, 0
> 0, 0, 1, 0
> 0, 1, 0, 0
> 1, 0, 0, 0
> Refer to 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
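A minimal sketch of the mapping described above, written as plain Scala rather 
than as a spark.ml Transformer; the binarize helper below is an assumption 
based on this description, not existing Spark API.

{code}
object LabelBinarizerSketch {
  // One row per input element, one 0/1 column per distinct label,
  // with labels ordered as in the example above.
  def binarize(input: Seq[String]): (Seq[String], Seq[Seq[Int]]) = {
    val labels = input.distinct.sorted            // "0", "green", "red", "yellow"
    val rows = input.map(x => labels.map(l => if (l == x) 1 else 0))
    (labels, rows)
  }

  def main(args: Array[String]): Unit = {
    val (labels, rows) = binarize(Seq("yellow", "green", "red", "green", "0"))
    println(labels.mkString(", "))                // 0, green, red, yellow
    rows.foreach(r => println(r.mkString(", ")))  // matches the matrix above
  }
}
{code}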



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15174) DataFrame does not have correct number of rows after dropDuplicates

2016-05-06 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273805#comment-15273805
 ] 

hujiayin commented on SPARK-15174:
--

@Ian Hellstrom, I think the issue is caused by "if 
(groupingExpressions.isEmpty) {  Statistics(sizeInBytes = 1) }" in 
basicLogicOperators.scala. I could fix it for you. 

> DataFrame does not have correct number of rows after dropDuplicates
> ---
>
> Key: SPARK-15174
> URL: https://issues.apache.org/jira/browse/SPARK-15174
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Ian Hellstrom
>
> If you read an empty file/folder with the {{SQLContext.read()}} function and 
> call {{DataFrame.dropDuplicates()}}, the number of rows is incorrect.
> {code}
> val input = "hdfs:///some/empty/directory"
> val df1 = sqlContext.read.json(input)
> val df2 = sqlContext.read.json(input).dropDuplicates
> df1.count == 0 // true
> df1.rdd.isEmpty // true
> df2.count == 0 // false: it's actually reported as 1
> df2.rdd.isEmpty // false
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala

2016-05-03 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270161#comment-15270161
 ] 

hujiayin edited comment on SPARK-14772 at 5/4/16 6:04 AM:
--

@holdenk, I had code for this issue but was busy with another project over the 
past few days. I have just started to look into PySpark and I look forward to 
your comments. 


was (Author: hujiayin):
@holdenk, I have a code for this issue and was busy with the other project in 
the past days. I just start to look into pyspark and look forward your 
comments. 

> Python ML Params.copy treats uid, paramMaps differently than Scala
> --
>
> Key: SPARK-14772
> URL: https://issues.apache.org/jira/browse/SPARK-14772
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> In PySpark, {{ml.param.Params.copy}} does not quite match the Scala 
> implementation:
> * It does not copy the UID
> * It does not respect the difference between defaultParamMap and paramMap.  
> This is an issue with {{_copyValues}}.
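To make the uid / defaultParamMap distinction concrete, here is a toy Scala 
sketch of the copy semantics being described; it is not Spark's ml.param 
implementation, only an illustration of keeping the uid and keeping defaults 
separate from explicitly set values.

{code}
// Toy model of Params.copy semantics (illustration only, not Spark code).
case class ToyParams(uid: String,
                     defaults: Map[String, Any],
                     explicit: Map[String, Any]) {
  def copyWith(extra: Map[String, Any]): ToyParams =
    // Keep the same uid; fold extra values into the explicitly-set map only,
    // so defaults and explicit settings stay distinguishable after the copy.
    copy(explicit = explicit ++ extra)
}

object ParamsCopyDemo extends App {
  val p = ToyParams("logreg_abc", Map("maxIter" -> 100), Map("regParam" -> 0.1))
  println(p.copyWith(Map("regParam" -> 0.3)))
  // ToyParams(logreg_abc,Map(maxIter -> 100),Map(regParam -> 0.3))
}
{code}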



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala

2016-05-03 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270161#comment-15270161
 ] 

hujiayin commented on SPARK-14772:
--

@holdenk, I have code for this issue but was busy with another project over the 
past few days. I have just started to look into PySpark and I look forward to 
your comments. 

> Python ML Params.copy treats uid, paramMaps differently than Scala
> --
>
> Key: SPARK-14772
> URL: https://issues.apache.org/jira/browse/SPARK-14772
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> In PySpark, {{ml.param.Params.copy}} does not quite match the Scala 
> implementation:
> * It does not copy the UID
> * It does not respect the difference between defaultParamMap and paramMap.  
> This is an issue with {{_copyValues}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala

2016-04-25 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-14772:
-
Comment: was deleted

(was: I can submit code to fix this issue; I'm testing it now.)

> Python ML Params.copy treats uid, paramMaps differently than Scala
> --
>
> Key: SPARK-14772
> URL: https://issues.apache.org/jira/browse/SPARK-14772
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> In PySpark, {{ml.param.Params.copy}} does not quite match the Scala 
> implementation:
> * It does not copy the UID
> * It does not respect the difference between defaultParamMap and paramMap.  
> This is an issue with {{_copyValues}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala

2016-04-21 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253035#comment-15253035
 ] 

hujiayin commented on SPARK-14772:
--

I can submit code to fix this issue; I'm testing it now.

> Python ML Params.copy treats uid, paramMaps differently than Scala
> --
>
> Key: SPARK-14772
> URL: https://issues.apache.org/jira/browse/SPARK-14772
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> In PySpark, {{ml.param.Params.copy}} does not quite match the Scala 
> implementation:
> * It does not copy the UID
> * It does not respect the difference between defaultParamMap and paramMap.  
> This is an issue with {{_copyValues}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14712) spark.ml LogisticRegressionModel.toString should summarize model

2016-04-18 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247141#comment-15247141
 ] 

hujiayin edited comment on SPARK-14712 at 4/19/16 3:59 AM:
---

Hi Gayathri, I think {{self}} already has {{numFeatures}} and {{numClasses}} 
defined, and I can submit code for this issue.


was (Author: hujiayin):
Hi Murali, I think self has the numFeatures and numClasses defined and I can 
submit a code for this issue.

> spark.ml LogisticRegressionModel.toString should summarize model
> 
>
> Key: SPARK-14712
> URL: https://issues.apache.org/jira/browse/SPARK-14712
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> spark.mllib LogisticRegressionModel overrides toString to print a little 
> model info.  We should do the same in spark.ml.  I'd recommend:
> * super.toString
> * numClasses
> * numFeatures
> We should also override {{__repr__}} in pyspark to do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14712) spark.ml LogisticRegressionModel.toString should summarize model

2016-04-18 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247141#comment-15247141
 ] 

hujiayin commented on SPARK-14712:
--

Hi Murali, I think {{self}} already has {{numFeatures}} and {{numClasses}} 
defined, and I can submit code for this issue.

> spark.ml LogisticRegressionModel.toString should summarize model
> 
>
> Key: SPARK-14712
> URL: https://issues.apache.org/jira/browse/SPARK-14712
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> spark.mllib LogisticRegressionModel overrides toString to print a little 
> model info.  We should do the same in spark.ml.  I'd recommend:
> * super.toString
> * numClasses
> * numFeatures
> We should also override {{__repr__}} in pyspark to do the same.
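A minimal Scala sketch of the kind of toString override being proposed (not the 
actual spark.ml patch); the class and field names here are placeholders for 
illustration.

{code}
// Stand-in for a fitted model that already exposes numClasses/numFeatures.
class ToyLogisticRegressionModel(val uid: String,
                                 val numClasses: Int,
                                 val numFeatures: Int) {
  // Summarize the model instead of printing the default object representation.
  override def toString: String =
    s"ToyLogisticRegressionModel: uid=$uid, " +
      s"numClasses=$numClasses, numFeatures=$numFeatures"
}

object ToStringDemo extends App {
  println(new ToyLogisticRegressionModel("logreg_1234", 2, 20))
}
{code}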



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14623) add label binarizer

2016-04-16 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244061#comment-15244061
 ] 

hujiayin edited comment on SPARK-14623 at 4/16/16 8:24 AM:
---

Hi Joseph, I think it is similar to combining StringIndexer + OneHotEncoder 
into one class, but the difference is that LabelBinarizer will collect the same 
element into one vector and will remember the position of the element in the 
input. 

For example, 
Input is "yellow,green,red,green,0"
LabelBinarizer retrieves the labels from the input, and the labels are "0, 
green, red, yellow"
Output is
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
The second column reflects that the element "green" appears at positions 1 and 
3 in the input. The four columns reflect the four labels: column 0 represents 
label "0", column 1 represents label "green", and so on. If I understand 
correctly, StringIndexer returns the category number of a label, and 
OneHotEncoder returns the single-high-1 binary representation of the category 
number.


was (Author: hujiayin):
Hi Joseph, I think it is similar as the combination of StringIndexer + 
OneHotEncoder into one class but the difference is the LabelBinarizer will 
collect the same element into one vector and will remember the position of the 
element in the input. 

For example, 
Input is "yellow,green,red,green,0"
Label Binarizer retrieves the labels from input and the labels are "0, green, 
red, yellow"
Output is
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0 ,0, 0
The second column reflects element "green" appears at positions 1 and 3 in the 
input. The 4 columns reflect the 4 labels. Column 0 represents label 0 and 
column 1 is label "green", so on. If I understand correctly, StringIndexer 
returns the category number of a label and OneHotEncoder returns the binary 
representation of the category number.

> add label binarizer 
> 
>
> Key: SPARK-14623
> URL: https://issues.apache.org/jira/browse/SPARK-14623
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: hujiayin
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It relates to https://issues.apache.org/jira/browse/SPARK-7445
> Map the labels to 0/1. 
> For example,
> Input:
> "yellow,green,red,green,0"
> The labels: "0, green, red, yellow"
> Output:
> 0, 0, 0, 1
> 0, 1, 0, 0
> 0, 0, 1, 0
> 0, 1, 0, 0
> 1, 0, 0, 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14623) add label binarizer

2016-04-16 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244061#comment-15244061
 ] 

hujiayin commented on SPARK-14623:
--

Hi Joseph, I think it is similar to combining StringIndexer + OneHotEncoder 
into one class, but the difference is that LabelBinarizer will collect the same 
element into one vector and will remember the position of the element in the 
input. 

For example, 
Input is "yellow,green,red,green,0"
LabelBinarizer retrieves the labels from the input, and the labels are "0, 
green, red, yellow"
Output is
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
The second column reflects that the element "green" appears at positions 1 and 
3 in the input. The four columns reflect the four labels: column 0 represents 
label "0", column 1 represents label "green", and so on. If I understand 
correctly, StringIndexer returns the category number of a label, and 
OneHotEncoder returns the binary representation of the category number.
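For comparison, a short sketch of the StringIndexer + OneHotEncoder combination 
mentioned above, using the existing spark.ml feature transformers; it assumes a 
running SQLContext (for example in spark-shell) and is only meant to contrast 
with the proposed LabelBinarizer.

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SQLContext

object IndexerEncoderSketch {
  def run(sqlContext: SQLContext): Unit = {
    import sqlContext.implicits._
    val df = Seq("yellow", "green", "red", "green", "0").toDF("label")

    // StringIndexer orders labels by frequency; OneHotEncoder then turns each
    // index into a 0/1 vector (dropLast disabled to keep one slot per label).
    val indexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndex")
    val encoder = new OneHotEncoder().setInputCol("labelIndex")
      .setOutputCol("labelVec").setDropLast(false)

    new Pipeline().setStages(Array(indexer, encoder))
      .fit(df).transform(df).show(false)
  }
}
{code}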

> add label binarizer 
> 
>
> Key: SPARK-14623
> URL: https://issues.apache.org/jira/browse/SPARK-14623
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: hujiayin
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It relates to https://issues.apache.org/jira/browse/SPARK-7445
> Map the labels to 0/1. 
> For example,
> Input:
> "yellow,green,red,green,0"
> The labels: "0, green, red, yellow"
> Output:
> 0, 0, 0, 1
> 0, 1, 0, 0
> 0, 0, 1, 0
> 0, 1, 0, 0
> 1, 0, 0, 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14523) Feature parity for Statistics ML with MLlib

2016-04-15 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242661#comment-15242661
 ] 

hujiayin commented on SPARK-14523:
--

We can add ARIMA to Spark.

> Feature parity for Statistics ML with MLlib
> ---
>
> Key: SPARK-14523
> URL: https://issues.apache.org/jira/browse/SPARK-14523
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>
> Some statistics functions have been supported by DataFrame directly. Use this 
> jira to discuss/design the statistics package in Spark.ML and its function 
> scope. Hypothesis test and correlation computation may still need to expose 
> independent interfaces.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14623) add label binarizer

2016-04-14 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-14623:
-
Description: 
It relates to https://issues.apache.org/jira/browse/SPARK-7445

Map the labels to 0/1. 
For example,
Input:
"yellow,green,red,green,0"
The labels: "0, green, red, yellow"
Output:
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0



  was:
It relates to https://issues.apache.org/jira/browse/SPARK-7445

Map the labels to 0/1. 
For example,
Input:
"yellow,green,red,green,0"
The labels: "0, green, red, yellow"
Output:
0, 0, 0, 0, 1, 
0, 1, 0, 1, 0, 
0, 0, 1, 0, 0, 
1, 0, 0, 0, 0





> add label binarizer 
> 
>
> Key: SPARK-14623
> URL: https://issues.apache.org/jira/browse/SPARK-14623
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: hujiayin
>Priority: Minor
> Fix For: 2.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It relates to https://issues.apache.org/jira/browse/SPARK-7445
> Map the labels to 0/1. 
> For example,
> Input:
> "yellow,green,red,green,0"
> The labels: "0, green, red, yellow"
> Output:
> 0, 0, 0, 1
> 0, 1, 0, 0
> 0, 0, 1, 0
> 0, 1, 0, 0
> 1, 0, 0, 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14622) Retain lost executors status

2016-04-13 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240550#comment-15240550
 ] 

hujiayin commented on SPARK-14622:
--

I think it would also be better to show the number of lost executors; when the 
number is clicked, the detailed information would appear below it.

> Retain lost executors status
> 
>
> Key: SPARK-14622
> URL: https://issues.apache.org/jira/browse/SPARK-14622
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Qingyang Hong
>Priority: Minor
> Fix For: 1.6.0
>
>
> In the 'executors' dashboard, it is necessary to maintain a list of those 
> executors that have been lost. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14623) add label binarizer

2016-04-13 Thread hujiayin (JIRA)
hujiayin created SPARK-14623:


 Summary: add label binarizer 
 Key: SPARK-14623
 URL: https://issues.apache.org/jira/browse/SPARK-14623
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.6.1
Reporter: hujiayin
Priority: Minor
 Fix For: 2.0.0


It relates to https://issues.apache.org/jira/browse/SPARK-7445

Map the labels to 0/1. 
For example,
Input:
"yellow,green,red,green,0"
The labels: "0, green, red, yellow"
Output:
0, 0, 0, 0, 1, 
0, 1, 0, 1, 0, 
0, 0, 1, 0, 0, 
1, 0, 0, 0, 0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-7445) StringIndexer should handle binary labels properly

2016-04-13 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-7445:

Comment: was deleted

(was: If no one is working on it, I'd like to submit code for this issue.)

> StringIndexer should handle binary labels properly
> --
>
> Key: SPARK-7445
> URL: https://issues.apache.org/jira/browse/SPARK-7445
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> StringIndexer orders labels by their counts. However, for binary labels, we 
> should really map negatives to 0 and positives to 1, so we can put in special 
> rules for binary labels:
> 1. "+1"/"-1", "1"/"-1", "1"/"0"
> 2. "yes"/"no"
> 3. "true"/"false"
> Another option is to allow users to provide a list of labels, and we use that 
> ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7445) StringIndexer should handle binary labels properly

2016-04-13 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238799#comment-15238799
 ] 

hujiayin commented on SPARK-7445:
-

If no one is working on it, I'd like to submit code for this issue.

> StringIndexer should handle binary labels properly
> --
>
> Key: SPARK-7445
> URL: https://issues.apache.org/jira/browse/SPARK-7445
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> StringIndexer orders labels by their counts. However, for binary labels, we 
> should really map negatives to 0 and positives to 1, so we can put in special 
> rules for binary labels:
> 1. "+1"/"-1", "1"/"-1", "1"/"0"
> 2. "yes"/"no"
> 3. "true"/"false"
> Another option is to allow users to provide a list of labels, and we use that 
> ordering.
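A tiny Scala sketch of the special-case mapping suggested above (an 
illustration only, not StringIndexer's actual implementation):

{code}
object BinaryLabelRule {
  // Map the common binary spellings listed above to 0.0/1.0; anything else
  // returns None and would fall back to the usual frequency-based ordering.
  def binaryIndex(label: String): Option[Double] = label.trim.toLowerCase match {
    case "+1" | "1" | "yes" | "true"  => Some(1.0)
    case "-1" | "0" | "no"  | "false" => Some(0.0)
    case _                            => None
  }

  def main(args: Array[String]): Unit =
    Seq("+1", "-1", "yes", "false", "maybe")
      .foreach(l => println(s"$l -> ${binaryIndex(l)}"))
}
{code}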



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2016-02-02 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-4036:

Comment: was deleted

(was: latest CRF codes)

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf, crf-spark.zip, 
> dig-hair-eye-train.model, features.hair-eye, sample-input, sample-output
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2016-02-02 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-4036:

Attachment: crf-spark.zip

latest CRF codes

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf, crf-spark.zip, 
> dig-hair-eye-train.model, features.hair-eye, sample-input, sample-output
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-12-15 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059266#comment-15059266
 ] 

hujiayin commented on SPARK-4036:
-

Hi Andrew,
With the latest template file you sent me, I think your case can run with the 
CRF. However, what are the labels in your sample-input? Since CRF is a 
structured prediction method, a person needs to label the features to train the 
model. 
Thanks

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf, dig-hair-eye-train.model, 
> features.hair-eye, sample-input, sample-output
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-12-07 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010353#comment-15010353
 ] 

hujiayin edited comment on SPARK-4036 at 12/7/15 8:10 AM:
--

Hi Sasaki,
I'm not sure whether you are still working on it, as the JIRA is still open. If 
you have a PR, you could close my PR https://github.com/apache/spark/pull/9794
Besides Sasaki's design, I referenced these other documents for the 
implementation:
http://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers
http://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf




was (Author: hujiayin):
Hi Sasaki,
I'm not sure if you worked on it as the jira is still open. If you have a PR, 
you could close my PR https://github.com/apache/spark/pull/9794
I referenced the other document besides Sasaki's design for the implementation.
http://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf



> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-12-06 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044223#comment-15044223
 ] 

hujiayin commented on SPARK-4036:
-

Hi Andrew,

The code is implemented in Scala and integrated with Spark. I tested it after I 
implemented it, and I also verified it against some papers listed inside the 
code and in this JIRA. Could you send me your features and models so that I can 
do further testing?



> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-12-06 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044223#comment-15044223
 ] 

hujiayin edited comment on SPARK-4036 at 12/7/15 1:00 AM:
--

Hi Andrew,

The code is implemented in Scala and integrated with Spark. I tested it after I 
implemented it, and I also verified it against some papers listed inside the 
code and in this JIRA. Could you send me your features and models so that I can 
do further testing? 




was (Author: hujiayin):
Hi Andrew,

The code is implemented by Scala and integrated with Spark. I tested it after I 
implemented it. I also verify it with some papers listed inside the code and 
this jira. Could you send me your features and models that I can have further 
testing?



> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-4036:

Comment: was deleted

(was: ok)

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-4036:

Comment: was deleted

(was: ok)

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-4036:

Comment: was deleted

(was: ok)

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013095#comment-15013095
 ] 

hujiayin commented on SPARK-4036:
-

ok

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013097#comment-15013097
 ] 

hujiayin commented on SPARK-4036:
-

ok

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013098#comment-15013098
 ] 

hujiayin commented on SPARK-4036:
-

ok

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013096#comment-15013096
 ] 

hujiayin commented on SPARK-4036:
-

ok

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010353#comment-15010353
 ] 

hujiayin edited comment on SPARK-4036 at 11/19/15 7:09 AM:
---

Hi Sasaki,
I'm not sure whether you are still working on it, as the JIRA is still open. If 
you have a PR, you could close my PR https://github.com/apache/spark/pull/9794
Besides Sasaki's design, I referenced this other document for the 
implementation:
http://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf




was (Author: hujiayin):
Hi Sasaki,
I'm not sure if you worked on it as the jira is still open. If you have a PR, 
you could close my PR https://github.com/apache/spark/pull/9794


> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-17 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010353#comment-15010353
 ] 

hujiayin commented on SPARK-4036:
-

Hi Sasaki,
I'm not sure whether you are still working on it, as the JIRA is still open. If 
you have a PR, you could close my PR https://github.com/apache/spark/pull/9794


> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-27 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-11200:
-
Comment: was deleted

(was: sparkscore found that it has been happening since commit cf2e0ae7, and it 
was resolved today.)

> NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
> -
>
> Key: SPARK-11200
> URL: https://issues.apache.org/jira/browse/SPARK-11200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: hujiayin
>
> The endless message "cannot send ${message} because RpcEnv is closed" pops up 
> after starting any of the MLlib workloads and keeps repeating until a person 
> stops it manually. The environment is hadoop-cdh-5.3.2 with the Spark master 
> version run in yarn-client mode. The error comes from NettyRpcEnv.scala. I 
> don't have enough time to look into this issue right now, but I can verify 
> the issue in my environment with you if you have a fix. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-27 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975993#comment-14975993
 ] 

hujiayin commented on SPARK-11200:
--

sparkscore found that it has been happening since commit cf2e0ae7, and it was 
resolved today.

> NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
> -
>
> Key: SPARK-11200
> URL: https://issues.apache.org/jira/browse/SPARK-11200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: hujiayin
>
> The endless message "cannot send ${message} because RpcEnv is closed" pops up 
> after starting any of the MLlib workloads and keeps repeating until a person 
> stops it manually. The environment is hadoop-cdh-5.3.2 with the Spark master 
> version run in yarn-client mode. The error comes from NettyRpcEnv.scala. I 
> don't have enough time to look into this issue right now, but I can verify 
> the issue in my environment with you if you have a fix. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-21 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966119#comment-14966119
 ] 

hujiayin edited comment on SPARK-11200 at 10/22/15 2:37 AM:


Hi Sean,

I don't think it is related to the environment, since version 1.5.1 works in 
the same environment. The error "cannot send ${message} because RpcEnv is 
closed" looped when a task was launched in yarn-client mode; ${message} was 
specific IP address and port information. I guess the client cannot be created 
and the process is blocked. I'll look into it when I have some time. 


was (Author: hujiayin):
Hi Sean,

I think it doesn't relate to environment since version 1.4 works in the same 
environment. The error "cannot send ${message} because RpcEnv is closed" looped 
when launched a task from yarn client mode. ${message} was specific IP address 
and port related information. I guess the client cannot be created and the 
process was blocked. I'll look into it after I have sometime. 

> NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
> -
>
> Key: SPARK-11200
> URL: https://issues.apache.org/jira/browse/SPARK-11200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: hujiayin
>
> The endless message "cannot send ${message} because RpcEnv is closed" pops up 
> after starting any of the MLlib workloads and keeps repeating until a person 
> stops it manually. The environment is hadoop-cdh-5.3.2 with the Spark master 
> version run in yarn-client mode. The error comes from NettyRpcEnv.scala. I 
> don't have enough time to look into this issue right now, but I can verify 
> the issue in my environment with you if you have a fix. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-21 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-11200:
-
Affects Version/s: (was: 1.5.1)
   1.6.0

> NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
> -
>
> Key: SPARK-11200
> URL: https://issues.apache.org/jira/browse/SPARK-11200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: hujiayin
>
> The endless message "cannot send ${message} because RpcEnv is closed" pops up 
> after starting any of the MLlib workloads and keeps repeating until a person 
> stops it manually. The environment is hadoop-cdh-5.3.2 with the Spark master 
> version run in yarn-client mode. The error comes from NettyRpcEnv.scala. I 
> don't have enough time to look into this issue right now, but I can verify 
> the issue in my environment with you if you have a fix. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-21 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966391#comment-14966391
 ] 

hujiayin commented on SPARK-11200:
--

Hi Sean,
I didn't close/reopen it. The issue has been there since last Friday. The error 
texts were different, but all related to Netty RPC communication failures.   

> NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
> -
>
> Key: SPARK-11200
> URL: https://issues.apache.org/jira/browse/SPARK-11200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: hujiayin
>
> The endless message "cannot send ${message} because RpcEnv is closed" pops up 
> repeatedly after starting any MLlib workload until it is stopped manually. The 
> environment is hadoop-cdh-5.3.2 with the Spark master branch running in yarn-client 
> mode. The error comes from NettyRpcEnv.scala. I don't have enough time to look into 
> this issue right now, but I can verify the issue in my environment if you have a fix. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-20 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966119#comment-14966119
 ] 

hujiayin edited comment on SPARK-11200 at 10/21/15 2:13 AM:


Hi Sean,

I don't think it's related to the environment, since version 1.4 works in the 
same environment. The error "cannot send ${message} because RpcEnv is closed" 
looped when a task was launched in yarn-client mode; ${message} contained specific 
IP address and port information. I guess the client could not be created and the 
process was blocked. I'll look into it when I have some time. 


was (Author: hujiayin):
Hi Sean,

I don't think it's related to the environment, since version 1.4 works in the 
same environment. The error "cannot send ${message} because RpcEnv is closed" 
looped when a task was launched in yarn-client mode; ${message} contained specific 
IP address and port information. I guess the client could not be created and the 
process was blocked. I'll look into it when I have some time. 

> NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
> -
>
> Key: SPARK-11200
> URL: https://issues.apache.org/jira/browse/SPARK-11200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: hujiayin
>
> The endless message "cannot send ${message} because RpcEnv is closed" pops up 
> repeatedly after starting any MLlib workload until it is stopped manually. The 
> environment is hadoop-cdh-5.3.2 with the Spark master branch running in yarn-client 
> mode. The error comes from NettyRpcEnv.scala. I don't have enough time to look into 
> this issue right now, but I can verify the issue in my environment if you have a fix. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-20 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966119#comment-14966119
 ] 

hujiayin commented on SPARK-11200:
--

Hi Sean,

I don't think it's related to the environment, since version 1.4 works in the 
same environment. The error "cannot send ${message} because RpcEnv is closed" 
looped when a task was launched in yarn-client mode; ${message} contained specific 
IP address and port information. I guess the client could not be created and the 
process was blocked. I'll look into it when I have some time. 

> NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
> -
>
> Key: SPARK-11200
> URL: https://issues.apache.org/jira/browse/SPARK-11200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: hujiayin
>
> The endless message "cannot send ${message} because RpcEnv is closed" pops up 
> repeatedly after starting any MLlib workload until it is stopped manually. The 
> environment is hadoop-cdh-5.3.2 with the Spark master branch running in yarn-client 
> mode. The error comes from NettyRpcEnv.scala. I don't have enough time to look into 
> this issue right now, but I can verify the issue in my environment if you have a fix. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-19 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-11200:
-
Description: 
The endless message "cannot send ${message} because RpcEnv is closed" pops up 
repeatedly after starting any MLlib workload until it is stopped manually. The 
environment is hadoop-cdh-5.3.2 with the Spark master branch running in yarn-client 
mode. The error comes from NettyRpcEnv.scala. I don't have enough time to look into 
this issue right now, but I can verify the issue in my environment if you have a fix. 
 

  was:
The endless message "cannot send ${message} because RpcEnv is closed" pops up 
repeatedly after starting any MLlib workload until it is stopped manually. The 
environment is hadoop-cdh-5.3.2 with the Spark master branch running in yarn-client 
mode. The error comes from NettyRpcEnv.scala. I don't have enough time to look into 
this issue right now, but I can verify the issue in my environment if you have a fix. 
 


> NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
> -
>
> Key: SPARK-11200
> URL: https://issues.apache.org/jira/browse/SPARK-11200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: hujiayin
>
> The endless message "cannot send ${message} because RpcEnv is closed" pops up 
> repeatedly after starting any MLlib workload until it is stopped manually. The 
> environment is hadoop-cdh-5.3.2 with the Spark master branch running in yarn-client 
> mode. The error comes from NettyRpcEnv.scala. I don't have enough time to look into 
> this issue right now, but I can verify the issue in my environment if you have a fix. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-19 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-11200:
-
Description: 
The endless message "cannot send ${message} because RpcEnv is closed" pops up 
repeatedly after starting any MLlib workload until it is stopped manually. The 
environment is hadoop-cdh-5.3.2 with the Spark master branch running in yarn-client 
mode. The error comes from NettyRpcEnv.scala. I don't have enough time to look into 
this issue right now, but I can verify the issue in my environment if you have a fix. 
 

  was:
The endless message "cannot send ${message} because RpcEnv is closed" pops up 
repeatedly after starting any MLlib workload until it is stopped manually. The 
environment is hadoop-cdh-5.3.2 with the Spark master branch running in yarn-client 
mode. The error comes from NettyRpcEnv.scala. I don't have enough time to look into 
this issue right now, but I can verify the issue in my environment if you have a fix. 
 


> NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
> -
>
> Key: SPARK-11200
> URL: https://issues.apache.org/jira/browse/SPARK-11200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: hujiayin
>
> The endless message "cannot send ${message} because RpcEnv is closed" pops up 
> repeatedly after starting any MLlib workload until it is stopped manually. The 
> environment is hadoop-cdh-5.3.2 with the Spark master branch running in yarn-client 
> mode. The error comes from NettyRpcEnv.scala. I don't have enough time to look into 
> this issue right now, but I can verify the issue in my environment if you have a fix. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-19 Thread hujiayin (JIRA)
hujiayin created SPARK-11200:


 Summary: NettyRpcEnv endless message "cannot send ${message} 
because RpcEnv is closed"
 Key: SPARK-11200
 URL: https://issues.apache.org/jira/browse/SPARK-11200
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: hujiayin


The endless message "cannot send ${message} because RpcEnv is closed" pops up 
repeatedly after starting any MLlib workload until it is stopped manually. The 
environment is hadoop-cdh-5.3.2 with the Spark master branch running in yarn-client 
mode. The error comes from NettyRpcEnv.scala. I don't have enough time to look into 
this issue right now, but I can verify the issue in my environment if you have a fix. 
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6724) Model import/export for FPGrowth

2015-08-30 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-6724:

Comment: was deleted

(was: ok, : ))

> Model import/export for FPGrowth
> 
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient

2015-08-30 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-10329:
-
Comment: was deleted

(was: ok, I will try to fix it today)

> Cost RDD in k-means|| initialization is not storage-efficient
> -
>
> Key: SPARK-10329
> URL: https://issues.apache.org/jira/browse/SPARK-10329
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Xiangrui Meng
>Assignee: hujiayin
>  Labels: clustering
>
> Currently we use `RDD[Vector]` to store point cost during k-means|| 
> initialization, where each `Vector` has size `runs`. This is not 
> storage-efficient because `runs` is usually 1 and then each record is a 
> Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
> introduce two objects (DenseVector and its values array), which could cost 16 
> bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
> for reporting this issue!
> There are several solutions:
> 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
> record.
> 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
> `Array[Double]` object covers 1024 instances, which could remove most of the 
> overhead.
> Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
> kicking out the training dataset from memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient

2015-08-30 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721799#comment-14721799
 ] 

hujiayin commented on SPARK-10329:
--

ok, I will try to fix it today

> Cost RDD in k-means|| initialization is not storage-efficient
> -
>
> Key: SPARK-10329
> URL: https://issues.apache.org/jira/browse/SPARK-10329
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Xiangrui Meng
>Assignee: hujiayin
>  Labels: clustering
>
> Currently we use `RDD[Vector]` to store point cost during k-means|| 
> initialization, where each `Vector` has size `runs`. This is not 
> storage-efficient because `runs` is usually 1 and then each record is a 
> Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
> introduce two objects (DenseVector and its values array), which could cost 16 
> bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
> for reporting this issue!
> There are several solutions:
> 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
> record.
> 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
> `Array[Double]` object covers 1024 instances, which could remove most of the 
> overhead.
> Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
> kicking out the training dataset from memory.
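
To make option 1 above concrete, here is a minimal Scala sketch (names such as 
initialCosts are illustrative only, not the actual MLlib patch): each point keeps a 
bare Array[Double] of length `runs`, and the cost RDD is persisted with 
MEMORY_AND_DISK so it cannot evict the training data.

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // One plain Array[Double] per point (8 bytes per run) instead of a size-`runs`
    // DenseVector, persisted so the cost RDD cannot kick the training data out of memory.
    def initialCosts(data: RDD[Vector], runs: Int): RDD[Array[Double]] =
      data.map(_ => Array.fill(runs)(Double.PositiveInfinity))
          .persist(StorageLevel.MEMORY_AND_DISK)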



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient

2015-08-27 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718039#comment-14718039
 ] 

hujiayin commented on SPARK-10329:
--

Hi Xiangrui,

I'll try to fix it in 1.6.

> Cost RDD in k-means|| initialization is not storage-efficient
> -
>
> Key: SPARK-10329
> URL: https://issues.apache.org/jira/browse/SPARK-10329
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Xiangrui Meng
>  Labels: clustering
>
> Currently we use `RDD[Vector]` to store point cost during k-means|| 
> initialization, where each `Vector` has size `runs`. This is not 
> storage-efficient because `runs` is usually 1 and then each record is a 
> Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
> introduce two objects (DenseVector and its values array), which could cost 16 
> bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
> for reporting this issue!
> There are several solutions:
> 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
> record.
> 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
> `Array[Double]` object covers 1024 instances, which could remove most of the 
> overhead.
> Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
> kicking out the training dataset from memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode

2015-08-10 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679818#comment-14679818
 ] 

hujiayin commented on SPARK-5837:
-

The master branch doesn't have this problem. Everything else works well except the 
web UI in the YARN view in the 1.2~1.3 releases. I found a PR, 
https://github.com/apache/spark/pull/2858, that tries to fix this issue and should 
be included since 1.2, but the problem still happens with 1.2 on my side. I think 
some other patch in 1.4 or 1.5 fixed this issue; do you know which patch?

> HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
> --
>
> Key: SPARK-5837
> URL: https://issues.apache.org/jira/browse/SPARK-5837
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Marco Capuccini
>
> Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the 
> Spark UI if I run over yarn (version 2.4.0):
> HTTP ERROR 500
> Problem accessing /proxy/application_1423564210894_0017/. Reason:
> Connection refused
> Caused by:
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:579)
>   at java.net.Socket.connect(Socket.java:528)
>   at java.net.Socket.(Socket.java:425)
>   at java.net.Socket.(Socket.java:280)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
>   at 
> org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.

[jira] [Issue Comment Deleted] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode

2015-08-10 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-5837:

Comment: was deleted

(was: The master branch doesn't have this problem. Everything else works well except 
the web UI in the YARN view in the 1.2~1.3 releases. I found a PR, 
https://github.com/apache/spark/pull/2858, that tries to fix this issue and should 
be included since 1.2, but the problem still happens with 1.2 on my side. I think 
some other patch in 1.4 or 1.5 fixed this issue; do you know which patch?)

> HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
> --
>
> Key: SPARK-5837
> URL: https://issues.apache.org/jira/browse/SPARK-5837
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Marco Capuccini
>
> Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the 
> Spark UI if I run over yarn (version 2.4.0):
> HTTP ERROR 500
> Problem accessing /proxy/application_1423564210894_0017/. Reason:
> Connection refused
> Caused by:
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:579)
>   at java.net.Socket.connect(Socket.java:528)
>   at java.net.Socket.(Socket.java:425)
>   at java.net.Socket.(Socket.java:280)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
>   at 
> org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.j

[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode

2015-08-10 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679817#comment-14679817
 ] 

hujiayin commented on SPARK-5837:
-

The master branch doesn't have this problem. Everything else works well except the 
web UI in the YARN view in the 1.2~1.3 releases. I found a PR, 
https://github.com/apache/spark/pull/2858, that tries to fix this issue and should 
be included since 1.2, but the problem still happens with 1.2 on my side. I think 
some other patch in 1.4 or 1.5 fixed this issue; do you know which patch?

> HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
> --
>
> Key: SPARK-5837
> URL: https://issues.apache.org/jira/browse/SPARK-5837
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Marco Capuccini
>
> Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the 
> Spark UI if I run over yarn (version 2.4.0):
> HTTP ERROR 500
> Problem accessing /proxy/application_1423564210894_0017/. Reason:
> Connection refused
> Caused by:
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:579)
>   at java.net.Socket.connect(Socket.java:528)
>   at java.net.Socket.(Socket.java:425)
>   at java.net.Socket.(Socket.java:280)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
>   at 
> org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.

[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode

2015-08-10 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679782#comment-14679782
 ] 

hujiayin commented on SPARK-5837:
-

Yes, I've tried all the settings above. The issue only happens with versions 
1.2 ~ 1.3. My Hadoop is hadoop-2.5.0-cdh5.3.2.


> HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
> --
>
> Key: SPARK-5837
> URL: https://issues.apache.org/jira/browse/SPARK-5837
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Marco Capuccini
>
> Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the 
> Spark UI if I run over yarn (version 2.4.0):
> HTTP ERROR 500
> Problem accessing /proxy/application_1423564210894_0017/. Reason:
> Connection refused
> Caused by:
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:579)
>   at java.net.Socket.connect(Socket.java:528)
>   at java.net.Socket.(Socket.java:425)
>   at java.net.Socket.(Socket.java:280)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
>   at 
> org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>   at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at 
> org.mortbay.jetty.servl

[jira] [Updated] (SPARK-9779) HTTP500 revisit when open web-UI in yarn-cluster yarn-client mode (1.2-1.3)

2015-08-10 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-9779:

Description: 
When I try to open the web UI from the AM in yarn-cluster mode, I hit the HTTP 500 
error described in https://issues.apache.org/jira/browse/SPARK-5837 while 
collecting stage logs for a feature in Spark 1.2 for a performance problem.

The error doesn't happen in the latest Spark build and only happens in 
Spark 1.2-1.3. The Hadoop/YARN related configuration is the same across the 
different Spark versions.  






  was:
When I try to open the web UI from the AM in yarn-cluster mode, I hit the HTTP 500 
error described in https://issues.apache.org/jira/browse/SPARK-5837 while 
collecting stage logs for a feature in Spark 1.2 to find a performance problem.

The error doesn't happen in the latest Spark build and only happens in 
Spark 1.2-1.3. The Hadoop/YARN related configuration is the same across the 
different Spark versions.  







> HTTP500 revisit when open web-UI in yarn-cluster yarn-client mode (1.2-1.3) 
> 
>
> Key: SPARK-9779
> URL: https://issues.apache.org/jira/browse/SPARK-9779
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0
> Environment: hadoop2.5.0-cdh5.3.2
>Reporter: hujiayin
> Fix For: 1.4.2
>
>
> When I try to open the web UI from the AM in yarn-cluster mode, I hit the 
> HTTP 500 error described in https://issues.apache.org/jira/browse/SPARK-5837 
> while collecting stage logs for a feature in Spark 1.2 for a performance problem.
> The error doesn't happen in the latest Spark build and only happens in 
> Spark 1.2-1.3. The Hadoop/YARN related configuration is the same across the 
> different Spark versions.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9779) HTTP500 revisit when open web-UI in yarn-cluster yarn-client mode (1.2-1.3)

2015-08-10 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-9779:

Description: 
When I try to open the web UI from the AM in yarn-cluster mode, I hit the HTTP 500 
error described in https://issues.apache.org/jira/browse/SPARK-5837 while 
collecting stage logs for a feature in Spark 1.2 to find a performance problem.

The error doesn't happen in the latest Spark build and only happens in 
Spark 1.2-1.3. The Hadoop/YARN related configuration is the same across the 
different Spark versions.  






  was:
When I try to open the web UI from the AM in yarn-cluster mode, I hit the HTTP 500 
error described in https://issues.apache.org/jira/browse/SPARK-5837 because I want 
to collect stage logs for a feature in Spark 1.2 to find a performance problem.

The error doesn't happen in the latest Spark build and only happens in 
Spark 1.2-1.3. The Hadoop/YARN related configuration is the same across the 
different Spark versions.  







> HTTP500 revisit when open web-UI in yarn-cluster yarn-client mode (1.2-1.3) 
> 
>
> Key: SPARK-9779
> URL: https://issues.apache.org/jira/browse/SPARK-9779
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0
> Environment: hadoop2.5.0-cdh5.3.2
>Reporter: hujiayin
> Fix For: 1.4.2
>
>
> When I try to open the web UI from the AM in yarn-cluster mode, I hit the 
> HTTP 500 error described in https://issues.apache.org/jira/browse/SPARK-5837 
> while collecting stage logs for a feature in Spark 1.2 to find a performance 
> problem.
> The error doesn't happen in the latest Spark build and only happens in 
> Spark 1.2-1.3. The Hadoop/YARN related configuration is the same across the 
> different Spark versions.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9779) HTTP500 revisit when open web-UI in yarn-cluster yarn-client mode (1.2-1.3)

2015-08-10 Thread hujiayin (JIRA)
hujiayin created SPARK-9779:
---

 Summary: HTTP500 revisit when open web-UI in yarn-cluster 
yarn-client mode (1.2-1.3) 
 Key: SPARK-9779
 URL: https://issues.apache.org/jira/browse/SPARK-9779
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0, 1.2.2, 1.2.1, 1.2.0
 Environment: hadoop2.5.0-cdh5.3.2
Reporter: hujiayin
 Fix For: 1.4.2


When I try to open the web UI from the AM in yarn-cluster mode, I hit the HTTP 500 
error described in https://issues.apache.org/jira/browse/SPARK-5837 because I want 
to collect stage logs for a feature in Spark 1.2 to find a performance problem.

The error doesn't happen in the latest Spark build and only happens in 
Spark 1.2-1.3. The Hadoop/YARN related configuration is the same across the 
different Spark versions.  








--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode

2015-08-10 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679730#comment-14679730
 ] 

hujiayin commented on SPARK-5837:
-

Hi Sean,

I hit this problem on Spark 1.2, 1.2.1, 1.2.2 and 1.3 again, but the problem doesn't 
happen on the latest code with the same Hadoop configuration. I started the 1.2 
history server. The Hadoop version is hadoop-2.5.0-cdh5.3.2. I ran into this again 
because I need the log of a feature in 1.2 for a performance issue, but I cannot 
get the stage log now. 

> HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
> --
>
> Key: SPARK-5837
> URL: https://issues.apache.org/jira/browse/SPARK-5837
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Marco Capuccini
>
> Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the 
> Spark UI if I run over yarn (version 2.4.0):
> HTTP ERROR 500
> Problem accessing /proxy/application_1423564210894_0017/. Reason:
> Connection refused
> Caused by:
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:579)
>   at java.net.Socket.connect(Socket.java:528)
>   at java.net.Socket.(Socket.java:425)
>   at java.net.Socket.(Socket.java:280)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
>   at 
> org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.do

[jira] [Commented] (SPARK-9129) Integrate convolutional deep belief networks for visual recognition tasks

2015-07-26 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642144#comment-14642144
 ] 

hujiayin commented on SPARK-9129:
-

ok : )

> Integrate convolutional deep belief networks for visual recognition tasks  
> ---
>
> Key: SPARK-9129
> URL: https://issues.apache.org/jira/browse/SPARK-9129
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: hujiayin
>  Labels: patch
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> There has been much interest in unsupervised learning of hierarchical 
> generative models such as deep belief networks. Scaling such models to 
> full-sized, high-dimensional images remains a difficult problem. Some users 
> complain about the performance and convergence speed of such models. Integrating 
> this would create a new use of Spark for vision-related tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9129) Integrate convolutional deep belief networks for visual recognition tasks

2015-07-16 Thread hujiayin (JIRA)
hujiayin created SPARK-9129:
---

 Summary: Integrate convolutional deep belief networks for visual 
recognition tasks  
 Key: SPARK-9129
 URL: https://issues.apache.org/jira/browse/SPARK-9129
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: hujiayin


There has been much interest in unsupervised learning of hierarchical 
generative models such as deep belief networks. Scaling such models to 
full-sized, high-dimensional images remains a difficult problem. Some users 
complain about the performance and convergence speed of such models. Integrating this 
would create a new use of Spark for vision-related tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth

2015-07-06 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616275#comment-14616275
 ] 

hujiayin commented on SPARK-6724:
-

ok, : )

> Model import/export for FPGrowth
> 
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6724) Model import/export for FPGrowth

2015-07-06 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-6724:

Comment: was deleted

(was: Can I take a look at this issue?)

> Model import/export for FPGrowth
> 
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth

2015-07-06 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616264#comment-14616264
 ] 

hujiayin commented on SPARK-6724:
-

Can I take a look at this issue?

> Model import/export for FPGrowth
> 
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark

2015-07-01 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611553#comment-14611553
 ] 

hujiayin commented on SPARK-5682:
-

Since encrypted shuffle in Spark focuses on the common module, it may not be a good 
idea to use the Hadoop API. On the other hand, the AES solution is a bit heavy for 
encoding/decoding live streaming data. 

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data process 
> safer. This feature is necessary in Spark. AES is a specification for the 
> encryption of electronic data; there are 5 common modes in AES, and CTR is one of 
> them. We use the two codecs JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, 
> which are also used in Hadoop encrypted shuffle, to enable Spark encrypted shuffle. 
> JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while 
> OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we first 
> enable encrypted shuffle on the spark-on-yarn framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark

2015-07-01 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611527#comment-14611527
 ] 

hujiayin edited comment on SPARK-5682 at 7/2/15 6:10 AM:
-

Steps were added to encode and decode the data, so the performance will not be 
faster than before. At the same time, the code also has a security issue, for 
example saving plain text in a configuration file that is finally used as part 
of the key.

If you use a better cipher solution, the performance downgrade will be minimized; 
I think AES is a bit heavy.

Also, the feature is based on Hadoop 2.6, which is a limitation; that is why I 
said it relies on Hadoop.

Though the API is public and stable, you cannot be sure the API will not change, 
since it is not commercial software.
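
On the key-handling point, one hedged sketch of an alternative to plain text in the 
configuration: Hadoop's Configuration.getPassword can resolve the key material from a 
configured credential provider instead of a cleartext property. The alias name below 
is purely illustrative, not a setting from the design documents.

    import org.apache.hadoop.conf.Configuration

    // Resolve key material through a credential provider when one is configured;
    // falls back to the config value, and returns null if the alias is absent.
    val conf = new Configuration()
    val keyChars: Array[Char] = conf.getPassword("spark.shuffle.crypto.key.alias")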



was (Author: hujiayin):
Steps were added to encode and decode the data, so the performance will not be 
faster than before. At the same time, the code also has a security issue, for 
example saving plain text in a configuration file that is finally used as part 
of the key.

Also, the feature is based on Hadoop 2.6, which is a limitation; that is why I 
said it relies on Hadoop.

Though the API is public and stable, you cannot be sure the API will not change, 
since it is not commercial software.


> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data process 
> safer. This feature is necessary in Spark. AES is a specification for the 
> encryption of electronic data; there are 5 common modes in AES, and CTR is one of 
> them. We use the two codecs JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, 
> which are also used in Hadoop encrypted shuffle, to enable Spark encrypted shuffle. 
> JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while 
> OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we first 
> enable encrypted shuffle on the spark-on-yarn framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark

2015-07-01 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611527#comment-14611527
 ] 

hujiayin edited comment on SPARK-5682 at 7/2/15 6:03 AM:
-

Steps were added to encode and decode the data, so the performance will not be 
faster than before. At the same time, the code also has a security issue, for 
example saving plain text in a configuration file that is finally used as part 
of the key.

Also, the feature is based on Hadoop 2.6, which is a limitation; that is why I 
said it relies on Hadoop.

Though the API is public and stable, you cannot be sure the API will not change, 
since it is not commercial software.



was (Author: hujiayin):
Steps were added to encode and decode the data, so the performance will not be 
faster than before. At the same time, the code also has a security issue, for 
example saving plain text in a configuration file that is finally used as part 
of the key.

Also, the feature is based on Hadoop 2.6, which is a limitation; that is why I 
said it relies on Hadoop.

Though the API is public and stable, you cannot be sure the API will not change, 
since it is not commercial software.


> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data process 
> safer. This feature is necessary in Spark. AES is a specification for the 
> encryption of electronic data; there are 5 common modes in AES, and CTR is one of 
> them. We use the two codecs JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, 
> which are also used in Hadoop encrypted shuffle, to enable Spark encrypted shuffle. 
> JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while 
> OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we first 
> enable encrypted shuffle on the spark-on-yarn framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark

2015-07-01 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611527#comment-14611527
 ] 

hujiayin edited comment on SPARK-5682 at 7/2/15 6:02 AM:
-

Steps were added to encode and decode the data, so the performance will not be 
faster than before. At the same time, the code also has a security issue: for 
example, the plain text is saved in a configuration file and finally used as 
part of the key.

In addition, the feature is based on Hadoop 2.6, which is a limitation; that is 
why I said it relies on Hadoop.

Though the API is marked public and stable, you cannot guarantee that the API 
will never be changed, since it is not commercial software.



was (Author: hujiayin):
Steps were added to encode and decode the data, so the performance will not be 
faster than before. At the same time, the code also has a security issue: for 
example, the plain text is saved in a configuration file and finally used as 
part of the key.

In addition, the feature is based on Hadoop 2.6, which is a limitation; that is 
why I said it relies on Hadoop.

Though the API is marked public and stable, you cannot guarantee that the API 
will never be changed, since it is not commercial software.


> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of 
> shuffling data safer. This feature is necessary in Spark. AES is a 
> specification for the encryption of electronic data, and it has five common 
> modes of operation; CTR is one of them. We use two codecs, 
> JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, which are also used in 
> Hadoop's encrypted shuffle, to enable Spark encrypted shuffle. 
> JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while 
> OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark

2015-07-01 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611527#comment-14611527
 ] 

hujiayin commented on SPARK-5682:
-

Steps were added to encode and decode the data, so the performance will not be 
faster than before. At the same time, the code also has a security issue: for 
example, the plain text is saved in a configuration file and finally used as 
part of the key.

In addition, the feature is based on Hadoop 2.6, which is a limitation; that is 
why I said it relies on Hadoop.

Though the API is marked public and stable, you cannot guarantee that the API 
will never be changed, since it is not commercial software.


> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of 
> shuffling data safer. This feature is necessary in Spark. AES is a 
> specification for the encryption of electronic data, and it has five common 
> modes of operation; CTR is one of them. We use two codecs, 
> JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, which are also used in 
> Hadoop's encrypted shuffle, to enable Spark encrypted shuffle. 
> JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while 
> OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark

2015-07-01 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611328#comment-14611328
 ] 

hujiayin commented on SPARK-5682:
-

The solution relies on the Hadoop API and may downgrade performance. The AES 
algorithm is used for block data encryption in many cases. I think RC4 could be 
used to encrypt the stream, or a simple solution with an authentication header 
could be used. :)
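
[Editor's note: for reference, the stream-cipher alternative suggested here 
could look roughly like the following JCE-based sketch ("ARCFOUR" is the JCE 
name for RC4). It only illustrates the suggestion; it is not an actual Spark 
API, and class and value names are invented.]

{code:scala}
import java.io.ByteArrayOutputStream
import javax.crypto.{Cipher, CipherOutputStream, KeyGenerator}

object Rc4StreamSketch {
  def main(args: Array[String]): Unit = {
    // "ARCFOUR" is the JCE algorithm name for the RC4 stream cipher.
    // Note: RC4 is considered weak by modern standards; shown only to
    // illustrate the comment, not as a recommendation.
    val keyGen = KeyGenerator.getInstance("ARCFOUR")
    keyGen.init(128)
    val key = keyGen.generateKey()

    val cipher = Cipher.getInstance("ARCFOUR")
    cipher.init(Cipher.ENCRYPT_MODE, key) // stream cipher: no IV, no padding

    // Wrap the "shuffle" output stream so bytes are encrypted as they are written.
    val sink = new ByteArrayOutputStream()
    val out = new CipherOutputStream(sink, cipher)
    out.write("shuffle stream bytes".getBytes("UTF-8"))
    out.close()
    println(s"ciphertext bytes: ${sink.size()}")
  }
}
{code}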

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of 
> shuffling data safer. This feature is necessary in Spark. AES is a 
> specification for the encryption of electronic data, and it has five common 
> modes of operation; CTR is one of them. We use two codecs, 
> JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, which are also used in 
> Hadoop's encrypted shuffle, to enable Spark encrypted shuffle. 
> JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while 
> OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-8370) Add API for data sources to register databases

2015-06-16 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-8370:

Comment: was deleted

(was: Does the data source have a size limitation?)

> Add API for data sources to register databases
> --
>
> Key: SPARK-8370
> URL: https://issues.apache.org/jira/browse/SPARK-8370
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Santiago M. Mola
>
> This API would allow registering a database with a data source instead of 
> just a table. Registering a data source database would register all its 
> tables and keep the catalog updated. The catalog could delegate table lookups 
> to the data source for any database registered with this API.
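
[Editor's note: a rough sketch of what such an interface might look like, 
purely hypothetical; the trait and method names are invented for illustration 
and are not an actual Spark SQL API.]

{code:scala}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation

// Hypothetical provider interface: a data source that exposes a whole
// database (a set of named tables) instead of a single table.
trait DatabaseRelationProvider {
  /** Return every table in the external database, keyed by table name. */
  def createRelations(
      sqlContext: SQLContext,
      parameters: Map[String, String]): Map[String, BaseRelation]
}

// Hypothetical catalog hook: the catalog could delegate lookups for a
// registered database to the provider and stay up to date automatically.
trait ExternalDatabaseCatalog {
  def registerDatabase(
      name: String,
      provider: DatabaseRelationProvider,
      options: Map[String, String]): Unit

  def lookupTable(database: String, table: String): Option[BaseRelation]
}
{code}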



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8370) Add API for data sources to register databases

2015-06-16 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589242#comment-14589242
 ] 

hujiayin commented on SPARK-8370:
-

Does the data source have a size limitation?

> Add API for data sources to register databases
> --
>
> Key: SPARK-8370
> URL: https://issues.apache.org/jira/browse/SPARK-8370
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Santiago M. Mola
>
> This API would allow registering a database with a data source instead of 
> just a table. Registering a data source database would register all its 
> tables and keep the catalog updated. The catalog could delegate table lookups 
> to the data source for any database registered with this API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org