[jira] [Updated] (SPARK-12153) Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible

2016-02-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12153:
--
Assignee: YongGang Cao

> Word2Vec uses a fixed length for sentences which is not reasonable for 
> reality, and similarity functions and fields are not accessible
> --
>
> Key: SPARK-12153
> URL: https://issues.apache.org/jira/browse/SPARK-12153
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: YongGang Cao
>Assignee: YongGang Cao
>Priority: Minor
>
> Sentence boundaries matter for the sliding window: we shouldn't train the model on a
> window that spans sentences.
> The current hard split of sentences at 1000 words doesn't really make sense,
> and it is not consistent with either the original C version or other implementations
> such as deeplearning4j.
> The max sentence length is fixed and not tunable; I made it tunable as well.
> I made changes to address the above issues.
> Here is the pull request: https://github.com/apache/spark/pull/10152
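
To illustrate the point of the description above, here is a minimal Python sketch of context-window generation that never crosses a sentence boundary and takes the maximum sentence length as a parameter. It is not the code from the pull request and not Spark's implementation: the names `split_long`, `context_pairs`, and `max_sentence_length` are hypothetical, the input is assumed to be already tokenized into sentences, and the window size is fixed rather than randomly shrunk as word2vec implementations commonly do.

```python
from typing import Iterable, Iterator, List, Tuple


def split_long(sentence: List[str], max_sentence_length: int) -> Iterator[List[str]]:
    """Yield chunks of at most max_sentence_length words from one sentence."""
    if max_sentence_length < 1:
        raise ValueError("max_sentence_length must be positive")
    for start in range(0, len(sentence), max_sentence_length):
        yield sentence[start:start + max_sentence_length]


def context_pairs(
    sentences: Iterable[List[str]],
    window: int,
    max_sentence_length: int = 1000,  # assumed default, mirroring the 1000-word split
) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs; a window never spans two sentences."""
    for sentence in sentences:
        # Only overly long sentences are split; short ones are never merged.
        for chunk in split_long(sentence, max_sentence_length):
            for i, center in enumerate(chunk):
                lo = max(0, i - window)
                hi = min(len(chunk), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        yield center, chunk[j]
```

With two sentences `["a", "b"]` and `["c", "d"]` and a window of 2, no pair mixes words from both sentences, whereas concatenating the text and cutting it every N words would produce pairs such as `("b", "c")`.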



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12153) Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible

2015-12-18 Thread YongGang Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YongGang Cao updated SPARK-12153:
-
Issue Type: Bug  (was: Improvement)

> Word2Vec uses a fixed length for sentences which is not reasonable for 
> reality, and similarity functions and fields are not accessible
> --
>
> Key: SPARK-12153
> URL: https://issues.apache.org/jira/browse/SPARK-12153
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: YongGang Cao
>Priority: Minor
>
> Sentence boundaries matter for the sliding window: we shouldn't train the model on a
> window that spans sentences. The current hard split of sentences at 100 words
> doesn't really make sense.
> Also, the cosinesimilarity function is private, which makes it useless to callers.
> We may need to access the vocabulary and wordindex table as well; those need
> getters.
> I made changes to address the above issues.
> Here is the pull request: https://github.com/apache/spark/pull/10152
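
For reference, the similarity function mentioned in the description above computes the standard cosine similarity between two word vectors. The following is a minimal Python sketch, independent of Spark's implementation; the name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions made for this illustration.

```python
import math
from typing import Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Return the cosine of the angle between v1 and v2."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption for this sketch: treat a zero vector as dissimilar to everything.
        return 0.0
    return dot / (norm1 * norm2)
```

A caller who can reach the vocabulary and the word vectors can then compare any two words directly, which is the use case the request for public access has in mind.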






[jira] [Updated] (SPARK-12153) Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible

2015-12-18 Thread YongGang Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YongGang Cao updated SPARK-12153:
-
Description: 
Sentence boundaries matter for the sliding window: we shouldn't train the model on a
window that spans sentences.
The current hard split of sentences at 1000 words doesn't really make sense,
and it is not consistent with either the original C version or other implementations
such as deeplearning4j.
The max sentence length is fixed and not tunable; I made it tunable as well.

I made changes to address the above issues.
Here is the pull request: https://github.com/apache/spark/pull/10152

  was:
Sentence boundaries matter for the sliding window: we shouldn't train the model on a
window that spans sentences. The current hard split of sentences at 100 words
doesn't really make sense.
Also, the cosinesimilarity function is private, which makes it useless to callers.
We may need to access the vocabulary and wordindex table as well; those need
getters.

I made changes to address the above issues.
Here is the pull request: https://github.com/apache/spark/pull/10152


> Word2Vec uses a fixed length for sentences which is not reasonable for 
> reality, and similarity functions and fields are not accessible
> --
>
> Key: SPARK-12153
> URL: https://issues.apache.org/jira/browse/SPARK-12153
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: YongGang Cao
>Priority: Minor
>
> Sentence boundaries matter for the sliding window: we shouldn't train the model on a
> window that spans sentences.
> The current hard split of sentences at 1000 words doesn't really make sense,
> and it is not consistent with either the original C version or other implementations
> such as deeplearning4j.
> The max sentence length is fixed and not tunable; I made it tunable as well.
> I made changes to address the above issues.
> Here is the pull request: https://github.com/apache/spark/pull/10152






[jira] [Updated] (SPARK-12153) Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible

2015-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12153:
--
  Labels:   (was: patch)
Priority: Minor  (was: Major)

(I don't think this can be considered major)

> Word2Vec uses a fixed length for sentences which is not reasonable for 
> reality, and similarity functions and fields are not accessible
> --
>
> Key: SPARK-12153
> URL: https://issues.apache.org/jira/browse/SPARK-12153
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: YongGang Cao
>Priority: Minor
>
> Sentence boundaries matter for the sliding window: we shouldn't train the model on a
> window that spans sentences. The current hard split of sentences at 100 words
> doesn't really make sense.
> Also, the cosinesimilarity function is private, which makes it useless to callers.
> We may need to access the vocabulary and wordindex table as well; those need
> getters.
> I made changes to address the above issues.
> Here is the pull request: https://github.com/apache/spark/pull/10152






[jira] [Updated] (SPARK-12153) Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible

2015-12-05 Thread YongGang Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YongGang Cao updated SPARK-12153:
-
Priority: Major  (was: Minor)

> Word2Vec uses a fixed length for sentences which is not reasonable for 
> reality, and similarity functions and fields are not accessible
> --
>
> Key: SPARK-12153
> URL: https://issues.apache.org/jira/browse/SPARK-12153
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: YongGang Cao
>  Labels: patch
>
> Sentence boundaries matter for the sliding window: we shouldn't train the model on a
> window that spans sentences. The current hard split of sentences at 100 words
> doesn't really make sense.
> Also, the cosinesimilarity function is private, which makes it useless to callers.
> We may need to access the vocabulary and wordindex table as well; those need
> getters.
> I made changes to address the above issues.
> Here is the pull request: https://github.com/apache/spark/pull/10152






[jira] [Updated] (SPARK-12153) Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible

2015-12-04 Thread YongGang Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YongGang Cao updated SPARK-12153:
-
Description: 
Sentence boundaries matter for the sliding window: we shouldn't train the model on a
window that spans sentences. The current hard split of sentences at 100 words
doesn't really make sense.
Also, the cosinesimilarity function is private, which makes it useless to callers.
We may need to access the vocabulary and wordindex table as well; those need
getters.

I made changes to address the above issues.
Here is the pull request: https://github.com/apache/spark/pull/10152

  was:
Sentence boundaries matter for the sliding window: we shouldn't train the model on a
window that spans sentences. The current hard split of sentences at 100 words
doesn't really make sense.
Also, the cosinesimilarity function is private, which makes it useless to callers.
We may need to access the vocabulary and wordindex table as well; those need
getters.

I made changes to address the above issues and will send out a pull request for your
review.


> Word2Vec uses a fixed length for sentences which is not reasonable for 
> reality, and similarity functions and fields are not accessible
> --
>
> Key: SPARK-12153
> URL: https://issues.apache.org/jira/browse/SPARK-12153
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: YongGang Cao
>Priority: Minor
>  Labels: patch
>
> Sentence boundaries matter for the sliding window: we shouldn't train the model on a
> window that spans sentences. The current hard split of sentences at 100 words
> doesn't really make sense.
> Also, the cosinesimilarity function is private, which makes it useless to callers.
> We may need to access the vocabulary and wordindex table as well; those need
> getters.
> I made changes to address the above issues.
> Here is the pull request: https://github.com/apache/spark/pull/10152


