[jira] [Updated] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS

2015-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4231:
-
Assignee: Debasish Das

> Add RankingMetrics to examples.MovieLensALS
> ---
>
> Key: SPARK-4231
> URL: https://issues.apache.org/jira/browse/SPARK-4231
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.4.0
>Reporter: Debasish Das
>Assignee: Debasish Das
>Priority: Minor
> Fix For: 1.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> examples.MovieLensALS computes RMSE for the MovieLens dataset, but after the 
> addition of RankingMetrics and the enhancements to ALS it is important to look 
> not only at RMSE but also at ranking measures such as precision@k and MAP.
> In this JIRA we add RMSE and MAP computation to examples.MovieLensALS and also 
> add a flag indicating whether user or product recommendations are being 
> validated.
>  
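For reference, a minimal sketch of how prec@k and MAP can be computed with 
RankingMetrics (the data below is made up; the real example derives the ranked 
recommendations and the relevant items from the MovieLens ratings):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.evaluation.RankingMetrics

object RankingMetricsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RankingMetricsSketch").setMaster("local[*]"))
    // For each user: (recommended product ids in ranked order, products the user actually liked).
    val predictionAndLabels = sc.parallelize(Seq(
      (Array(1, 2, 3, 4, 5), Array(1, 3, 5)),
      (Array(6, 7, 8, 9, 10), Array(7, 11))
    ))
    val metrics = new RankingMetrics(predictionAndLabels)
    println(s"precision@3 = ${metrics.precisionAt(3)}")  // prec@k
    println(s"MAP = ${metrics.meanAveragePrecision}")    // mean average precision
    sc.stop()
  }
}
{code}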






[jira] [Updated] (SPARK-12143) When the column type is binary, select throws a ClassCastException in Beeline.

2015-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12143:
--
Component/s: SQL

[~meiyoula] please set the component and read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Please edit the description too; as written, it doesn't sound like it's even 
Spark related.

> When the column type is binary, select throws a ClassCastException in Beeline.
> ---
>
> Key: SPARK-12143
> URL: https://issues.apache.org/jira/browse/SPARK-12143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: meiyoula
>
> In Beeline, execute the following SQL:
> 1. create table bb(bi binary);
> 2. load data inpath 'tmp/data' into table bb;
> 3. select * from bb;
> Error: java.lang.ClassCastException: java.lang.String cannot be cast to [B 
> (state=, code=0)






[jira] [Updated] (SPARK-12156) SPARK_EXECUTOR_INSTANCES is not effective

2015-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12156:
--
Target Version/s:   (was: 1.6.0)
Priority: Minor  (was: Major)
   Fix Version/s: (was: 1.6.0)

[~KaiXinXIaoLei] don't set target/fix version, and set priority appropriately. 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> SPARK_EXECUTOR_INSTANCES  is not effective
> --
>
> Key: SPARK-12156
> URL: https://issues.apache.org/jira/browse/SPARK-12156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: KaiXinXIaoLei
>Priority: Minor
>
> I set SPARK_EXECUTOR_INSTANCES=3, but only two executors start.
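For comparison, a minimal sketch of requesting the executor count through 
configuration instead of the environment variable (spark.executor.instances is 
the property commonly used for this in YARN mode; the app name is illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Request three executors explicitly via configuration.
val conf = new SparkConf()
  .setAppName("ExecutorCountSketch")
  .set("spark.executor.instances", "3")
val sc = new SparkContext(conf)
{code}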






[jira] [Updated] (SPARK-12164) [SQL] Display the binary/encoded values

2015-12-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12164:

Description: 
So far, we are using comma-separated decimal format to output the encoded 
contents. This way is rare when the data is in binary. This could be a common 
issue when we use Dataset API. For example, 

For example, 
[
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
KryoClassData("c", 3)).toDS()
ds.show(20, false);
]

  was:
So far, we are using comma-separated decimal format to output the encoded 
contents. This way is rare when the data is in binary. This could be a common 
issue when we use Dataset API. For example, 

For example, 
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
KryoClassData("c", 3)).toDS()
ds.show(20, false);


> [SQL] Display the binary/encoded values
> ---
>
> Key: SPARK-12164
> URL: https://issues.apache.org/jira/browse/SPARK-12164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, we use a comma-separated decimal format to output the encoded 
> contents. That format is hard to read when the data is binary, which can be a 
> common issue when we use the Dataset API. For example: 
> [
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
> ]






[jira] [Updated] (SPARK-12164) [SQL] Display the binary/encoded values

2015-12-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12164:

Description: 
So far, we are using comma-separated decimal format to output the encoded 
contents. This way is rare when the data is in binary. This could be a common 
issue when we use Dataset API. For example, 

For example, 
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
KryoClassData("c", 3)).toDS()
ds.show(20, false);

  was:
So far, we are using decimal format to output the encoded contents. This way is 
rare when the data is in binary. 

For example, 
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
KryoClassData("c", 3)).toDS()
ds.show(20, false);


> [SQL] Display the binary/encoded values
> ---
>
> Key: SPARK-12164
> URL: https://issues.apache.org/jira/browse/SPARK-12164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, we use a comma-separated decimal format to output the encoded 
> contents. That format is hard to read when the data is binary, which can be a 
> common issue when we use the Dataset API. For example: 
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);






[jira] [Updated] (SPARK-12164) [SQL] Display the binary/encoded values

2015-12-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12164:

Description: 
So far, we are using comma-separated decimal format to output the encoded 
contents. This way is rare when the data is in binary. This could be a common 
issue when we use Dataset API. For example, 

For example, 
{
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
KryoClassData("c", 3)).toDS()
ds.show(20, false);
}

  was:
So far, we are using comma-separated decimal format to output the encoded 
contents. This way is rare when the data is in binary. This could be a common 
issue when we use Dataset API. For example, 

For example, 
[
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
KryoClassData("c", 3)).toDS()
ds.show(20, false);
]


> [SQL] Display the binary/encoded values
> ---
>
> Key: SPARK-12164
> URL: https://issues.apache.org/jira/browse/SPARK-12164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, we use a comma-separated decimal format to output the encoded 
> contents. That format is hard to read when the data is binary, which can be a 
> common issue when we use the Dataset API. For example: 
> {
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
> }






[jira] [Updated] (SPARK-12164) [SQL] Display the binary/encoded values

2015-12-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12164:

Description: 
So far, we are using comma-separated decimal format to output the encoded 
contents. This way is rare when the data is in binary. This could be a common 
issue when we use Dataset API. 

For example, 
{code}
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
KryoClassData("c", 3)).toDS()
ds.show(20, false);
{code}

  was:
So far, we are using comma-separated decimal format to output the encoded 
contents. This way is rare when the data is in binary. This could be a common 
issue when we use Dataset API. For example, 

For example, 
{code}
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
KryoClassData("c", 3)).toDS()
ds.show(20, false);
{code}


> [SQL] Display the binary/encoded values
> ---
>
> Key: SPARK-12164
> URL: https://issues.apache.org/jira/browse/SPARK-12164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, we use a comma-separated decimal format to output the encoded 
> contents. That format is hard to read when the data is binary, which can be a 
> common issue when we use the Dataset API. 
> For example, 
> {code}
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
> {code}
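To make the readability concern concrete, a standalone sketch (not the actual 
Spark change) comparing the current comma-separated decimal rendering of a byte 
array with a hex rendering:

{code}
// Hypothetical illustration: the same encoded bytes in the two formats.
val bytes: Array[Byte] = Array(1, 2, 3, -1)
val asDecimal = bytes.mkString(", ")                    // "1, 2, 3, -1"  (current style)
val asHex = bytes.map("%02X".format(_)).mkString(" ")   // "01 02 03 FF"  (easier to scan)
println(s"decimal: $asDecimal")
println(s"hex:     $asHex")
{code}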






[jira] [Updated] (SPARK-12164) [SQL] Display the binary/encoded values

2015-12-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12164:

Description: 
So far, we are using comma-separated decimal format to output the encoded 
contents. This way is rare when the data is in binary. This could be a common 
issue when we use Dataset API. For example, 

For example, 
{code}
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
KryoClassData("c", 3)).toDS()
ds.show(20, false);
{code}

  was:
So far, we are using comma-separated decimal format to output the encoded 
contents. This way is rare when the data is in binary. This could be a common 
issue when we use Dataset API. For example, 

For example, 
{
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
KryoClassData("c", 3)).toDS()
ds.show(20, false);
}


> [SQL] Display the binary/encoded values
> ---
>
> Key: SPARK-12164
> URL: https://issues.apache.org/jira/browse/SPARK-12164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, we use a comma-separated decimal format to output the encoded 
> contents. That format is hard to read when the data is binary, which can be a 
> common issue when we use the Dataset API. For example: 
> {code}
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
> {code}






[jira] [Updated] (SPARK-12125) pull out nondeterministic expressions from Join

2015-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12125:
--
Component/s: SQL

> pull out nondeterministic expressions from Join
> ---
>
> Key: SPARK-12125
> URL: https://issues.apache.org/jira/browse/SPARK-12125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: iward
>Priority: Minor
>
> Currently, *nondeterministic expressions* are only allowed in *Project* or 
> *Filter*, and they can only be pulled out when they appear in a *UnaryNode*.
> However, in many cases we use nondeterministic expressions on *join keys* to 
> avoid data skew, for example:
> {noformat}
> select * 
> from tableA a 
> join 
> (select * from tableB) b 
> on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( 
> (-rand() * 1000 ) as string ) else a.brand_code end ))  = b.brand_code
> {noformat}
> This PR introduces a mechanism to pull nondeterministic expressions out of a 
> *Join*, so that nondeterministic expressions can be used in a *Join* 
> appropriately.
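For illustration, a minimal sketch of the same salting idea in the DataFrame API 
(tableA, tableB and the column names are taken from the SQL above; this is a 
sketch, not the patch itself): null or empty join keys are salted with a random 
value so the skewed keys spread across partitions.

{code}
import org.apache.spark.sql.functions._

// Salt null/empty brand_code values with a random string so they no longer
// all hash to the same partition, then join on the salted key.
val salted = tableA.withColumn("join_key",
  when(col("brand_code").isNull || col("brand_code") === "",
       (-rand() * 1000).cast("string"))
    .otherwise(upper(col("brand_code"))))

val joined = salted.join(tableB, salted("join_key") === tableB("brand_code"))
{code}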






[jira] [Updated] (SPARK-12060) Avoid memory copy in JavaSerializerInstance.serialize

2015-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12060:
--
Fix Version/s: (was: 1.6.0)

> Avoid memory copy in JavaSerializerInstance.serialize
> -
>
> Key: SPARK-12060
> URL: https://issues.apache.org/jira/browse/SPARK-12060
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
>
> JavaSerializerInstance.serialize uses ByteArrayOutputStream.toByteArray to 
> get the serialized data. ByteArrayOutputStream.toByteArray has to copy the 
> content of its internal array into a new array. However, since that array is 
> immediately converted to a ByteBuffer, the memory copy can be avoided.
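A minimal sketch of one way to avoid the copy (an illustrative subclass, not 
necessarily what the eventual pull request does): expose the internal buffer so 
it can be wrapped in a ByteBuffer directly.

{code}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

// ByteArrayOutputStream keeps its data in the protected fields `buf` and `count`;
// a subclass can wrap them in a ByteBuffer without the extra copy made by toByteArray().
class ExposedByteArrayOutputStream(initialSize: Int) extends ByteArrayOutputStream(initialSize) {
  def toByteBuffer: ByteBuffer = ByteBuffer.wrap(buf, 0, count)
}
{code}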






[jira] [Assigned] (SPARK-12159) Add user guide section for IndexToString transformer

2015-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12159:


Assignee: Apache Spark

> Add user guide section for IndexToString transformer
> 
>
> Key: SPARK-12159
> URL: https://issues.apache.org/jira/browse/SPARK-12159
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Add a user guide section for the IndexToString transformer as reported on the 
> mailing list ( 
> https://www.mail-archive.com/dev@spark.apache.org/msg12263.html )
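For context, a minimal sketch of what such a guide section might show (df and 
the column names are assumed): StringIndexer maps string labels to indices, and 
IndexToString maps the indices back to the original labels.

{code}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// Index the string column, then convert the indices back to the original strings.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val indexed = indexer.fit(df).transform(df)

val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")
val converted = converter.transform(indexed)
converted.select("category", "categoryIndex", "originalCategory").show()
{code}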






[jira] [Assigned] (SPARK-12159) Add user guide section for IndexToString transformer

2015-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12159:


Assignee: (was: Apache Spark)

> Add user guide section for IndexToString transformer
> 
>
> Key: SPARK-12159
> URL: https://issues.apache.org/jira/browse/SPARK-12159
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add a user guide section for the IndexToString transformer as reported on the 
> mailing list ( 
> https://www.mail-archive.com/dev@spark.apache.org/msg12263.html )






[jira] [Commented] (SPARK-12159) Add user guide section for IndexToString transformer

2015-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043995#comment-15043995
 ] 

Apache Spark commented on SPARK-12159:
--

User 'BenFradet' has created a pull request for this issue:
https://github.com/apache/spark/pull/10166

> Add user guide section for IndexToString transformer
> 
>
> Key: SPARK-12159
> URL: https://issues.apache.org/jira/browse/SPARK-12159
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add a user guide section for the IndexToString transformer as reported on the 
> mailing list ( 
> https://www.mail-archive.com/dev@spark.apache.org/msg12263.html )






[jira] [Assigned] (SPARK-12164) [SQL] Display the binary/encoded values

2015-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12164:


Assignee: (was: Apache Spark)

> [SQL] Display the binary/encoded values
> ---
>
> Key: SPARK-12164
> URL: https://issues.apache.org/jira/browse/SPARK-12164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, we use a decimal format to output the encoded contents. That format 
> is hard to read when the data is binary. 
> For example, 
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);






[jira] [Commented] (SPARK-12164) [SQL] Display the binary/encoded values

2015-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043983#comment-15043983
 ] 

Apache Spark commented on SPARK-12164:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10165

> [SQL] Display the binary/encoded values
> ---
>
> Key: SPARK-12164
> URL: https://issues.apache.org/jira/browse/SPARK-12164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, we use a decimal format to output the encoded contents. That format 
> is hard to read when the data is binary. 
> For example, 
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);






[jira] [Assigned] (SPARK-12164) [SQL] Display the binary/encoded values

2015-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12164:


Assignee: Apache Spark

> [SQL] Display the binary/encoded values
> ---
>
> Key: SPARK-12164
> URL: https://issues.apache.org/jira/browse/SPARK-12164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> So far, we use a decimal format to output the encoded contents. That format 
> is hard to read when the data is binary. 
> For example, 
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);






[jira] [Updated] (SPARK-12153) Word2Vec uses a fixed sentence length, which is unrealistic, and its similarity functions and fields are not accessible

2015-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12153:
--
  Labels:   (was: patch)
Priority: Minor  (was: Major)

(I don't think this can be considered major)

> Word2Vec uses a fixed sentence length, which is unrealistic, and its 
> similarity functions and fields are not accessible
> --
>
> Key: SPARK-12153
> URL: https://issues.apache.org/jira/browse/SPARK-12153
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: YongGang Cao
>Priority: Minor
>
> Sentence boundaries matter for the sliding window: we shouldn't train the model 
> on a window that crosses sentences, so the current hard split of sentences at 
> 100 words doesn't really make sense.
> Also, the cosineSimilarity function is private, which makes it useless to 
> callers. We may need access to the vocabulary and word-index tables as well; 
> those need getters.
> I made changes to address the issues above.
> Here is the pull request: https://github.com/apache/spark/pull/10152
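For reference, a minimal sketch of the public Word2VecModel surface as of 1.5 
(the tokenized input RDD is assumed): findSynonyms and getVectors are 
accessible, while the cosine-similarity helper and the vocabulary/word-index 
tables mentioned above are not.

{code}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD

// `sentences` is assumed to be tokenized text, one Seq[String] per sentence.
def sketch(sentences: RDD[Seq[String]]): Unit = {
  val model: Word2VecModel = new Word2Vec().setVectorSize(100).fit(sentences)
  val synonyms = model.findSynonyms("spark", 5)  // Array[(String, Double)] of (word, similarity)
  val vectors  = model.getVectors                // Map[String, Array[Float]] of word -> vector
  synonyms.foreach { case (word, sim) => println(s"$word\t$sim") }
}
{code}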






[jira] [Commented] (SPARK-12162) Embedded Spark on a JBoss server crashes due to System.exit when SparkUncaughtExceptionHandler is called

2015-12-06 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043782#comment-15043782
 ] 

Sasi commented on SPARK-12162:
--

Hey,
Thanks for the quick response. 
On my JBoss server I'm only creating new SparkContext(sparkConf) and new 
SQLContext(sparkContext).
I have another machine that runs my workers, and I'm running the master on 
that same machine.

Is that the right way, or am I missing something?

Thanks a lot!
Sasi


> Embedded Spark on a JBoss server crashes due to System.exit when 
> SparkUncaughtExceptionHandler is called
> ---
>
> Key: SPARK-12162
> URL: https://issues.apache.org/jira/browse/SPARK-12162
> Project: Spark
>  Issue Type: Bug
>Reporter: Sasi
>Priority: Critical
>
> Hello,
> I'm running Spark on JBoss and sometimes I get the following exception:
> {code}
> ERROR : (org.apache.spark.util.SparkUncaughtExceptionHandler:96) 
> -[appclient-registration-retry-thread] Uncaught exception in thread 
> Thread[appclient-registration-retry-thread,5,jboss]
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.FutureTask@4e33f83e rejected from 
> java.util.concurrent.ThreadPoolExecutor@35eed68e[Running, pool size = 1, 
> active threads = 0, queued tasks = 0, completed tasks = 3]
> {code}
> Then my JBoss crashed, so I took a look at the source of 
> SparkUncaughtExceptionHandler and noticed that when an uncaught exception 
> occurs it calls System.exit(SparkExitCode.UNCAUGHT_EXCEPTION).
> [https://github.com/apache/spark/blob/3bd77b213a9cd177c3ea3c61d37e5098e55f75a5/core/src/main/scala/org/apache/spark/util/SparkUncaughtExceptionHandler.scala]
> Since System.exit(...) is called, my JBoss server crashes.
> Is there any workaround or fix that can help me?
> Thanks,
> Sasi






[jira] [Updated] (SPARK-9603) Re-enable complex R package test in SparkSubmitSuite

2015-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9603:
-
Fix Version/s: (was: 1.6.0)

> Re-enable complex R package test in SparkSubmitSuite
> 
>
> Key: SPARK-9603
> URL: https://issues.apache.org/jira/browse/SPARK-9603
> Project: Spark
>  Issue Type: Test
>  Components: Deploy, SparkR, Tests
>Affects Versions: 1.5.0
>Reporter: Burak Yavuz
>Assignee: Sun Rui
>
> For building complex Spark Packages that contain R code in addition to Scala, 
> we have a complex procedure in which the R source code is shipped inside a jar. 
> The source code is extracted, built, and added as a library alongside SparkR.
> The end-to-end test in SparkSubmitSuite ("correctly builds R packages 
> included in a jar with --packages") can't run on Jenkins now, because the 
> pull request builder is not built with SparkR. Once the PR Builder is built 
> with SparkR, we should re-enable the test.






[jira] [Updated] (SPARK-12163) FPGrowth unusable on some datasets without extensive tweaking of the support threshold

2015-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12163:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> FPGrowth unusable on some datasets without extensive tweaking of the support 
> threshold
> --
>
> Key: SPARK-12163
> URL: https://issues.apache.org/jira/browse/SPARK-12163
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jaroslav Kuchar
>Priority: Minor
>
> This problem occurs on standard machine learning UCI datasets. 
> Details for the "audiology" dataset follow: it contains only 226 transactions 
> and 70 attributes. Mining frequent itemsets with a support threshold of 0.95 
> produces 73,162,705 itemsets; with a support of 0.94, 366,880,771 itemsets.
> More details about the experiment: 
> https://gist.github.com/jaroslav-kuchar/edbcbe72c5a884136db1
> The number of generated itemsets grows rapidly with the number of unique 
> items per transaction. Given the combinatorial explosion, this can lead to 
> CPU-intensive, long-running tasks for many settings of the support threshold. 
> The extensive tweaking of the support threshold this requires makes the 
> FPGrowth implementation unusable even for a small dataset.
> It would be useful to implement additional stopping criteria to control the 
> explosion of the itemset count in FPGrowth. We propose an optional limit on 
> the maximum number of generated itemsets or on the maximum number of items 
> per itemset.
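For illustration, a minimal sketch of running FPGrowth and counting the frequent 
itemsets for a given threshold (the transactions RDD and the partition count are 
assumptions, not the reporter's exact setup):

{code}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// `transactions` holds one array of items per transaction.
def countFrequentItemsets(transactions: RDD[Array[String]], minSupport: Double): Long = {
  val model = new FPGrowth()
    .setMinSupport(minSupport)   // small changes here can change the result size by orders of magnitude
    .setNumPartitions(10)
    .run(transactions)
  model.freqItemsets.count()
}
{code}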






[jira] [Updated] (SPARK-12136) rddToFileName does not properly handle prefix and suffix parameters

2015-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12136:
--
Labels: starter  (was: )

[~naveenminchu] Just read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and 
make a pull request then. Should be trivial to separately handle the prefix and 
suffix.

> rddToFileName does not properly handle prefix and suffix parameters
> ---
>
> Key: SPARK-12136
> URL: https://issues.apache.org/jira/browse/SPARK-12136
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Brian Webb
>Priority: Minor
>  Labels: starter
>
> See code here: 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala#L894
>   private[streaming] def rddToFileName[T](prefix: String, suffix: String, 
> time: Time): String = {
> if (prefix == null) {
>   time.milliseconds.toString
> } else if (suffix == null || suffix.length ==0) {
>   prefix + "-" + time.milliseconds
> } else {
>   prefix + "-" + time.milliseconds + "." + suffix
> }
>   }
> This code does not seem to properly handle the case where the prefix is 
> null but the suffix is not null: the suffix should be used but is not.
> Also, the length == 0 check is only applied to the suffix, not the prefix; 
> the check should be consistent between the two.
> Is there a reason not to address these two issues and change the code?
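As a sketch of what a symmetric treatment could look like (a hypothetical 
rewrite for illustration, not the actual patch):

{code}
import org.apache.spark.streaming.Time

// Treat prefix and suffix the same way: ignore each when null or empty.
private[streaming] def rddToFileName[T](prefix: String, suffix: String, time: Time): String = {
  val base =
    if (prefix == null || prefix.isEmpty) time.milliseconds.toString
    else prefix + "-" + time.milliseconds
  if (suffix == null || suffix.isEmpty) base else base + "." + suffix
}
{code}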






[jira] [Resolved] (SPARK-12162) Embedded Spark on a JBoss server crashes due to System.exit when SparkUncaughtExceptionHandler is called

2015-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12162.
---
Resolution: Not A Problem

I think the answer is that you can't in general "embed" the Spark executor 
processes.

> Embedded Spark on a JBoss server crashes due to System.exit when 
> SparkUncaughtExceptionHandler is called
> ---
>
> Key: SPARK-12162
> URL: https://issues.apache.org/jira/browse/SPARK-12162
> Project: Spark
>  Issue Type: Bug
>Reporter: Sasi
>Priority: Critical
>
> Hello,
> I'm running Spark on JBoss and sometimes I get the following exception:
> {code}
> ERROR : (org.apache.spark.util.SparkUncaughtExceptionHandler:96) 
> -[appclient-registration-retry-thread] Uncaught exception in thread 
> Thread[appclient-registration-retry-thread,5,jboss]
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.FutureTask@4e33f83e rejected from 
> java.util.concurrent.ThreadPoolExecutor@35eed68e[Running, pool size = 1, 
> active threads = 0, queued tasks = 0, completed tasks = 3]
> {code}
> Then my JBoss crashed, so I took a look at the source of 
> SparkUncaughtExceptionHandler and noticed that when an uncaught exception 
> occurs it calls System.exit(SparkExitCode.UNCAUGHT_EXCEPTION).
> [https://github.com/apache/spark/blob/3bd77b213a9cd177c3ea3c61d37e5098e55f75a5/core/src/main/scala/org/apache/spark/util/SparkUncaughtExceptionHandler.scala]
> Since System.exit(...) is called, my JBoss server crashes.
> Is there any workaround or fix that can help me?
> Thanks,
> Sasi






[jira] [Created] (SPARK-12163) FPGrowth unusable on some datasets without extensive tweaking of the support threshold

2015-12-06 Thread Jaroslav Kuchar (JIRA)
Jaroslav Kuchar created SPARK-12163:
---

 Summary: FPGrowth unusable on some datasets without extensive 
tweaking of the support threshold
 Key: SPARK-12163
 URL: https://issues.apache.org/jira/browse/SPARK-12163
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Jaroslav Kuchar


This problem occurs on standard machine learning UCI datasets. 
Details for the "audiology" dataset follow: it contains only 226 transactions and 
70 attributes. Mining frequent itemsets with a support threshold of 0.95 
produces 73,162,705 itemsets; with a support of 0.94, 366,880,771 itemsets.
More details about the experiment: 
https://gist.github.com/jaroslav-kuchar/edbcbe72c5a884136db1

The number of generated itemsets grows rapidly with the number of unique items 
per transaction. Given the combinatorial explosion, this can lead to 
CPU-intensive, long-running tasks for many settings of the support threshold. 
The extensive tweaking of the support threshold this requires makes the FPGrowth 
implementation unusable even for a small dataset.

It would be useful to implement additional stopping criteria to control the 
explosion of the itemset count in FPGrowth. We propose an optional limit on the 
maximum number of generated itemsets or on the maximum number of items per 
itemset.






[jira] [Updated] (SPARK-12125) pull out nondeterministic expressions from Join

2015-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12125:
--
Affects Version/s: (was: 1.5.1)
   (was: 1.5.0)
 Target Version/s:   (was: 1.6.0)
 Priority: Minor  (was: Major)
Fix Version/s: (was: 1.6.0)

[~iward] don't set target/fix version. Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> pull out nondeterministic expressions from Join
> ---
>
> Key: SPARK-12125
> URL: https://issues.apache.org/jira/browse/SPARK-12125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
>Reporter: iward
>Priority: Minor
>
> Currently, *nondeterministic expressions* are only allowed in *Project* or 
> *Filter*, and they can only be pulled out when they appear in a *UnaryNode*.
> However, in many cases we use nondeterministic expressions on *join keys* to 
> avoid data skew, for example:
> {noformat}
> select * 
> from tableA a 
> join 
> (select * from tableB) b 
> on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( 
> (-rand() * 1000 ) as string ) else a.brand_code end ))  = b.brand_code
> {noformat}
> This PR introduces a mechanism to pull nondeterministic expressions out of a 
> *Join*, so that nondeterministic expressions can be used in a *Join* 
> appropriately.






[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-12-06 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044223#comment-15044223
 ] 

hujiayin commented on SPARK-4036:
-

Hi Andrew,

The code is implemented in Scala and integrated with Spark. I tested it after I 
implemented it, and I also verified it against some of the papers listed inside 
the code and in this JIRA. Could you send me your features and models so that I 
can do further testing?



> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-12-06 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044223#comment-15044223
 ] 

hujiayin edited comment on SPARK-4036 at 12/7/15 1:00 AM:
--

Hi Andrew,

The code is implemented in Scala and integrated with Spark. I tested it after I 
implemented it, and I also verified it against some of the papers listed inside 
the code and in this JIRA. Could you send me your features and models so that I 
can do further testing? 




was (Author: hujiayin):
Hi Andrew,

The code is implemented by Scala and integrated with Spark. I tested it after I 
implemented it. I also verify it with some papers listed inside the code and 
this jira. Could you send me your features and models that I can have further 
testing?



> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Assigned] (SPARK-12155) Execution OOM after a relatively large dataset is cached in the cluster.

2015-12-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-12155:
--

Assignee: Josh Rosen

> Execution OOM after a relatively large dataset is cached in the cluster.
> ---
>
> Key: SPARK-12155
> URL: https://issues.apache.org/jira/browse/SPARK-12155
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Yin Huai
>Assignee: Josh Rosen
>Priority: Blocker
>
> I have a cluster with roughly 80GB of memory. I cached a 43GB DataFrame, and 
> when I started to consume the query I got the following exception (I added 
> extra logging to the code).
> {code}
> 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for 
> 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize.
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for 
> block rdd_94_37(free: 3253659951, max: 16798973952)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for 
> block rdd_94_37(free: 3252611375, max: 16798973952)
> 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for 
> block rdd_94_37(free: 3314840375, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for 
> block rdd_94_37(free: 3215892137, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space 
> for block rdd_94_37(free: 3117216424, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space 
> for block rdd_94_37(free: 2919868859, max: 16866344960)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space 
> for block rdd_94_37(free: 2687050010, max: 16929521664)
> 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space 
> for block rdd_94_37(free: 2292321531, max: 16929521664)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space 
> for block rdd_94_37(free: 1701062715, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space 
> for block rdd_94_37(free: 799417533, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would 
> require dropping another block from the same RDD
> 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in 
> memory! (computed 2.4 GB so far)
> 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB 
> (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB.
> 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory 
> from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of 
> memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). 
> 3077 bytes result sent to driver
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120
> 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120)
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 128
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 132
> 15/12/05 01:20:56 INFO Executor: Running task 9.0 in stage 5.0 

[jira] [Resolved] (SPARK-12152) Speed up Scalastyle by only running one SBT command instead of four

2015-12-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-12152.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10151
[https://github.com/apache/spark/pull/10151]

> Speed up Scalastyle by only running one SBT command instead of four
> ---
>
> Key: SPARK-12152
> URL: https://issues.apache.org/jira/browse/SPARK-12152
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> dev/scalastyle runs four SBT commands when only one would suffice. We should 
> fix this in order to speed up pull request builds by about 60 seconds.






[jira] [Created] (SPARK-12166) Unset hadoop related environment in testing

2015-12-06 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-12166:
--

 Summary: Unset hadoop related environment in testing 
 Key: SPARK-12166
 URL: https://issues.apache.org/jira/browse/SPARK-12166
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 1.5.2
Reporter: Jeff Zhang


I tried to run HiveSparkSubmitSuite on my local box, but it fails. The cause is 
that Spark is still picking up my local single-node Hadoop cluster when running 
the unit test, which doesn't make sense. These environment variables should be 
unset before testing, and I suspect dev/run-tests doesn't do that either. 

Here's the error message:

{code}
Cause: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch 
dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxr-xr-x
[info]   at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
[info]   at 
org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
[info]   at 
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
[info]   at 
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
{code}






[jira] [Commented] (SPARK-12125) pull out nondeterministic expressions from Join

2015-12-06 Thread iward (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044272#comment-15044272
 ] 

iward commented on SPARK-12125:
---

OK, got it. Thanks a lot.

> pull out nondeterministic expressions from Join
> ---
>
> Key: SPARK-12125
> URL: https://issues.apache.org/jira/browse/SPARK-12125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: iward
>Priority: Minor
>
> Currently, *nondeterministic expressions* are only allowed in *Project* or 
> *Filter*, and they can only be pulled out when they appear in a *UnaryNode*.
> However, in many cases we use nondeterministic expressions on *join keys* to 
> avoid data skew, for example:
> {noformat}
> select * 
> from tableA a 
> join 
> (select * from tableB) b 
> on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( 
> (-rand() * 1000 ) as string ) else a.brand_code end ))  = b.brand_code
> {noformat}
> This PR introduces a mechanism to pull nondeterministic expressions out of a 
> *Join*, so that nondeterministic expressions can be used in a *Join* 
> appropriately.






[jira] [Commented] (SPARK-12155) Execution OOM after a relatively large dataset is cached in the cluster.

2015-12-06 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044237#comment-15044237
 ] 

Josh Rosen commented on SPARK-12155:


I'm working on fixing this issue. I have a regression test for this bug at 
https://github.com/apache/spark/commit/4c8110ddeee990507c9347700dec557fc22a55a5.
 While investigating this, I found a closely-related bug which impacts eviction 
of storage memory in cases where you have only a single running task on an 
executor (this bug, SPARK-12155, is triggered by having multiple running tasks 
on an executor). I'm going to break down the task of fixing this bug into a 
series of smaller patches in order to lessen the review burden.

> Execution OOM after a relatively large dataset is cached in the cluster.
> ---
>
> Key: SPARK-12155
> URL: https://issues.apache.org/jira/browse/SPARK-12155
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Yin Huai
>Assignee: Josh Rosen
>Priority: Blocker
>
> I have a cluster with roughly 80GB of memory. I cached a 43GB DataFrame, and 
> when I started to consume the query I got the following exception (I added 
> extra logging to the code).
> {code}
> 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for 
> 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize.
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for 
> block rdd_94_37(free: 3253659951, max: 16798973952)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for 
> block rdd_94_37(free: 3252611375, max: 16798973952)
> 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for 
> block rdd_94_37(free: 3314840375, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for 
> block rdd_94_37(free: 3215892137, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space 
> for block rdd_94_37(free: 3117216424, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space 
> for block rdd_94_37(free: 2919868859, max: 16866344960)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space 
> for block rdd_94_37(free: 2687050010, max: 16929521664)
> 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space 
> for block rdd_94_37(free: 2292321531, max: 16929521664)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space 
> for block rdd_94_37(free: 1701062715, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space 
> for block rdd_94_37(free: 799417533, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would 
> require dropping another block from the same RDD
> 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in 
> memory! (computed 2.4 GB so far)
> 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB 
> (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB.
> 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory 
> from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of 
> memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 

[jira] [Commented] (SPARK-12060) Avoid memory copy in JavaSerializerInstance.serialize

2015-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044304#comment-15044304
 ] 

Apache Spark commented on SPARK-12060:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/10167

> Avoid memory copy in JavaSerializerInstance.serialize
> -
>
> Key: SPARK-12060
> URL: https://issues.apache.org/jira/browse/SPARK-12060
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
>
> JavaSerializerInstance.serialize uses ByteArrayOutputStream.toByteArray to 
> get the serialized data. ByteArrayOutputStream.toByteArray has to copy the 
> content of its internal array into a new array. However, since that array is 
> immediately converted to a ByteBuffer, the memory copy can be avoided.






[jira] [Resolved] (SPARK-12138) Escape \u in the generated comments.

2015-12-06 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12138.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10155
[https://github.com/apache/spark/pull/10155]

> Escape \u in the generated comments.
> 
>
> Key: SPARK-12138
> URL: https://issues.apache.org/jira/browse/SPARK-12138
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
> Fix For: 1.6.0
>
>
> https://spark-tests.appspot.com/test-logs/12683942






[jira] [Updated] (SPARK-12138) Escape \u in the generated comments.

2015-12-06 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12138:
-
Assignee: Xiao Li

> Escape \u in the generated comments.
> 
>
> Key: SPARK-12138
> URL: https://issues.apache.org/jira/browse/SPARK-12138
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
> Fix For: 1.6.0
>
>
> https://spark-tests.appspot.com/test-logs/12683942






[jira] [Commented] (SPARK-12040) Add toJson/fromJson to Vector/Vectors for PySpark

2015-12-06 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044145#comment-15044145
 ] 

holdenk commented on SPARK-12040:
-

So this came out of wanting a matching API between Scala and Python, since the 
toJson/fromJson methods are public.

We could use the JVM wrappers for any of the models that are implemented in 
Scala, but for models/transformers implemented in Python, copying the vector to 
and from the JVM just to write JSON would be wasteful.

> Add toJson/fromJson to Vector/Vectors for PySpark
> -
>
> Key: SPARK-12040
> URL: https://issues.apache.org/jira/browse/SPARK-12040
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Trivial
>  Labels: starter
>
> Add toJson/fromJson to Vector/Vectors for PySpark, please refer the Scala one 
> SPARK-11766.
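For reference, a minimal sketch of the Scala API added in SPARK-11766 that this 
ticket asks to mirror in PySpark:

{code}
import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.dense(1.0, 0.0, 3.0)
val json = v.toJson              // JSON string describing the vector
val parsed = Vectors.fromJson(json)
assert(v == parsed)              // round-trips to an equal vector
{code}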






[jira] [Created] (SPARK-12169) SparkR 2.0

2015-12-06 Thread Sun Rui (JIRA)
Sun Rui created SPARK-12169:
---

 Summary: SparkR 2.0
 Key: SPARK-12169
 URL: https://issues.apache.org/jira/browse/SPARK-12169
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Sun Rui


This is an umbrella issue for tracking all SparkR-related issues targeting the 
planned Spark 2.0 release.






[jira] [Updated] (SPARK-12158) [R] [SQL] Fix 'sample' functions that break R unit test cases

2015-12-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12158:

Component/s: (was: R)
 SparkR

> [R] [SQL] Fix 'sample' functions that break R unit test cases
> -
>
> Key: SPARK-12158
> URL: https://issues.apache.org/jira/browse/SPARK-12158
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Priority: Critical
>
> The existing sample functions miss the parameter 'seed', however, the 
> corresponding function interface in `generics` has such a parameter.  
> This could cause SparkR unit tests failed. For example, I hit it in one PR:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12169) SparkR 2.0

2015-12-06 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044418#comment-15044418
 ] 

Shivaram Venkataraman commented on SPARK-12169:
---

Thanks [~sunrui] for starting this. You can assign all the issues to me if you 
want to avoid people picking these up while we discuss them.

> SparkR 2.0
> --
>
> Key: SPARK-12169
> URL: https://issues.apache.org/jira/browse/SPARK-12169
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Sun Rui
>
> This is an umbrella issue addressing all SparkR related issues corresponding 
> to Spark 2.0 being planned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12167) Invoke the right sameResult function when plan is wrapped with SubQueries

2015-12-06 Thread Yadong Qi (JIRA)
Yadong Qi created SPARK-12167:
-

 Summary: Invoke the right sameResult function when plan is wrapped 
with SubQueries
 Key: SPARK-12167
 URL: https://issues.apache.org/jira/browse/SPARK-12167
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: Yadong Qi


I found this bug when using cache table:
```
spark-sql> create table src_p(key int, value int) stored as parquet;
OK
Time taken: 3.144 seconds
spark-sql> cache table src_p;
Time taken: 1.452 seconds
spark-sql> explain extended select count(*) from src_p;
```
I got the wrong physical plan
```
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[_c0#28L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#33L])
   Scan ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][]
```
and the right physical plan is
```
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[_c0#47L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#62L])
   InMemoryColumnarTableScan (InMemoryRelation [key#45,value#46], true, 1, 
StorageLevel(true, true, false, true, 1), (Scan 
ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][key#9,value#10]),
 Some(src_p))
```

When the implementation classes of `MultiInstanceRelation` (e.g. 
`LogicalRelation`, `LocalRelation`) are wrapped with SubQueries, they can't 
invoke the right `sameResult` function in their own implementation. So we need 
to eliminate SubQueries first and then invoke each relation's own `sameResult` 
implementation.
For example: when the plan is 
`Subquery(LogicalRelation(relation:ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p],
 expectedOutputAttributes:Some(ArrayBuffer(key#0, value#1`, eliminating SubQueries 
first means the `sameResult` function in `LogicalRelation` is invoked instead of 
the one in `LogicalPlan`.
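
A minimal sketch of the idea (illustrative only, not the actual patch; 
`PlanComparison` and `sameIgnoringSubqueries` are made-up names, and the Catalyst 
classes referenced are the 1.5.x ones):

{code}
// Strip Subquery wrappers before comparing plans, so the wrapped relation's own
// sameResult implementation is the one that runs.
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Subquery}

object PlanComparison {
  private def stripSubqueries(plan: LogicalPlan): LogicalPlan = plan transform {
    case Subquery(_, child) => child
  }

  def sameIgnoringSubqueries(a: LogicalPlan, b: LogicalPlan): Boolean =
    stripSubqueries(a).sameResult(stripSubqueries(b))
}
{code}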



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12165) Execution memory requests may fail to evict storage blocks if storage memory usage is below max memory

2015-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12165:


Assignee: Apache Spark  (was: Josh Rosen)

> Execution memory requests may fail to evict storage blocks if storage memory 
> usage is below max memory
> --
>
> Key: SPARK-12165
> URL: https://issues.apache.org/jira/browse/SPARK-12165
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
>
> Consider a scenario where storage memory usage has grown past the size of the 
> unevictable storage region ({{spark.memory.storageFraction}} * maxMemory) and 
> a task needs to acquire more execution memory by reclaiming evictable storage 
> memory. If the storage memory usage is less than maxMemory, then there's a 
> possibility that no storage blocks will be evicted. This is caused by how 
> {{MemoryStore.ensureFreeSpace()}} is called inside of 
> {{StorageMemoryPool.shrinkPoolToReclaimSpace()}}.
> Here's a failing regression test which demonstrates this bug: 
> https://github.com/apache/spark/commit/b519fe628a9a2b8238dfedbfd9b74bdd2ddc0de4?diff=unified#diff-b3a7cd2e011e048908d70f743c0ed7cfR155
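
To make the failure mode concrete, here is a self-contained toy model (this is not 
Spark's MemoryStore/StorageMemoryPool code, just an illustration of the assumption 
that only already-free space gets handed back):

{code}
// Toy model: if shrinking the storage pool only returns space that is already
// unused, a request larger than the unused space evicts nothing and the
// execution-memory request comes up short.
object EvictionModel {
  final case class StoragePool(poolSize: Long, used: Long) {
    def shrinkWithoutEviction(request: Long): Long = math.min(request, poolSize - used)
    def shrinkWithEviction(request: Long): Long = math.min(request, poolSize) // evict blocks as needed
  }

  def main(args: Array[String]): Unit = {
    val pool = StoragePool(poolSize = 100L, used = 90L)
    println(pool.shrinkWithoutEviction(50L)) // 10: the request is not fully satisfied
    println(pool.shrinkWithEviction(50L))    // 50: cached blocks are evicted to satisfy it
  }
}
{code}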



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12168) Need test for masked function

2015-12-06 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-12168:


 Summary: Need test for masked function
 Key: SPARK-12168
 URL: https://issues.apache.org/jira/browse/SPARK-12168
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.2
Reporter: Felix Cheung
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12172) Remove SparkR internal RDD APIs

2015-12-06 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-12172:


 Summary: Remove SparkR internal RDD APIs
 Key: SPARK-12172
 URL: https://issues.apache.org/jira/browse/SPARK-12172
 Project: Spark
  Issue Type: Sub-task
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12173) Consider supporting DataSet API in SparkR

2015-12-06 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-12173:


 Summary: Consider supporting DataSet API in SparkR
 Key: SPARK-12173
 URL: https://issues.apache.org/jira/browse/SPARK-12173
 Project: Spark
  Issue Type: Sub-task
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12171) Support DataSet API in SparkR

2015-12-06 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui updated SPARK-12171:

Component/s: Spark Submit

> Support DataSet API in SparkR
> -
>
> Key: SPARK-12171
> URL: https://issues.apache.org/jira/browse/SPARK-12171
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Sun Rui
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12171) Support DataSet API in SparkR

2015-12-06 Thread Sun Rui (JIRA)
Sun Rui created SPARK-12171:
---

 Summary: Support DataSet API in SparkR
 Key: SPARK-12171
 URL: https://issues.apache.org/jira/browse/SPARK-12171
 Project: Spark
  Issue Type: New Feature
Reporter: Sun Rui






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12171) Support DataSet API in SparkR

2015-12-06 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui updated SPARK-12171:

Component/s: (was: Spark Submit)
 SparkR

> Support DataSet API in SparkR
> -
>
> Key: SPARK-12171
> URL: https://issues.apache.org/jira/browse/SPARK-12171
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Sun Rui
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12169) SparkR 2.0

2015-12-06 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui updated SPARK-12169:

Issue Type: Improvement  (was: Bug)

> SparkR 2.0
> --
>
> Key: SPARK-12169
> URL: https://issues.apache.org/jira/browse/SPARK-12169
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Sun Rui
>
> This is an umbrella issue addressing all SparkR related issues corresponding 
> to Spark 2.0 being planned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12172) Remove SparkR internal RDD APIs

2015-12-06 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044393#comment-15044393
 ] 

Sun Rui commented on SPARK-12172:
-

Not sure if we want to remove the RDD API. This needs more discussion.

> Remove SparkR internal RDD APIs
> ---
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12166) Unset hadoop related environment in testing

2015-12-06 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-12166:
---
Priority: Minor  (was: Major)

> Unset hadoop related environment in testing 
> 
>
> Key: SPARK-12166
> URL: https://issues.apache.org/jira/browse/SPARK-12166
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>Priority: Minor
>
> I try to do test on HiveSparkSubmitSuite on local box, but fails. The cause 
> is that spark is still using my local single node cluster hadoop when doing 
> the unit test. I don't think it make sense to do that. These environment 
> variable should be unset before the testing. And I suspect dev/run-tests also
> didn't do that either. 
> Here's the error message:
> {code}
> Cause: java.lang.RuntimeException: java.lang.RuntimeException: The root 
> scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: 
> rwxr-xr-x
> [info]   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
> [info]   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> {code}
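
One plausible way to address this in test code is sketched below (a guess at the 
approach, not the actual pull request; `TestEnvSanitizer` and the variable list are 
illustrative): when a suite launches spark-submit as a child process, drop the 
Hadoop-related variables from that child's environment.

{code}
// Sketch only: launch a child process without the Hadoop environment variables
// that would otherwise point it at a locally configured cluster.
import scala.collection.JavaConverters._

object TestEnvSanitizer {
  private val hadoopVars = Seq("HADOOP_CONF_DIR", "YARN_CONF_DIR", "HADOOP_HOME")

  def launchWithoutHadoopEnv(command: Seq[String]): Process = {
    val builder = new ProcessBuilder(command.asJava)
    hadoopVars.foreach(v => builder.environment.remove(v)) // environment() is mutable
    builder.start()
  }
}
{code}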



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12167) Invoke the right sameResult function when plan is wrapped with SubQueries

2015-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12167:


Assignee: Apache Spark

> Invoke the right sameResult function when plan is warpped with SubQueries
> -
>
> Key: SPARK-12167
> URL: https://issues.apache.org/jira/browse/SPARK-12167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>Assignee: Apache Spark
>
> I find this bug when I use cache table,
> ```
> spark-sql> create table src_p(key int, value int) stored as parquet;
> OK
> Time taken: 3.144 seconds
> spark-sql> cache table src_p;
> Time taken: 1.452 seconds
> spark-sql> explain extended select count(*) from src_p;
> ```
> I got the wrong physical plan
> ```
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[_c0#28L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#33L])
>Scan ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][]
> ```
> and the right physical plan is
> ```
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[_c0#47L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#62L])
>InMemoryColumnarTableScan (InMemoryRelation [key#45,value#46], true, 
> 1, StorageLevel(true, true, false, true, 1), (Scan 
> ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][key#9,value#10]),
>  Some(src_p))
> ```
> When the implementation classes of `MultiInstanceRelation`(eg. 
> `LogicalRelation`, `LocalRelation`) are warpped with SubQueries, they can't 
> invoke the right `sameResult` function in their own implementation. So we 
> need to eliminate SubQueries first and then try to invoke `sameResult` 
> function in their own implementation.
> Like:
> When plan is 
> `Subquery(LogicalRelation(relation:ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p],
>  expectedOutputAttributes:Some(ArrayBuffer(key#0, value#1`, first 
> eliminate SubQueries, and then will invoke the `sameResult` function in 
> `LogicalRelation` instead of `LogicalPlan`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12167) Invoke the right sameResult function when plan is wrapped with SubQueries

2015-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044335#comment-15044335
 ] 

Apache Spark commented on SPARK-12167:
--

User 'watermen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10169

> Invoke the right sameResult function when plan is warpped with SubQueries
> -
>
> Key: SPARK-12167
> URL: https://issues.apache.org/jira/browse/SPARK-12167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>
> I find this bug when I use cache table,
> ```
> spark-sql> create table src_p(key int, value int) stored as parquet;
> OK
> Time taken: 3.144 seconds
> spark-sql> cache table src_p;
> Time taken: 1.452 seconds
> spark-sql> explain extended select count(*) from src_p;
> ```
> I got the wrong physical plan
> ```
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[_c0#28L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#33L])
>Scan ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][]
> ```
> and the right physical plan is
> ```
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[_c0#47L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#62L])
>InMemoryColumnarTableScan (InMemoryRelation [key#45,value#46], true, 
> 1, StorageLevel(true, true, false, true, 1), (Scan 
> ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][key#9,value#10]),
>  Some(src_p))
> ```
> When the implementation classes of `MultiInstanceRelation`(eg. 
> `LogicalRelation`, `LocalRelation`) are warpped with SubQueries, they can't 
> invoke the right `sameResult` function in their own implementation. So we 
> need to eliminate SubQueries first and then try to invoke `sameResult` 
> function in their own implementation.
> Like:
> When plan is 
> `Subquery(LogicalRelation(relation:ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p],
>  expectedOutputAttributes:Some(ArrayBuffer(key#0, value#1`, first 
> eliminate SubQueries, and then will invoke the `sameResult` function in 
> `LogicalRelation` instead of `LogicalPlan`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12167) Invoke the right sameResult function when plan is wrapped with SubQueries

2015-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12167:


Assignee: (was: Apache Spark)

> Invoke the right sameResult function when plan is warpped with SubQueries
> -
>
> Key: SPARK-12167
> URL: https://issues.apache.org/jira/browse/SPARK-12167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>
> I find this bug when I use cache table,
> ```
> spark-sql> create table src_p(key int, value int) stored as parquet;
> OK
> Time taken: 3.144 seconds
> spark-sql> cache table src_p;
> Time taken: 1.452 seconds
> spark-sql> explain extended select count(*) from src_p;
> ```
> I got the wrong physical plan
> ```
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[_c0#28L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#33L])
>Scan ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][]
> ```
> and the right physical plan is
> ```
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[_c0#47L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#62L])
>InMemoryColumnarTableScan (InMemoryRelation [key#45,value#46], true, 
> 1, StorageLevel(true, true, false, true, 1), (Scan 
> ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][key#9,value#10]),
>  Some(src_p))
> ```
> When the implementation classes of `MultiInstanceRelation`(eg. 
> `LogicalRelation`, `LocalRelation`) are warpped with SubQueries, they can't 
> invoke the right `sameResult` function in their own implementation. So we 
> need to eliminate SubQueries first and then try to invoke `sameResult` 
> function in their own implementation.
> Like:
> When plan is 
> `Subquery(LogicalRelation(relation:ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p],
>  expectedOutputAttributes:Some(ArrayBuffer(key#0, value#1`, first 
> eliminate SubQueries, and then will invoke the `sameResult` function in 
> `LogicalRelation` instead of `LogicalPlan`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12169) SparkR 2.0

2015-12-06 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044390#comment-15044390
 ] 

Felix Cheung commented on SPARK-12169:
--

Great, thanks for opening this.
I think we should definitely consider removing the RDD APIs - that would help 
get to a smaller code base.

> SparkR 2.0
> --
>
> Key: SPARK-12169
> URL: https://issues.apache.org/jira/browse/SPARK-12169
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Sun Rui
>
> This is an umbrella issue addressing all SparkR related issues corresponding 
> to Spark 2.0 being planned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12167) Invoke the right sameResult function when plan is wrapped with SubQueries

2015-12-06 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044488#comment-15044488
 ] 

Yadong Qi commented on SPARK-12167:
---

SPARK-11246 has already fixed it

> Invoke the right sameResult function when plan is warpped with SubQueries
> -
>
> Key: SPARK-12167
> URL: https://issues.apache.org/jira/browse/SPARK-12167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>
> I find this bug when I use cache table,
> ```
> spark-sql> create table src_p(key int, value int) stored as parquet;
> OK
> Time taken: 3.144 seconds
> spark-sql> cache table src_p;
> Time taken: 1.452 seconds
> spark-sql> explain extended select count(*) from src_p;
> ```
> I got the wrong physical plan
> ```
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[_c0#28L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#33L])
>Scan ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][]
> ```
> and the right physical plan is
> ```
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[_c0#47L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#62L])
>InMemoryColumnarTableScan (InMemoryRelation [key#45,value#46], true, 
> 1, StorageLevel(true, true, false, true, 1), (Scan 
> ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][key#9,value#10]),
>  Some(src_p))
> ```
> When the implementation classes of `MultiInstanceRelation`(eg. 
> `LogicalRelation`, `LocalRelation`) are warpped with SubQueries, they can't 
> invoke the right `sameResult` function in their own implementation. So we 
> need to eliminate SubQueries first and then try to invoke `sameResult` 
> function in their own implementation.
> Like:
> When plan is 
> `Subquery(LogicalRelation(relation:ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p],
>  expectedOutputAttributes:Some(ArrayBuffer(key#0, value#1`, first 
> eliminate SubQueries, and then will invoke the `sameResult` function in 
> `LogicalRelation` instead of `LogicalPlan`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.

2015-12-06 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044348#comment-15044348
 ] 

Josh Rosen commented on SPARK-12155:


I think this is blocked by the fix for SPARK-12165, a closely-related bug which 
impacts the eviction of storage memory in a single-concurrent-task case.
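
As a quick sanity check on the numbers in the log below (assuming the default 
spark.memory.storageFraction of 0.5; this is plain arithmetic, not Spark code):

{code}
// Reproduce the reported storageRegionSize from maxMemory and the assumed
// default spark.memory.storageFraction = 0.5.
object MemoryRegions {
  def main(args: Array[String]): Unit = {
    val maxMemory = 16929521664L                   // "maxMemory" from the log
    val storageFraction = 0.5                      // assumed default
    val storageRegionSize = (maxMemory * storageFraction).toLong
    println(storageRegionSize)                     // 8464760832, matching the log
  }
}
{code}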

> Execution OOM after a relative large dataset cached in the cluster.
> ---
>
> Key: SPARK-12155
> URL: https://issues.apache.org/jira/browse/SPARK-12155
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Yin Huai
>Assignee: Josh Rosen
>Priority: Blocker
>
> I have a cluster with relative 80GB of mem. Then, I cached a 43GB dataframe. 
> When I start to consume the query. I got the following exception (I added 
> more logs to the code).
> {code}
> 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for 
> 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize.
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for 
> block rdd_94_37(free: 3253659951, max: 16798973952)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for 
> block rdd_94_37(free: 3252611375, max: 16798973952)
> 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for 
> block rdd_94_37(free: 3314840375, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for 
> block rdd_94_37(free: 3215892137, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space 
> for block rdd_94_37(free: 3117216424, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space 
> for block rdd_94_37(free: 2919868859, max: 16866344960)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space 
> for block rdd_94_37(free: 2687050010, max: 16929521664)
> 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space 
> for block rdd_94_37(free: 2292321531, max: 16929521664)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space 
> for block rdd_94_37(free: 1701062715, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space 
> for block rdd_94_37(free: 799417533, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would 
> require dropping another block from the same RDD
> 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in 
> memory! (computed 2.4 GB so far)
> 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB 
> (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB.
> 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory 
> from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of 
> memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). 
> 3077 bytes result sent to driver
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120
> 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120)
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: 

[jira] [Commented] (SPARK-12045) Use joda's DateTime to replace Calendar

2015-12-06 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044360#comment-15044360
 ] 

Wenchen Fan commented on SPARK-12045:
-

1. We need to operate on UTF8String (an internal string representation in Spark 
SQL), which is why we do it by hand (see DateTimeUtils.stringToDate, 
DateTimeUtils.stringToTimestamp). And yes, we could convert the UTF8String to a 
String first and call a third-party library like Joda to help, but that would 
be inefficient.
2. We follow Hive in returning null for an invalid format string, but I agree 
that throwing an exception by default seems more reasonable.

cc [~marmbrus]
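
To make the behaviour under discussion concrete, here is a small comparison using 
the plain Java and Joda APIs (not Spark's DateTimeUtils; assumes joda-time on the 
classpath):

{code}
// A lenient SimpleDateFormat silently rolls an invalid date forward, while Joda
// rejects it with a descriptive error.
import java.text.SimpleDateFormat
import org.joda.time.format.DateTimeFormat
import scala.util.Try

object DateParsingComparison {
  def main(args: Array[String]): Unit = {
    val input = "2011-02-29"                       // not a valid date

    val sdf = new SimpleDateFormat("yyyy-MM-dd")   // lenient by default
    println(sdf.parse(input))                      // silently parsed as 2011-03-01

    val joda = DateTimeFormat.forPattern("yyyy-MM-dd")
    println(Try(joda.parseLocalDate(input)))       // Failure(...) with a clear message
  }
}
{code}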

> Use joda's DateTime to replace Calendar
> ---
>
> Key: SPARK-12045
> URL: https://issues.apache.org/jira/browse/SPARK-12045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>
> Currently spark use Calendar to build the Date when convert from string to 
> Date. But Calendar can not detect the invalid date format (e.g. 2011-02-29).
> Although we can use Calendar.setLenient(false) to enable Calendar to detect 
> the invalid date format, but found the error message very confusing. So I 
> suggest to use joda's DateTime to replace Calendar. 
> Besides that, I found that there's already some format checking logic when 
> casting string to date. And if it is invalid format, it would return None. I 
> don't think it make sense to just return None without telling users.  I think 
> by default should just throw exception, and user can set property to allow it 
> return None if invalid format. 
> {code}
> if (i == 0 && j != 4) {
>   // year should have exact four digits
>   return None
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12168) Need test for conflicted function in R

2015-12-06 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-12168:
-
Summary: Need test for conflicted function in R  (was: Need test for masked 
function)

> Need test for conflicted function in R
> --
>
> Key: SPARK-12168
> URL: https://issues.apache.org/jira/browse/SPARK-12168
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Currently it is hard to know if a function in base or stats packages are 
> masked when add new function in SparkR.
> Having an automated test would make it easier to track such changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12168) Need test for masked function

2015-12-06 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-12168:
-
Description: 
Currently it is hard to know whether a function in the base or stats packages is 
masked when adding a new function in SparkR.
Having an automated test would make it easier to track such changes.

> Need test for masked function
> -
>
> Key: SPARK-12168
> URL: https://issues.apache.org/jira/browse/SPARK-12168
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Currently it is hard to know if a function in base or stats packages are 
> masked when add new function in SparkR.
> Having an automated test would make it easier to track such changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12170) Deprecate the JAVA-specific deserialized storage levels

2015-12-06 Thread Sun Rui (JIRA)
Sun Rui created SPARK-12170:
---

 Summary: Deprecate the JAVA-specific deserialized storage levels
 Key: SPARK-12170
 URL: https://issues.apache.org/jira/browse/SPARK-12170
 Project: Spark
  Issue Type: Sub-task
Reporter: Sun Rui


This is to be consistent with SPARK-12091, which is for PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12169) SparkR 2.0

2015-12-06 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044394#comment-15044394
 ] 

Felix Cheung commented on SPARK-12169:
--

For those who are reading - we shouldn't open PRs yet. See SPARK-11806

> SparkR 2.0
> --
>
> Key: SPARK-12169
> URL: https://issues.apache.org/jira/browse/SPARK-12169
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Sun Rui
>
> This is an umbrella issue addressing all SparkR related issues corresponding 
> to Spark 2.0 being planned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12165) Execution memory requests may fail to evict storage blocks if storage memory usage is below max memory

2015-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12165:


Assignee: Josh Rosen  (was: Apache Spark)

> Execution memory requests may fail to evict storage blocks if storage memory 
> usage is below max memory
> --
>
> Key: SPARK-12165
> URL: https://issues.apache.org/jira/browse/SPARK-12165
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> Consider a scenario where storage memory usage has grown past the size of the 
> unevictable storage region ({{spark.memory.storageFraction}} * maxMemory) and 
> a task needs to acquire more execution memory by reclaiming evictable storage 
> memory. If the storage memory usage is less than maxMemory, then there's a 
> possibility that no storage blocks will be evicted. This is caused by how 
> {{MemoryStore.ensureFreeSpace()}} is called inside of 
> {{StorageMemoryPool.shrinkPoolToReclaimSpace()}}.
> Here's a failing regression test which demonstrates this bug: 
> https://github.com/apache/spark/commit/b519fe628a9a2b8238dfedbfd9b74bdd2ddc0de4?diff=unified#diff-b3a7cd2e011e048908d70f743c0ed7cfR155



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12165) Execution memory requests may fail to evict storage blocks if storage memory usage is below max memory

2015-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044342#comment-15044342
 ] 

Apache Spark commented on SPARK-12165:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10170

> Execution memory requests may fail to evict storage blocks if storage memory 
> usage is below max memory
> --
>
> Key: SPARK-12165
> URL: https://issues.apache.org/jira/browse/SPARK-12165
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> Consider a scenario where storage memory usage has grown past the size of the 
> unevictable storage region ({{spark.memory.storageFraction}} * maxMemory) and 
> a task needs to acquire more execution memory by reclaiming evictable storage 
> memory. If the storage memory usage is less than maxMemory, then there's a 
> possibility that no storage blocks will be evicted. This is caused by how 
> {{MemoryStore.ensureFreeSpace()}} is called inside of 
> {{StorageMemoryPool.shrinkPoolToReclaimSpace()}}.
> Here's a failing regression test which demonstrates this bug: 
> https://github.com/apache/spark/commit/b519fe628a9a2b8238dfedbfd9b74bdd2ddc0de4?diff=unified#diff-b3a7cd2e011e048908d70f743c0ed7cfR155



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12168) Need test for conflicted function in R

2015-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12168:


Assignee: (was: Apache Spark)

> Need test for conflicted function in R
> --
>
> Key: SPARK-12168
> URL: https://issues.apache.org/jira/browse/SPARK-12168
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Currently it is hard to know if a function in base or stats packages are 
> masked when add new function in SparkR.
> Having an automated test would make it easier to track such changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12171) Support DataSet API in SparkR

2015-12-06 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui closed SPARK-12171.
---
Resolution: Duplicate

> Support DataSet API in SparkR
> -
>
> Key: SPARK-12171
> URL: https://issues.apache.org/jira/browse/SPARK-12171
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Sun Rui
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6990) Add Java linting script

2015-12-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6990:
--
Assignee: Dmitry Erastov

> Add Java linting script
> ---
>
> Key: SPARK-6990
> URL: https://issues.apache.org/jira/browse/SPARK-6990
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Dmitry Erastov
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>
> It would be nice to add a {{dev/lint-java}} script to enforce style rules for 
> Spark's Java code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12167) Invoke the right sameResult function when plan is wrapped with SubQueries

2015-12-06 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi closed SPARK-12167.
-
Resolution: Duplicate

> Invoke the right sameResult function when plan is warpped with SubQueries
> -
>
> Key: SPARK-12167
> URL: https://issues.apache.org/jira/browse/SPARK-12167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>
> I find this bug when I use cache table,
> ```
> spark-sql> create table src_p(key int, value int) stored as parquet;
> OK
> Time taken: 3.144 seconds
> spark-sql> cache table src_p;
> Time taken: 1.452 seconds
> spark-sql> explain extended select count(*) from src_p;
> ```
> I got the wrong physical plan
> ```
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[_c0#28L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#33L])
>Scan ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][]
> ```
> and the right physical plan is
> ```
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[_c0#47L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#62L])
>InMemoryColumnarTableScan (InMemoryRelation [key#45,value#46], true, 
> 1, StorageLevel(true, true, false, true, 1), (Scan 
> ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][key#9,value#10]),
>  Some(src_p))
> ```
> When the implementation classes of `MultiInstanceRelation`(eg. 
> `LogicalRelation`, `LocalRelation`) are warpped with SubQueries, they can't 
> invoke the right `sameResult` function in their own implementation. So we 
> need to eliminate SubQueries first and then try to invoke `sameResult` 
> function in their own implementation.
> Like:
> When plan is 
> `Subquery(LogicalRelation(relation:ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p],
>  expectedOutputAttributes:Some(ArrayBuffer(key#0, value#1`, first 
> eliminate SubQueries, and then will invoke the `sameResult` function in 
> `LogicalRelation` instead of `LogicalPlan`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12166) Unset hadoop related environment in testing

2015-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044531#comment-15044531
 ] 

Apache Spark commented on SPARK-12166:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10172

> Unset hadoop related environment in testing 
> 
>
> Key: SPARK-12166
> URL: https://issues.apache.org/jira/browse/SPARK-12166
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>Priority: Minor
>
> I try to do test on HiveSparkSubmitSuite on local box, but fails. The cause 
> is that spark is still using my local single node cluster hadoop when doing 
> the unit test. I don't think it make sense to do that. These environment 
> variable should be unset before the testing. And I suspect dev/run-tests also
> didn't do that either. 
> Here's the error message:
> {code}
> Cause: java.lang.RuntimeException: java.lang.RuntimeException: The root 
> scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: 
> rwxr-xr-x
> [info]   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
> [info]   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12166) Unset hadoop related environment in testing

2015-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12166:


Assignee: Apache Spark

> Unset hadoop related environment in testing 
> 
>
> Key: SPARK-12166
> URL: https://issues.apache.org/jira/browse/SPARK-12166
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> I try to do test on HiveSparkSubmitSuite on local box, but fails. The cause 
> is that spark is still using my local single node cluster hadoop when doing 
> the unit test. I don't think it make sense to do that. These environment 
> variable should be unset before the testing. And I suspect dev/run-tests also
> didn't do that either. 
> Here's the error message:
> {code}
> Cause: java.lang.RuntimeException: java.lang.RuntimeException: The root 
> scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: 
> rwxr-xr-x
> [info]   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
> [info]   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12166) Unset hadoop related environment in testing

2015-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12166:


Assignee: (was: Apache Spark)

> Unset hadoop related environment in testing 
> 
>
> Key: SPARK-12166
> URL: https://issues.apache.org/jira/browse/SPARK-12166
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>Priority: Minor
>
> I try to do test on HiveSparkSubmitSuite on local box, but fails. The cause 
> is that spark is still using my local single node cluster hadoop when doing 
> the unit test. I don't think it make sense to do that. These environment 
> variable should be unset before the testing. And I suspect dev/run-tests also
> didn't do that either. 
> Here's the error message:
> {code}
> Cause: java.lang.RuntimeException: java.lang.RuntimeException: The root 
> scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: 
> rwxr-xr-x
> [info]   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
> [info]   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org