[jira] [Updated] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene

2016-12-26 Thread zhangdenghui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangdenghui updated SPARK-19007:
-
Description: 
Test data: 80G of CTR training data from Criteo Labs 
(http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
used 1 of the 24 days' data. Some features needed to be replaced by newly 
generated continuous features; the way to generate the new features follows 
the approach described in the XGBoost paper.

Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per 
executor.

Parameters: numIterations 10, maxDepth 8; the remaining parameters are defaults.

I tested the GradientBoostedTrees algorithm in MLlib using the 80G CTR data 
mentioned above.

It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT 
rounds. Without these task failures and retries it could be much faster, 
saving about half the time. I think they are caused by the RDD named predError 
in the while loop of the boost method in GradientBoostedTrees.scala: the 
lineage of predError keeps growing after every GBT round, and eventually it 
caused failures like this:

(ExecutorLostFailure (executor 6 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.)

I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needed 
was too much (even increasing the memory by half did not solve the problem), 
so I don't think that is a proper fix.

Setting the checkpoint interval for predError smaller would also cut the 
lineage, but it increases the IO cost a lot.

I tried another way to solve this problem: I persist the predError RDD every 
round, keep a reference to the previous round's RDD in pre_predError, and 
unpersist it because it is no longer needed.

Finally, with this change it took about 45 minutes, with no task failures and 
no extra memory added.

So when the data is much larger than memory, this small improvement can speed 
up GradientBoostedTrees by 1~2x.
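
For illustration, here is a minimal Scala sketch of the persist/unpersist 
pattern described above (a simplified stand-in for the real boost method; the 
update function and the per-round count action are assumptions):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch: cache the current round's predError and drop the previous round's
// cached copy once the new one has been materialized by an action.
def boostLoop(init: RDD[(Double, Double)], numIterations: Int)
             (update: RDD[(Double, Double)] => RDD[(Double, Double)]): RDD[(Double, Double)] = {
  var predError = init.persist(StorageLevel.MEMORY_AND_DISK)
  var m = 0
  while (m < numIterations) {
    val prev = predError
    predError = update(prev).persist(StorageLevel.MEMORY_AND_DISK)
    predError.count()                  // stand-in for the per-round action (fitting the next tree)
    prev.unpersist(blocking = false)   // the previous round's cached RDD is no longer needed
    m += 1
  }
  predError
}
{code}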


  was:
Test data: 80G of CTR training data from Criteo Labs 
(http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
used 1 of the 24 days' data. Some features needed to be replaced by newly 
generated continuous features; the way to generate the new features follows 
the approach described in the XGBoost paper.

Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per 
executor.

I tested the GradientBoostedTrees algorithm in MLlib using the 80G CTR data 
mentioned above.

It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT 
rounds. Without these task failures and retries it could be much faster, 
saving about half the time. I think they are caused by the RDD named predError 
in the while loop of the boost method in GradientBoostedTrees.scala: the 
lineage of predError keeps growing after every GBT round, and eventually it 
caused failures like this:

(ExecutorLostFailure (executor 6 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.)

I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needed 
was too much (even increasing the memory by half did not solve the problem), 
so I don't think that is a proper fix.

Setting the checkpoint interval for predError smaller would also cut the 
lineage, but it increases the IO cost a lot.

I tried another way to solve this problem: I persist the predError RDD every 
round, keep a reference to the previous round's RDD in pre_predError, and 
unpersist it because it is no longer needed.

Finally, with this change it took about 45 minutes, with no task failures and 
no extra memory added.

So when the data is much larger than memory, this small improvement can speed 
up GradientBoostedTrees by 1~2x.



> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
> 
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 
> 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120G 
> memory, 40 cores, 43TB disk per server).
>Reporter: zhangdenghui
> Fix For: 2.1.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs 
> (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
> used 1 of the 24 days' data. Some 

[jira] [Updated] (SPARK-19004) Fix `JDBCWriteSuite.testH2Dialect` by removing `getCatalystType`

2016-12-26 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19004:
--
Description: 
JDBCSuite and JDBCWriteSuite each have their own testH2Dialect for testing 
purposes.

This issue fixes testH2Dialect in JDBCWriteSuite by removing its 
getCatalystType implementation so that the correct types are returned. 
Currently, it always returns Some(StringType), which is incorrect. For the 
testH2Dialect in JDBCSuite, the override is intentional because of the test 
case "Remap types via JdbcDialects".
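
For illustration only, a minimal Scala sketch of the difference (not the 
actual suite code; the anonymous dialects below are assumptions):
{code}
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

// Like the old testH2Dialect in JDBCWriteSuite: every column is forced to StringType.
val overridingDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    Some(StringType)
}

// After this fix: no getCatalystType override, so the default None is returned
// and the JDBC relation keeps the driver-reported column types.
val defaultDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")
}
{code}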

  was:
`JdbcDialect` subclasses should return `None` by default. However, 
`testH2Dialect.getCatalystType` always returns `Some(StringType)`, which is 
incorrect.

This issue fixes `testH2Dialect` by removing the wrong `getCatalystType` 
implementation.


> Fix `JDBCWriteSuite.testH2Dialect` by removing `getCatalystType`
> 
>
> Key: SPARK-19004
> URL: https://issues.apache.org/jira/browse/SPARK-19004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> JDBCSuite and JDBCWriteSuite each have their own testH2Dialect for testing 
> purposes.
> This issue fixes testH2Dialect in JDBCWriteSuite by removing its 
> getCatalystType implementation so that the correct types are returned. 
> Currently, it always returns Some(StringType), which is incorrect. For the 
> testH2Dialect in JDBCSuite, the override is intentional because of the test 
> case "Remap types via JdbcDialects".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19004) Fix `JDBCWriteSuite.testH2Dialect` by removing `getCatalystType`

2016-12-26 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19004:
--
Summary: Fix `JDBCWriteSuite.testH2Dialect` by removing `getCatalystType`  
(was: Fix `testH2Dialect` by removing `getCatalystType`)

> Fix `JDBCWriteSuite.testH2Dialect` by removing `getCatalystType`
> 
>
> Key: SPARK-19004
> URL: https://issues.apache.org/jira/browse/SPARK-19004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `JdbcDialect` subclasses should return `None` by default. However, 
> `testH2Dialect.getCatalystType` always returns `Some(StringType)`, which is 
> incorrect.
> This issue fixes `testH2Dialect` by removing the wrong `getCatalystType` 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19007:


Assignee: Apache Spark

> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
> 
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 
> 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120G 
> memory, 40 cores, 43TB disk per server).
>Reporter: zhangdenghui
>Assignee: Apache Spark
> Fix For: 2.1.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs 
> (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
> used 1 of the 24 days' data. Some features needed to be replaced by newly 
> generated continuous features; the way to generate the new features follows 
> the approach described in the XGBoost paper.
> Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per 
> executor.
> I tested the GradientBoostedTrees algorithm in MLlib using the 80G CTR data 
> mentioned above.
> It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT 
> rounds. Without these task failures and retries it could be much faster, 
> saving about half the time. I think they are caused by the RDD named 
> predError in the while loop of the boost method in 
> GradientBoostedTrees.scala: the lineage of predError keeps growing after 
> every GBT round, and eventually it caused failures like this:
> (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) 
> Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 
> GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead.)
> I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needed 
> was too much (even increasing the memory by half did not solve the problem), 
> so I don't think that is a proper fix.
> Setting the checkpoint interval for predError smaller would also cut the 
> lineage, but it increases the IO cost a lot.
> I tried another way to solve this problem: I persist the predError RDD every 
> round, keep a reference to the previous round's RDD in pre_predError, and 
> unpersist it because it is no longer needed.
> Finally, with this change it took about 45 minutes, with no task failures and 
> no extra memory added.
> So when the data is much larger than memory, this small improvement can 
> speed up GradientBoostedTrees by 1~2x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene

2016-12-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779790#comment-15779790
 ] 

Apache Spark commented on SPARK-19007:
--

User 'zdh2292390' has created a pull request for this issue:
https://github.com/apache/spark/pull/16415

> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
> 
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 
> 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120G 
> memory, 40 cores, 43TB disk per server).
>Reporter: zhangdenghui
> Fix For: 2.1.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs 
> (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
> used 1 of the 24 days' data. Some features needed to be replaced by newly 
> generated continuous features; the way to generate the new features follows 
> the approach described in the XGBoost paper.
> Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per 
> executor.
> I tested the GradientBoostedTrees algorithm in MLlib using the 80G CTR data 
> mentioned above.
> It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT 
> rounds. Without these task failures and retries it could be much faster, 
> saving about half the time. I think they are caused by the RDD named 
> predError in the while loop of the boost method in 
> GradientBoostedTrees.scala: the lineage of predError keeps growing after 
> every GBT round, and eventually it caused failures like this:
> (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) 
> Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 
> GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead.)
> I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needed 
> was too much (even increasing the memory by half did not solve the problem), 
> so I don't think that is a proper fix.
> Setting the checkpoint interval for predError smaller would also cut the 
> lineage, but it increases the IO cost a lot.
> I tried another way to solve this problem: I persist the predError RDD every 
> round, keep a reference to the previous round's RDD in pre_predError, and 
> unpersist it because it is no longer needed.
> Finally, with this change it took about 45 minutes, with no task failures and 
> no extra memory added.
> So when the data is much larger than memory, this small improvement can 
> speed up GradientBoostedTrees by 1~2x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19007:


Assignee: (was: Apache Spark)

> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
> 
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 
> 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120G 
> memory, 40 cores, 43TB disk per server).
>Reporter: zhangdenghui
> Fix For: 2.1.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs 
> (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
> used 1 of the 24 days' data. Some features needed to be replaced by newly 
> generated continuous features; the way to generate the new features follows 
> the approach described in the XGBoost paper.
> Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per 
> executor.
> I tested the GradientBoostedTrees algorithm in MLlib using the 80G CTR data 
> mentioned above.
> It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT 
> rounds. Without these task failures and retries it could be much faster, 
> saving about half the time. I think they are caused by the RDD named 
> predError in the while loop of the boost method in 
> GradientBoostedTrees.scala: the lineage of predError keeps growing after 
> every GBT round, and eventually it caused failures like this:
> (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) 
> Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 
> GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead.)
> I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needed 
> was too much (even increasing the memory by half did not solve the problem), 
> so I don't think that is a proper fix.
> Setting the checkpoint interval for predError smaller would also cut the 
> lineage, but it increases the IO cost a lot.
> I tried another way to solve this problem: I persist the predError RDD every 
> round, keep a reference to the previous round's RDD in pre_predError, and 
> unpersist it because it is no longer needed.
> Finally, with this change it took about 45 minutes, with no task failures and 
> no extra memory added.
> So when the data is much larger than memory, this small improvement can 
> speed up GradientBoostedTrees by 1~2x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-12-26 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779766#comment-15779766
 ] 

Yuming Wang commented on SPARK-15044:
-

I've tested this on v2.1.0-rc5; it works fine if 
{{spark.sql.hive.verifyPartitionPath=true}}:
{code:sql}
create table spark_15044  (n string) partitioned by (p string);

insert overwrite table spark_15044 partition(p='2016-12-27') values('27');
insert overwrite table spark_15044 partition(p='2016-12-26') values('26');
insert overwrite table spark_15044 partition(p='2016-12-25') values('25');

dfs -rmr /user/hive/warehouse/spark_15044/p=2016-12-25;

set spark.sql.hive.verifyPartitionPath=true;
select * from spark_15044;
{code}
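
For reference, a minimal Scala sketch of the same check from a Hive-enabled 
SparkSession (the app name is an assumption; the table comes from the repro 
above):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-15044-check")
  .enableHiveSupport()
  .getOrCreate()

// With this flag on, partitions whose directories were removed from HDFS are
// skipped instead of causing an "input path does not exist" failure.
spark.conf.set("spark.sql.hive.verifyPartitionPath", "true")
spark.sql("SELECT * FROM spark_15044").show()
{code}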

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a 
> partition which exists in the Hive table but whose path has been removed 
> manually. The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4) Run spark-sql: spark-sql -e "select n from test where p='1';"
> Then it throws exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK.
> I think spark-sql should ignore the path, just like Hive does (and as it did 
> in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Assigned] (SPARK-19009) Add doc for Streaming Rest API

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19009:


Assignee: (was: Apache Spark)

> Add doc for Streaming Rest API
> --
>
> Key: SPARK-19009
> URL: https://issues.apache.org/jira/browse/SPARK-19009
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19009) Add doc for Streaming Rest API

2016-12-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779762#comment-15779762
 ] 

Apache Spark commented on SPARK-19009:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16414

> Add doc for Streaming Rest API
> --
>
> Key: SPARK-19009
> URL: https://issues.apache.org/jira/browse/SPARK-19009
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19009) Add doc for Streaming Rest API

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19009:


Assignee: Apache Spark

> Add doc for Streaming Rest API
> --
>
> Key: SPARK-19009
> URL: https://issues.apache.org/jira/browse/SPARK-19009
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18710) Add offset to GeneralizedLinearRegression models

2016-12-26 Thread Wayne Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779760#comment-15779760
 ] 

Wayne Zhang commented on SPARK-18710:
-

Thanks for the comment, Yanbo. In IRLS, the fit method expects an 
RDD[Instance]. Does it still work if one feeds an RDD[GLRInstance] to it? 

{code}
  def fit(instances: RDD[Instance]): IterativelyReweightedLeastSquaresModel = {
{code}

> Add offset to GeneralizedLinearRegression models
> 
>
> Key: SPARK-18710
> URL: https://issues.apache.org/jira/browse/SPARK-18710
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
>   Original Estimate: 10h
>  Remaining Estimate: 10h
>
> The current GeneralizedLinearRegression model does not support an offset. An 
> offset can be useful to take exposure into account, or for testing the 
> incremental effect of new variables. While it is possible to use weights in 
> the current implementation to achieve the same effect as an offset for 
> certain models (e.g., Poisson and Binomial with a log offset), it is 
> desirable to have an offset option that works in more general cases, e.g., a 
> negative offset, or an offset that is hard to express through weights (e.g., 
> an offset applied to the probability rather than the odds in logistic 
> regression); see the sketch below.
> Effort would involve:
> * update the regression class to support offsetCol
> * update IRLS to take the offset into account
> * add test cases for offset
> I can start working on this if the community approves this feature. 
> 
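
A hypothetical Scala sketch of the proposed usage (setOffsetCol and the column 
names are assumptions; they are not part of the current API):
{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// Hypothetical: a Poisson model with exposure handled via a log offset.
// setOffsetCol is the proposed parameter, not an existing method.
val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setLabelCol("claimCount")
  .setFeaturesCol("features")
  .setOffsetCol("logExposure")   // offset enters the linear predictor with a fixed coefficient of 1

// val model = glr.fit(training)  // training: DataFrame with the columns above
{code}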



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19009) Add doc for Streaming Rest API

2016-12-26 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-19009:
-

 Summary: Add doc for Streaming Rest API
 Key: SPARK-19009
 URL: https://issues.apache.org/jira/browse/SPARK-19009
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 2.1.0, 2.0.2
Reporter: Genmao Yu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene

2016-12-26 Thread zhangdenghui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangdenghui updated SPARK-19007:
-
Description: 
Test data: 80G of CTR training data from Criteo Labs 
(http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
used 1 of the 24 days' data. Some features needed to be replaced by newly 
generated continuous features; the way to generate the new features follows 
the approach described in the XGBoost paper.

Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per 
executor.

I tested the GradientBoostedTrees algorithm in MLlib using the 80G CTR data 
mentioned above.

It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT 
rounds. Without these task failures and retries it could be much faster, 
saving about half the time. I think they are caused by the RDD named predError 
in the while loop of the boost method in GradientBoostedTrees.scala: the 
lineage of predError keeps growing after every GBT round, and eventually it 
caused failures like this:

(ExecutorLostFailure (executor 6 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.)

I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needed 
was too much (even increasing the memory by half did not solve the problem), 
so I don't think that is a proper fix.

Setting the checkpoint interval for predError smaller would also cut the 
lineage, but it increases the IO cost a lot.

I tried another way to solve this problem: I persist the predError RDD every 
round, keep a reference to the previous round's RDD in pre_predError, and 
unpersist it because it is no longer needed.

Finally, with this change it took about 45 minutes, with no task failures and 
no extra memory added.

So when the data is much larger than memory, this small improvement can speed 
up GradientBoostedTrees by 1~2x.


  was:
Test data: 80G of CTR training data from Criteo Labs 
(http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
used 1 of the 24 days' data. Some features needed to be replaced by newly 
generated continuous features; the way to generate the new features follows 
the approach described in the XGBoost paper.

Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per 
executor.

I tested the GradientBoostedTrees algorithm in MLlib using the 80G CTR data 
mentioned above.
It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT 
rounds. Without these task failures and retries it could be much faster, 
saving about half the time. I think they are caused by the RDD named predError 
in the while loop of the boost method in GradientBoostedTrees.scala: the 
lineage of predError keeps growing after every GBT round, and eventually it 
caused failures like this:
(ExecutorLostFailure (executor 6 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.)

I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needed 
was too much (even increasing the memory by half did not solve the problem), 
so I don't think that is a proper fix.

Setting the checkpoint interval for predError smaller would also cut the 
lineage, but it increases the IO cost a lot.

I tried another way to solve this problem: I persist the predError RDD every 
round, keep a reference to the previous round's RDD in pre_predError, and 
unpersist it because it is no longer needed.

Finally, with this change it took about 45 minutes, with no task failures and 
no extra memory added.

So when the data is much larger than memory, this small improvement can speed 
up GradientBoostedTrees by 1~2x.



> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
> 
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 
> 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120G 
> memory, 40 cores, 43TB disk per server).
>Reporter: zhangdenghui
> Fix For: 2.1.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs 
> (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
> used 1 of the 24 days' data. Some features needed to be replaced by newly 
> generated continuous features; the way to 

[jira] [Commented] (SPARK-18955) Add ability to emit kafka events to DStream or KafkaDStream

2016-12-26 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779554#comment-15779554
 ] 

Russell Jurney commented on SPARK-18955:


Can I please get feedback as to whether this patch would be accepted? I don't 
want to do the work if it isn't even something that would be merged.

> Add ability to emit kafka events to DStream or KafkaDStream
> ---
>
> Key: SPARK-18955
> URL: https://issues.apache.org/jira/browse/SPARK-18955
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams, PySpark
>Affects Versions: 2.0.2
>Reporter: Russell Jurney
>  Labels: features, newbie
>
> Any I/O that needs doing in Spark Streaming seems to have to be done in a 
> DStream.foreachRDD loop. For instance, in PySpark, if I want to emit Kafka 
> events for each record, I have to use DStream.foreachRDD and kafka-python to 
> emit a Kafka event per record (a sketch of this pattern is shown below).
> It really seems like this kind of I/O should be part of the pyspark.streaming 
> or pyspark.streaming.kafka API and the equivalent Scala APIs. Something like 
> DStream.emitKafkaEvents or KafkaDStream.emitKafkaEvents would seem to make 
> sense.
> If this is a good idea, and it seems feasible, I'd like to take a crack at it 
> as my first patch for Spark. Advice would be appreciated. What would need to 
> be modified to make this happen?
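
For context, a minimal Scala sketch of the foreachRDD pattern the description 
refers to (broker address, topic, and serializer settings are assumptions; the 
proposed emitKafkaEvents API does not exist):
{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

def emitToKafka(stream: DStream[String], topic: String, brokers: String): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One producer per partition, closed after the partition is written.
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      records.foreach(r => producer.send(new ProducerRecord[String, String](topic, r)))
      producer.close()
    }
  }
}
{code}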



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program

2016-12-26 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779544#comment-15779544
 ] 

Kazuaki Ishizaki commented on SPARK-19008:
--

I will work on this.

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from 
> Dataset program
> ---
>
> Key: SPARK-19008
> URL: https://issues.apache.org/jira/browse/SPARK-19008
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> In a 
> [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] 
> between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid 
> boxing/unboxing overhead when a Dataset program calls a Scala lambda that 
> operates on a primitive type.
> In such a case, Catalyst can directly call a method {{ 
> apply();}} instead of {{Object apply(Object);}}.
> Of course, the best solution seems to be 
> [here|https://issues.apache.org/jira/browse/SPARK-14083].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program

2016-12-26 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-19008:
-
Description: 
In a 
[discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] 
between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid 
boxing/unboxing overhead when a Dataset program calls a Scala lambda that 
operates on a primitive type.
In such a case, Catalyst can directly call a method {{ 
apply();}} instead of {{Object apply(Object);}}.

Of course, the best solution seems to be 
[here|https://issues.apache.org/jira/browse/SPARK-14083].
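
For context, a minimal sketch of the kind of Dataset program affected 
(illustrative only):
{code}
import spark.implicits._   // assumes an existing SparkSession named `spark`

// The lambda operates on primitive Ints, but the generated code currently goes
// through the generic Object apply(Object) entry point, boxing each input and
// unboxing each result.
val ds = Seq(1, 2, 3).toDS()
val doubled = ds.map(i => i * 2)
doubled.show()
{code}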

  was:
In this 
[discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] 
between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid 
boxing/unboxing overhead when a Dataset program calls a Scala lambda that 
operates on a primitive type.
In such a case, Catalyst can directly call a method {{ 
apply();}} instead of {{Object apply(Object);}}.

Of course, the best solution seems to be 
[here|https://issues.apache.org/jira/browse/SPARK-14083].


> Avoid boxing/unboxing overhead of calling a lambda with primitive type from 
> Dataset program
> ---
>
> Key: SPARK-19008
> URL: https://issues.apache.org/jira/browse/SPARK-19008
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> In a 
> [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] 
> between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid 
> boxing/unboxing overhead when a Dataset program calls a Scala lambda that 
> operates on a primitive type.
> In such a case, Catalyst can directly call a method {{ 
> apply();}} instead of {{Object apply(Object);}}.
> Of course, the best solution seems to be 
> [here|https://issues.apache.org/jira/browse/SPARK-14083].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program

2016-12-26 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-19008:


 Summary: Avoid boxing/unboxing overhead of calling a lambda with 
primitive type from Dataset program
 Key: SPARK-19008
 URL: https://issues.apache.org/jira/browse/SPARK-19008
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Kazuaki Ishizaki


In this 
[discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] 
between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid 
boxing/unboxing overhead when a Dataset program calls a Scala lambda that 
operates on a primitive type.
In such a case, Catalyst can directly call a method {{ 
apply();}} instead of {{Object apply(Object);}}.

Of course, the best solution seems to be 
[here|https://issues.apache.org/jira/browse/SPARK-14083].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene

2016-12-26 Thread zhangdenghui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangdenghui updated SPARK-19007:
-
Description: 
Test data: 80G of CTR training data from Criteo Labs 
(http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
used 1 of the 24 days' data. Some features needed to be replaced by newly 
generated continuous features; the way to generate the new features follows 
the approach described in the XGBoost paper.

Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per 
executor.

I tested the GradientBoostedTrees algorithm in MLlib using the 80G CTR data 
mentioned above.
It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT 
rounds. Without these task failures and retries it could be much faster, 
saving about half the time. I think they are caused by the RDD named predError 
in the while loop of the boost method in GradientBoostedTrees.scala: the 
lineage of predError keeps growing after every GBT round, and eventually it 
caused failures like this:
(ExecutorLostFailure (executor 6 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.)

I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needed 
was too much (even increasing the memory by half did not solve the problem), 
so I don't think that is a proper fix.

Setting the checkpoint interval for predError smaller would also cut the 
lineage, but it increases the IO cost a lot.

I tried another way to solve this problem: I persist the predError RDD every 
round, keep a reference to the previous round's RDD in pre_predError, and 
unpersist it because it is no longer needed.

Finally, with this change it took about 45 minutes, with no task failures and 
no extra memory added.

So when the data is much larger than memory, this small improvement can speed 
up GradientBoostedTrees by 1~2x.


  was:
Test data: 80G of CTR training data from Criteo Labs 
(http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
used 1 of the 24 days' data. Some features needed to be replaced by newly 
generated continuous features; the way to generate the new features follows 
the approach described in the XGBoost paper.

Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per 
executor.

I tested the GradientBoostedTrees algorithm in MLlib using the 80G CTR data 
mentioned above.
It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT 
rounds. Without these task failures and retries it could be much faster, 
saving about half the time. I think they are caused by the RDD named predError 
in the while loop of the boost method in GradientBoostedTrees.scala: the 
lineage of predError keeps growing after every GBT round, and eventually it 
caused failures like this:
(ExecutorLostFailure (executor 6 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.)

I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needed 
was too much (even increasing the memory by half did not solve the problem), 
so I don't think that is a proper fix.

Setting the checkpoint interval for predError smaller would also cut the 
lineage, but it increases the IO cost a lot.

I tried another way to solve this problem: I persist the predError RDD every 
round, keep a reference to the previous round's RDD in pre_predError, and 
unpersist it because it is no longer needed.

Finally, with this change it took about 45 minutes, with no task failures and 
no extra memory added.

So when the data is much larger than memory, this small improvement can speed 
up GradientBoostedTrees by 1~2x.



> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
> 
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 
> 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120G 
> memory, 40 cores, 43TB disk per server).
>Reporter: zhangdenghui
> Fix For: 2.1.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs 
> (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
> used 1 of the 24 days' data. Some features needed to be replaced by newly 
> generated continuous features; the way to 

[jira] [Created] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene

2016-12-26 Thread zhangdenghui (JIRA)
zhangdenghui created SPARK-19007:


 Summary: Speedup and optimize the GradientBoostedTrees in the 
"data>memory" scene
 Key: SPARK-19007
 URL: https://issues.apache.org/jira/browse/SPARK-19007
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0, 1.6.3, 1.6.2, 1.6.1, 1.6.0, 
1.5.2, 1.5.1, 1.5.0
 Environment: A CDH cluster consisting of 3 Red Hat servers (120G 
memory, 40 cores, 43TB disk per server).

Reporter: zhangdenghui
 Fix For: 2.1.0


Test data: 80G of CTR training data from Criteo Labs 
(http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/). I 
used 1 of the 24 days' data. Some features needed to be replaced by newly 
generated continuous features; the way to generate the new features follows 
the approach described in the XGBoost paper.

Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per 
executor.

I tested the GradientBoostedTrees algorithm in MLlib using the 80G CTR data 
mentioned above.
It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT 
rounds. Without these task failures and retries it could be much faster, 
saving about half the time. I think they are caused by the RDD named predError 
in the while loop of the boost method in GradientBoostedTrees.scala: the 
lineage of predError keeps growing after every GBT round, and eventually it 
caused failures like this:
(ExecutorLostFailure (executor 6 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.)

I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needed 
was too much (even increasing the memory by half did not solve the problem), 
so I don't think that is a proper fix.

Setting the checkpoint interval for predError smaller would also cut the 
lineage, but it increases the IO cost a lot.

I tried another way to solve this problem: I persist the predError RDD every 
round, keep a reference to the previous round's RDD in pre_predError, and 
unpersist it because it is no longer needed.

Finally, with this change it took about 45 minutes, with no task failures and 
no extra memory added.

So when the data is much larger than memory, this small improvement can speed 
up GradientBoostedTrees by 1~2x.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18931) Create empty staging directory in partitioned table on insert

2016-12-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779439#comment-15779439
 ] 

Hyukjin Kwon commented on SPARK-18931:
--

Could we resolve this as a duplicate?

> Create empty staging directory in partitioned table on insert
> -
>
> Key: SPARK-18931
> URL: https://issues.apache.org/jira/browse/SPARK-18931
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> CREATE TABLE temp.test_partitioning_4 (
>   num string
>  ) 
> PARTITIONED BY (
>   day string)
>   stored as parquet
> On every 
> INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
> select day, count(*) as num from 
> hss.session where year=2016 and month=4 
> group by day
> a new directory 
> ".hive-staging_hive_2016-12-19_15-55-11_298_3412488541559534475-4" is created 
> on HDFS. It's a big issue, because I insert every day, and a bunch of empty 
> directories is very bad for HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2016-12-26 Thread Umesh Chaudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779436#comment-15779436
 ] 

Umesh Chaudhary commented on SPARK-3528:


[~gip] It looks like when we provide a file:/// URI, executors should be able 
to find it from their respective hosts, so NODE_LOCAL makes sense. I found 
this [1] user-list discussion worth mentioning. IMHO, executors read the data 
(from the mentioned URI) into their own JVM, and after that the data becomes 
PROCESS_LOCAL for the executors. 

[1] 
http://apache-spark-user-list.1001560.n3.nabble.com/When-does-Spark-switch-from-PROCESS-LOCAL-to-NODE-LOCAL-or-RACK-LOCAL-td7091.html



> Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
> 
>
> Key: SPARK-3528
> URL: https://issues.apache.org/jira/browse/SPARK-3528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Priority: Critical
>
> Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task
> {noformat}
> spark> sc.textFile("pom.xml").count
> ...
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, PROCESS_LOCAL, 1191 bytes)
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
> localhost, PROCESS_LOCAL, 1191 bytes)
> 14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
> file:/Users/aash/git/spark/pom.xml:20862+20863
> 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
> file:/Users/aash/git/spark/pom.xml:0+20862
> {noformat}
> There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:
> {noformat}
>   override def getPreferredLocations(split: Partition): Seq[String] = {
> // TODO: Filtering out "localhost" in case of file:// URLs
> val hadoopSplit = split.asInstanceOf[HadoopPartition]
> hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18997) Recommended upgrade libthrift to 0.9.3

2016-12-26 Thread meiyoula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779389#comment-15779389
 ] 

meiyoula commented on SPARK-18997:
--

I think the libthrift dependency mainly comes from Hive. Hive has now upgraded to 0.9.3.

> Recommended upgrade libthrift  to 0.9.3
> ---
>
> Key: SPARK-18997
> URL: https://issues.apache.org/jira/browse/SPARK-18997
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: meiyoula
>Priority: Critical
>
> libthrift 0.9.2 has a serious security vulnerability: CVE-2015-3254



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19002) Check pep8 against dev/*.py scripts

2016-12-26 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19002:
-
Summary: Check pep8 against dev/*.py scripts  (was: Check pep8 against 
merge_spark_pr.py script)

> Check pep8 against dev/*.py scripts
> ---
>
> Key: SPARK-19002
> URL: https://issues.apache.org/jira/browse/SPARK-19002
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> We can check pep8 against merge_spark_pr.py script. There are already several 
> python scripts there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19002) Check pep8 against dev/*.py scripts

2016-12-26 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19002:
-
Description: We can check pep8 against dev/*.py scripts. There are already 
several python scripts being checked.  (was: We can check pep8 against 
merge_spark_pr.py script. There are already several python scripts there.)

> Check pep8 against dev/*.py scripts
> ---
>
> Key: SPARK-19002
> URL: https://issues.apache.org/jira/browse/SPARK-19002
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> We can check pep8 against dev/*.py scripts. There are already several python 
> scripts being checked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19006) should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19006:


Assignee: Apache Spark

> should mentioned the max value allowed for spark.kryoserializer.buffer.max in 
> doc
> -
>
> Key: SPARK-19006
> URL: https://issues.apache.org/jira/browse/SPARK-19006
> Project: Spark
>  Issue Type: Documentation
>Reporter: Yuexin Zhang
>Assignee: Apache Spark
>
> On configuration doc 
> page:https://spark.apache.org/docs/latest/configuration.html
> We mentioned spark.kryoserializer.buffer.max : Maximum allowable size of Kryo 
> serialization buffer. This must be larger than any object you attempt to 
> serialize. Increase this if you get a "buffer limit exceeded" exception 
> inside Kryo.
> From the source code, it has a hard-coded upper limit:
> val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", 
> "64m").toInt
>   if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
> throw new IllegalArgumentException("spark.kryoserializer.buffer.max must 
> be less than " +
>   s"2048 mb, got: + $maxBufferSizeMb mb.")
>   }
> We should mention "this value must be less than 2048 mb" on the config page 
> as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19006) should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19006:


Assignee: (was: Apache Spark)

> should mentioned the max value allowed for spark.kryoserializer.buffer.max in 
> doc
> -
>
> Key: SPARK-19006
> URL: https://issues.apache.org/jira/browse/SPARK-19006
> Project: Spark
>  Issue Type: Documentation
>Reporter: Yuexin Zhang
>
> On configuration doc 
> page:https://spark.apache.org/docs/latest/configuration.html
> We mentioned spark.kryoserializer.buffer.max : Maximum allowable size of Kryo 
> serialization buffer. This must be larger than any object you attempt to 
> serialize. Increase this if you get a "buffer limit exceeded" exception 
> inside Kryo.
> From the source code, it has a hard-coded upper limit:
> val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", 
> "64m").toInt
>   if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
> throw new IllegalArgumentException("spark.kryoserializer.buffer.max must 
> be less than " +
>   s"2048 mb, got: + $maxBufferSizeMb mb.")
>   }
> We should mention "this value must be less than 2048 mb" on the config page 
> as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19006) should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc

2016-12-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779334#comment-15779334
 ] 

Apache Spark commented on SPARK-19006:
--

User 'cnZach' has created a pull request for this issue:
https://github.com/apache/spark/pull/16412

> should mentioned the max value allowed for spark.kryoserializer.buffer.max in 
> doc
> -
>
> Key: SPARK-19006
> URL: https://issues.apache.org/jira/browse/SPARK-19006
> Project: Spark
>  Issue Type: Documentation
>Reporter: Yuexin Zhang
>
> On configuration doc 
> page:https://spark.apache.org/docs/latest/configuration.html
> We mentioned spark.kryoserializer.buffer.max : Maximum allowable size of Kryo 
> serialization buffer. This must be larger than any object you attempt to 
> serialize. Increase this if you get a "buffer limit exceeded" exception 
> inside Kryo.
> From the source code, it has a hard-coded upper limit:
> val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", 
> "64m").toInt
>   if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
> throw new IllegalArgumentException("spark.kryoserializer.buffer.max must 
> be less than " +
>   s"2048 mb, got: + $maxBufferSizeMb mb.")
>   }
> We should mention "this value must be less than 2048 mb" on the config page 
> as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19006) should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc

2016-12-26 Thread Yuexin Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuexin Zhang updated SPARK-19006:
-
Description: 
On the configuration doc page 
(https://spark.apache.org/docs/latest/configuration.html) we describe 
spark.kryoserializer.buffer.max as: "Maximum allowable size of Kryo 
serialization buffer. This must be larger than any object you attempt to 
serialize. Increase this if you get a 'buffer limit exceeded' exception inside 
Kryo."

However, the source code has a hard-coded upper limit:
val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", "64m").toInt
if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
  throw new IllegalArgumentException("spark.kryoserializer.buffer.max must be less than " +
    s"2048 mb, got: + $maxBufferSizeMb mb.")
}

We should mention "this value must be less than 2048 mb" on the config page as 
well.
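
As an editor's illustration (not from the ticket), a minimal sketch of staying 
below the hard-coded limit when configuring Kryo; the application name is 
arbitrary:

{code}
import org.apache.spark.SparkConf

// Sketch only: spark.kryoserializer.buffer.max must stay strictly below 2048m,
// otherwise KryoSerializer throws the IllegalArgumentException quoted above.
val conf = new SparkConf()
  .setAppName("kryo-buffer-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "1g")   // accepted: 1024m < 2048m
// .set("spark.kryoserializer.buffer.max", "2g")  // rejected: 2048m is not < 2048m
{code}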

> should mentioned the max value allowed for spark.kryoserializer.buffer.max in 
> doc
> -
>
> Key: SPARK-19006
> URL: https://issues.apache.org/jira/browse/SPARK-19006
> Project: Spark
>  Issue Type: Documentation
>Reporter: Yuexin Zhang
>
> On configuration doc 
> page:https://spark.apache.org/docs/latest/configuration.html
> We mentioned spark.kryoserializer.buffer.max : Maximum allowable size of Kryo 
> serialization buffer. This must be larger than any object you attempt to 
> serialize. Increase this if you get a "buffer limit exceeded" exception 
> inside Kryo.
> from source code, it has hard coded upper limit :
> val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", 
> "64m").toInt
>   if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
> throw new IllegalArgumentException("spark.kryoserializer.buffer.max must 
> be less than " +
>   s"2048 mb, got: + $maxBufferSizeMb mb.")
>   }
> We should mention "this value must be less than 2048 mb" on the config page 
> as well.






[jira] [Commented] (SPARK-17984) Add support for numa aware feature

2016-12-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779305#comment-15779305
 ] 

Apache Spark commented on SPARK-17984:
--

User 'xiaochang-wu' has created a pull request for this issue:
https://github.com/apache/spark/pull/16411

> Add support for numa aware feature
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This Jira targets adding support for a NUMA-aware feature, which can help improve 
> performance by making cores access local memory rather than remote memory. 
> A patch is being developed, see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks and will be developed iteratively:
> Numa aware support for Yarn based deployment mode
> Numa aware support for Mesos based deployment mode
> Numa aware support for Standalone based deployment mode






[jira] [Created] (SPARK-19006) should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc

2016-12-26 Thread Yuexin Zhang (JIRA)
Yuexin Zhang created SPARK-19006:


 Summary: should mentioned the max value allowed for 
spark.kryoserializer.buffer.max in doc
 Key: SPARK-19006
 URL: https://issues.apache.org/jira/browse/SPARK-19006
 Project: Spark
  Issue Type: Documentation
Reporter: Yuexin Zhang









[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2016-12-26 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779288#comment-15779288
 ] 

Kazuaki Ishizaki commented on SPARK-14083:
--

[Here|https://github.com/apache/spark/pull/16391#discussion_r93788919] is 
another motivation to apply this optimization. cc: [~cloud_fan]

> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
> lit(18))
> df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note that a big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.






[jira] [Assigned] (SPARK-19005) Keep column ordering when a schema is explicitly specified

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19005:


Assignee: Apache Spark

>  Keep column ordering when a schema is explicitly specified
> ---
>
> Key: SPARK-19005
> URL: https://issues.apache.org/jira/browse/SPARK-19005
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> This ticket is to keep column ordering when a schema is explicitly specified.
> A concrete example is as follows;
> {code}
> scala> import org.apache.spark.sql.types._
> scala> case class A(a: Long, b: Int)
> scala> val as = Seq(A(1, 2))
> scala> 
> spark.createDataFrame(as).write.parquet("/Users/maropu/Desktop/data/a=1/")
> scala> val df = spark.read.parquet("/Users/maropu/Desktop/data/")
> scala> df.printSchema
> root
>  |-- a: integer (nullable = true)
>  |-- b: integer (nullable = true)
> scala> val schema = new StructType().add("a", LongType).add("b", IntegerType)
> scala> val df = 
> spark.read.schema(schema).parquet("/Users/maropu/Desktop/data/")
> scala> df.printSchema
> root
>  |-- b: integer (nullable = true)
>  |-- a: long (nullable = true)
> {code}






[jira] [Commented] (SPARK-19005) Keep column ordering when a schema is explicitly specified

2016-12-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779235#comment-15779235
 ] 

Apache Spark commented on SPARK-19005:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/16410

>  Keep column ordering when a schema is explicitly specified
> ---
>
> Key: SPARK-19005
> URL: https://issues.apache.org/jira/browse/SPARK-19005
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket is to keep column ordering when a schema is explicitly specified.
> A concrete example is as follows;
> {code}
> scala> import org.apache.spark.sql.types._
> scala> case class A(a: Long, b: Int)
> scala> val as = Seq(A(1, 2))
> scala> 
> spark.createDataFrame(as).write.parquet("/Users/maropu/Desktop/data/a=1/")
> scala> val df = spark.read.parquet("/Users/maropu/Desktop/data/")
> scala> df.printSchema
> root
>  |-- a: integer (nullable = true)
>  |-- b: integer (nullable = true)
> scala> val schema = new StructType().add("a", LongType).add("b", IntegerType)
> scala> val df = 
> spark.read.schema(schema).parquet("/Users/maropu/Desktop/data/")
> scala> df.printSchema
> root
>  |-- b: integer (nullable = true)
>  |-- a: long (nullable = true)
> {code}






[jira] [Assigned] (SPARK-19005) Keep column ordering when a schema is explicitly specified

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19005:


Assignee: (was: Apache Spark)

>  Keep column ordering when a schema is explicitly specified
> ---
>
> Key: SPARK-19005
> URL: https://issues.apache.org/jira/browse/SPARK-19005
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket is to keep column ordering when a schema is explicitly specified.
> A concrete example is as follows;
> {code}
> scala> import org.apache.spark.sql.types._
> scala> case class A(a: Long, b: Int)
> scala> val as = Seq(A(1, 2))
> scala> 
> spark.createDataFrame(as).write.parquet("/Users/maropu/Desktop/data/a=1/")
> scala> val df = spark.read.parquet("/Users/maropu/Desktop/data/")
> scala> df.printSchema
> root
>  |-- a: integer (nullable = true)
>  |-- b: integer (nullable = true)
> scala> val schema = new StructType().add("a", LongType).add("b", IntegerType)
> scala> val df = 
> spark.read.schema(schema).parquet("/Users/maropu/Desktop/data/")
> scala> df.printSchema
> root
>  |-- b: integer (nullable = true)
>  |-- a: long (nullable = true)
> {code}






[jira] [Created] (SPARK-19005) Keep column ordering when a schema is explicitly specified

2016-12-26 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-19005:


 Summary:  Keep column ordering when a schema is explicitly 
specified
 Key: SPARK-19005
 URL: https://issues.apache.org/jira/browse/SPARK-19005
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Takeshi Yamamuro
Priority: Minor


This ticket is to keep column ordering when a schema is explicitly specified.
A concrete example is as follows:


{code}
scala> import org.apache.spark.sql.types._
scala> case class A(a: Long, b: Int)
scala> val as = Seq(A(1, 2))
scala> 
spark.createDataFrame(as).write.parquet("/Users/maropu/Desktop/data/a=1/")
scala> val df = spark.read.parquet("/Users/maropu/Desktop/data/")
scala> df.printSchema
root
 |-- a: integer (nullable = true)
 |-- b: integer (nullable = true)

scala> val schema = new StructType().add("a", LongType).add("b", IntegerType)
scala> val df = spark.read.schema(schema).parquet("/Users/maropu/Desktop/data/")
scala> df.printSchema
root
 |-- b: integer (nullable = true)
 |-- a: long (nullable = true)
{code}
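
One possible workaround until this is fixed (editor's sketch, not from the 
ticket; it reuses the df and schema values from the example above) is to 
reorder the columns explicitly after the read:

{code}
import org.apache.spark.sql.functions.col

// Sketch only: select the columns in the order declared by the user-specified schema.
val ordered = df.select(schema.fieldNames.map(col): _*)
ordered.printSchema()
// root
//  |-- a: long (nullable = true)
//  |-- b: integer (nullable = true)
{code}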






[jira] [Commented] (SPARK-18991) Change ContextCleaner.referenceBuffer to ConcurrentHashMap to make it faster

2016-12-26 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779002#comment-15779002
 ] 

Shixiong Zhu commented on SPARK-18991:
--

[~prashant_] FYI, this may also be necessary for your Kafka benchmark. I 
observed that our internal jobs failed with OOMs because ContextCleaner was too 
slow, even after disabling blocking cleanup.

> Change ContextCleaner.referenceBuffer to ConcurrentHashMap to make it faster
> 
>
> Key: SPARK-18991
> URL: https://issues.apache.org/jira/browse/SPARK-18991
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> Right now `ContextCleaner.referenceBuffer` is a ConcurrentLinkedQueue, and the 
> time complexity of its `remove` action is O(n). It can be changed to use a 
> ConcurrentHashMap, whose `remove` is O(1).
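
To make the complexity difference concrete (editor's sketch; the reference 
class and field names are simplified stand-ins, not the real ContextCleaner 
code):

{code}
import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue}

// Hypothetical stand-in for the real cleanup reference type.
case class CleanupTaskWeakReference(id: Long)

// Before: ConcurrentLinkedQueue.remove(ref) scans the queue, so each removal is O(n).
val queueBuffer = new ConcurrentLinkedQueue[CleanupTaskWeakReference]()

// After: a set view backed by ConcurrentHashMap gives O(1) add/remove on average.
val setBuffer = ConcurrentHashMap.newKeySet[CleanupTaskWeakReference]()

val ref = CleanupTaskWeakReference(1L)
queueBuffer.add(ref); queueBuffer.remove(ref)  // linear scan
setBuffer.add(ref); setBuffer.remove(ref)      // hash lookup
{code}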






[jira] [Commented] (SPARK-18997) Recommended upgrade libthrift to 0.9.3

2016-12-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778946#comment-15778946
 ] 

Dongjoon Hyun commented on SPARK-18997:
---

+1

> Recommended upgrade libthrift  to 0.9.3
> ---
>
> Key: SPARK-18997
> URL: https://issues.apache.org/jira/browse/SPARK-18997
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: meiyoula
>Priority: Critical
>
> libthrift 0.9.2 has a serious security vulnerability: CVE-2015-3254






[jira] [Comment Edited] (SPARK-19000) Spark beeline: table was created at default database even though specifing a database name

2016-12-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778943#comment-15778943
 ] 

Dongjoon Hyun edited comment on SPARK-19000 at 12/26/16 8:32 PM:
-

Thank you for reporting, [~ouyangxc].
Yep. It was correct. Also, the issue was resolved at SPARK-17819 since 2.0.2.
Could you try the latest Apache Spark 2.1.0 again?

http://www-us.apache.org/dist/spark/spark-2.1.0/


was (Author: dongjoon):
Thank you for reporting, [~ouyangxc].
Yep. It was correct. The issue is resolved at SPARK-17819 since 2.0.2.
Could you try the latest Apache Spark 2.1.0 again?

http://www-us.apache.org/dist/spark/spark-2.1.0/

> Spark beeline: table was created  at default database  even though specifing 
> a database name
> 
>
> Key: SPARK-19000
> URL: https://issues.apache.org/jira/browse/SPARK-19000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark2.0.0
>Reporter: Xiaochen Ouyang
>
> After running this command as follow:
> ${SPARK_HOME}/bin/beeline -u jdbc:hive2://192.168.156.5:10001/mydb -n mr -e 
> "create table test_1(id int ,name string) row format delimited fields 
> terminated by ',' stored as textfile;"
> I found the table "test_1" was create at default database other than mydb 
> database that i specified。






[jira] [Resolved] (SPARK-19000) Spark beeline: table was created at default database even though specifing a database name

2016-12-26 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-19000.
---
Resolution: Duplicate

> Spark beeline: table was created  at default database  even though specifing 
> a database name
> 
>
> Key: SPARK-19000
> URL: https://issues.apache.org/jira/browse/SPARK-19000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark2.0.0
>Reporter: Xiaochen Ouyang
>
> After running this command as follow:
> ${SPARK_HOME}/bin/beeline -u jdbc:hive2://192.168.156.5:10001/mydb -n mr -e 
> "create table test_1(id int ,name string) row format delimited fields 
> terminated by ',' stored as textfile;"
> I found the table "test_1" was create at default database other than mydb 
> database that i specified。






[jira] [Commented] (SPARK-19000) Spark beeline: table was created at default database even though specifing a database name

2016-12-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778943#comment-15778943
 ] 

Dongjoon Hyun commented on SPARK-19000:
---

Thank you for reporting, [~ouyangxc].
Yep. It was correct. The issue is resolved at SPARK-17819 since 2.0.2.
Could you try the latest Apache Spark 2.1.0 again?

http://www-us.apache.org/dist/spark/spark-2.1.0/

> Spark beeline: table was created  at default database  even though specifing 
> a database name
> 
>
> Key: SPARK-19000
> URL: https://issues.apache.org/jira/browse/SPARK-19000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark2.0.0
>Reporter: Xiaochen Ouyang
>
> After running this command as follow:
> ${SPARK_HOME}/bin/beeline -u jdbc:hive2://192.168.156.5:10001/mydb -n mr -e 
> "create table test_1(id int ,name string) row format delimited fields 
> terminated by ',' stored as textfile;"
> I found the table "test_1" was create at default database other than mydb 
> database that i specified。






[jira] [Resolved] (SPARK-18989) DESC TABLE should not fail with format class not found

2016-12-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18989.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16388
[https://github.com/apache/spark/pull/16388]

> DESC TABLE should not fail with format class not found
> --
>
> Key: SPARK-18989
> URL: https://issues.apache.org/jira/browse/SPARK-18989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>







[jira] [Assigned] (SPARK-19004) Fix `testH2Dialect` by removing `getCatalystType`

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19004:


Assignee: (was: Apache Spark)

> Fix `testH2Dialect` by removing `getCatalystType`
> -
>
> Key: SPARK-19004
> URL: https://issues.apache.org/jira/browse/SPARK-19004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `JdbcDialect` subclasses should return `None` by default. However, 
> `testH2Dialect.getCatalystType` returns always `Some(StringType)` incorrectly.
> This issue fixes `testH2Dialect` by removing the wrong `getCatalystType` 
> implementation.






[jira] [Commented] (SPARK-19004) Fix `testH2Dialect` by removing `getCatalystType`

2016-12-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778847#comment-15778847
 ] 

Apache Spark commented on SPARK-19004:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16409

> Fix `testH2Dialect` by removing `getCatalystType`
> -
>
> Key: SPARK-19004
> URL: https://issues.apache.org/jira/browse/SPARK-19004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `JdbcDialect` subclasses should return `None` by default. However, 
> `testH2Dialect.getCatalystType` returns always `Some(StringType)` incorrectly.
> This issue fixes `testH2Dialect` by removing the wrong `getCatalystType` 
> implementation.






[jira] [Assigned] (SPARK-19004) Fix `testH2Dialect` by removing `getCatalystType`

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19004:


Assignee: Apache Spark

> Fix `testH2Dialect` by removing `getCatalystType`
> -
>
> Key: SPARK-19004
> URL: https://issues.apache.org/jira/browse/SPARK-19004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> `JdbcDialect` subclasses should return `None` by default. However, 
> `testH2Dialect.getCatalystType` returns always `Some(StringType)` incorrectly.
> This issue fixes `testH2Dialect` by removing the wrong `getCatalystType` 
> implementation.






[jira] [Created] (SPARK-19004) Fix `testH2Dialect` by removing `getCatalystType`

2016-12-26 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-19004:
-

 Summary: Fix `testH2Dialect` by removing `getCatalystType`
 Key: SPARK-19004
 URL: https://issues.apache.org/jira/browse/SPARK-19004
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.1.0
Reporter: Dongjoon Hyun
Priority: Minor


`JdbcDialect` subclasses should return `None` by default. However, 
`testH2Dialect.getCatalystType` incorrectly always returns `Some(StringType)`.

This issue fixes `testH2Dialect` by removing the wrong `getCatalystType` 
implementation.
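
For context, a minimal sketch (editor's illustration, not the actual test code; 
the dialect object name is hypothetical) of a dialect that defers type mapping 
to Spark's default behaviour by returning None:

{code}
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{DataType, MetadataBuilder}

// Sketch only: overriding getCatalystType but returning None keeps the default JDBC
// type mapping; always returning Some(StringType) would force every column to string.
object ExampleH2Dialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = None
}
{code}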






[jira] [Assigned] (SPARK-19003) Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD"

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19003:


Assignee: (was: Apache Spark)

> Add Java examples in "Spark Streaming  Guide", section "Design Patterns for 
> using foreachRDD"
> -
>
> Key: SPARK-19003
> URL: https://issues.apache.org/jira/browse/SPARK-19003
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Tushar Adeshara
>Priority: Minor
>
> The page http://spark.apache.org/docs/latest/streaming-programming-guide.html 
> is missing Java example in section "Design Patterns for using foreachRDD". 
> Except this section, the page has Scala, Java and Python examples for all 
> other sections, so would be good to add for consistency. 
> I have made required code changes, will raise a pull request against this.






[jira] [Commented] (SPARK-19003) Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD"

2016-12-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778698#comment-15778698
 ] 

Apache Spark commented on SPARK-19003:
--

User 'adesharatushar' has created a pull request for this issue:
https://github.com/apache/spark/pull/16408

> Add Java examples in "Spark Streaming  Guide", section "Design Patterns for 
> using foreachRDD"
> -
>
> Key: SPARK-19003
> URL: https://issues.apache.org/jira/browse/SPARK-19003
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Tushar Adeshara
>Priority: Minor
>
> The page http://spark.apache.org/docs/latest/streaming-programming-guide.html 
> is missing Java example in section "Design Patterns for using foreachRDD". 
> Except this section, the page has Scala, Java and Python examples for all 
> other sections, so would be good to add for consistency. 
> I have made required code changes, will raise a pull request against this.






[jira] [Assigned] (SPARK-19003) Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD"

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19003:


Assignee: Apache Spark

> Add Java examples in "Spark Streaming  Guide", section "Design Patterns for 
> using foreachRDD"
> -
>
> Key: SPARK-19003
> URL: https://issues.apache.org/jira/browse/SPARK-19003
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Tushar Adeshara
>Assignee: Apache Spark
>Priority: Minor
>
> The page http://spark.apache.org/docs/latest/streaming-programming-guide.html 
> is missing Java example in section "Design Patterns for using foreachRDD". 
> Except this section, the page has Scala, Java and Python examples for all 
> other sections, so would be good to add for consistency. 
> I have made required code changes, will raise a pull request against this.






[jira] [Commented] (SPARK-19003) Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD"

2016-12-26 Thread Tushar Adeshara (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778699#comment-15778699
 ] 

Tushar Adeshara commented on SPARK-19003:
-

Pull request https://github.com/apache/spark/pull/16408 

> Add Java examples in "Spark Streaming  Guide", section "Design Patterns for 
> using foreachRDD"
> -
>
> Key: SPARK-19003
> URL: https://issues.apache.org/jira/browse/SPARK-19003
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Tushar Adeshara
>Priority: Minor
>
> The page http://spark.apache.org/docs/latest/streaming-programming-guide.html 
> is missing Java example in section "Design Patterns for using foreachRDD". 
> Except this section, the page has Scala, Java and Python examples for all 
> other sections, so would be good to add for consistency. 
> I have made required code changes, will raise a pull request against this.






[jira] [Created] (SPARK-19003) Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD"

2016-12-26 Thread Tushar Adeshara (JIRA)
Tushar Adeshara created SPARK-19003:
---

 Summary: Add Java examples in "Spark Streaming  Guide", section 
"Design Patterns for using foreachRDD"
 Key: SPARK-19003
 URL: https://issues.apache.org/jira/browse/SPARK-19003
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.2.0
Reporter: Tushar Adeshara
Priority: Minor


The page http://spark.apache.org/docs/latest/streaming-programming-guide.html 
is missing a Java example in the section "Design Patterns for using foreachRDD". 

Except for this section, the page has Scala, Java and Python examples for all 
other sections, so it would be good to add one for consistency. 

I have made the required code changes and will raise a pull request against this.
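
For readers who have not seen that section, this is the gist of the pattern it 
documents (editor's sketch in Scala, closely following the guide's existing 
example; dstream is any input DStream and ConnectionPool is a hypothetical 
pooled-connection helper):

{code}
// Sketch only: create or borrow heavyweight connections once per partition on the
// executors, rather than once per record or on the driver.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = ConnectionPool.getConnection()   // hypothetical helper
    partitionOfRecords.foreach(record => connection.send(record.toString))
    ConnectionPool.returnConnection(connection)       // return for reuse across batches
  }
}
{code}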








[jira] [Resolved] (SPARK-18980) implement Aggregator with TypedImperativeAggregate

2016-12-26 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18980.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> implement Aggregator with TypedImperativeAggregate
> --
>
> Key: SPARK-18980
> URL: https://issues.apache.org/jira/browse/SPARK-18980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>







[jira] [Updated] (SPARK-19001) Worker will submit multiply CleanWorkDir and SendHeartbeat task with each RegisterWorker response

2016-12-26 Thread liujianhui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujianhui updated SPARK-19001:
---
Description: 
h2. scene
worker will submit multiply task of SendHeartbeat and CleanWorkDir if the 
worker register itself again, this code as follow
{code}
private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = 
synchronized {
msg match {
  case RegisteredWorker(masterRef, masterWebUiUrl) =>
logInfo("Successfully registered with master " + 
masterRef.address.toSparkURL)
registered = true
changeMaster(masterRef, masterWebUiUrl)
forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = Utils.tryLogNonFatalError {
self.send(SendHeartbeat)
  }
}, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
if (CLEANUP_ENABLED) {
  logInfo(
s"Worker cleanup enabled; old application directories will be 
deleted in: $workDir")
  forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
override def run(): Unit = Utils.tryLogNonFatalError {
  self.send(WorkDirCleanup)
}
  }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, 
TimeUnit.MILLISECONDS)
}
{code} 

log as follow
{code}
2016-12-20 20:23:30,030 | Successfully registered with master 
spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58)
2016-12-26 20:41:58,058 | Successfully registered with master 
spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58)
{code}

if the worker Send RegisterWorker event multiply many times when the master 
found the heartbeat of the worker expired, then it will submit task multiply 
times

  was:
h2. scene
worker will submit multiply task of SendHeartbeat and CleanWorkDir if the 
worker register itself again, this code as follow
{code}
private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = 
synchronized {
msg match {
  case RegisteredWorker(masterRef, masterWebUiUrl) =>
logInfo("Successfully registered with master " + 
masterRef.address.toSparkURL)
registered = true
changeMaster(masterRef, masterWebUiUrl)
forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = Utils.tryLogNonFatalError {
self.send(SendHeartbeat)
  }
}, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
if (CLEANUP_ENABLED) {
  logInfo(
s"Worker cleanup enabled; old application directories will be 
deleted in: $workDir")
  forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
override def run(): Unit = Utils.tryLogNonFatalError {
  self.send(WorkDirCleanup)
}
  }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, 
TimeUnit.MILLISECONDS)
}
{code} 

log as follow
{code}
2016-12-20 20:23:30,030 | Successfully registered with master 
spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58)
2016-12-26 20:41:58,058 | Successfully registered with master 
spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58)
{code}

if the worker Send RegisterWorker event multiply many times if the master found 
the heartbeat of the worker expired, then it will submit task multiply times


> Worker will submit multiply CleanWorkDir and SendHeartbeat task with each 
> RegisterWorker response
> -
>
> Key: SPARK-19001
> URL: https://issues.apache.org/jira/browse/SPARK-19001
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.1
>Reporter: liujianhui
>Priority: Minor
>
> h2. scene
> worker will submit multiply task of SendHeartbeat and CleanWorkDir if the 
> worker register itself again, this code as follow
> {code}
> private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = 
> synchronized {
> msg match {
>   case RegisteredWorker(masterRef, masterWebUiUrl) =>
> logInfo("Successfully registered with master " + 
> masterRef.address.toSparkURL)
> registered = true
> changeMaster(masterRef, masterWebUiUrl)
> forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
>   override def run(): Unit = Utils.tryLogNonFatalError {
> self.send(SendHeartbeat)
>   }
> }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
> if (CLEANUP_ENABLED) {
>   logInfo(
> s"Worker cleanup enabled; old application directories will be 
> deleted in: $workDir")
>   forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
> override def run(): Unit = Utils.tryLogNonFatalError {
>   

[jira] [Commented] (SPARK-19001) Worker will submit multiply CleanWorkDir and SendHeartbeat task with each RegisterWorker response

2016-12-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778382#comment-15778382
 ] 

Apache Spark commented on SPARK-19001:
--

User 'liujianhuiouc' has created a pull request for this issue:
https://github.com/apache/spark/pull/16406

> Worker will submit multiply CleanWorkDir and SendHeartbeat task with each 
> RegisterWorker response
> -
>
> Key: SPARK-19001
> URL: https://issues.apache.org/jira/browse/SPARK-19001
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.1
>Reporter: liujianhui
>Priority: Minor
>
> h2. scene
> worker will submit multiply task of SendHeartbeat and CleanWorkDir if the 
> worker register itself again, this code as follow
> {code}
> private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = 
> synchronized {
> msg match {
>   case RegisteredWorker(masterRef, masterWebUiUrl) =>
> logInfo("Successfully registered with master " + 
> masterRef.address.toSparkURL)
> registered = true
> changeMaster(masterRef, masterWebUiUrl)
> forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
>   override def run(): Unit = Utils.tryLogNonFatalError {
> self.send(SendHeartbeat)
>   }
> }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
> if (CLEANUP_ENABLED) {
>   logInfo(
> s"Worker cleanup enabled; old application directories will be 
> deleted in: $workDir")
>   forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
> override def run(): Unit = Utils.tryLogNonFatalError {
>   self.send(WorkDirCleanup)
> }
>   }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, 
> TimeUnit.MILLISECONDS)
> }
> {code} 
> log as follow
> {code}
> 2016-12-20 20:23:30,030 | Successfully registered with master 
> spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58)
> 2016-12-26 20:41:58,058 | Successfully registered with master 
> spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58)
> {code}
> if the worker Send RegisterWorker event multiply many times if the master 
> found the heartbeat of the worker expired, then it will submit task multiply 
> times






[jira] [Assigned] (SPARK-19001) Worker will submit multiply CleanWorkDir and SendHeartbeat task with each RegisterWorker response

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19001:


Assignee: (was: Apache Spark)

> Worker will submit multiply CleanWorkDir and SendHeartbeat task with each 
> RegisterWorker response
> -
>
> Key: SPARK-19001
> URL: https://issues.apache.org/jira/browse/SPARK-19001
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.1
>Reporter: liujianhui
>Priority: Minor
>
> h2. scene
> worker will submit multiply task of SendHeartbeat and CleanWorkDir if the 
> worker register itself again, this code as follow
> {code}
> private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = 
> synchronized {
> msg match {
>   case RegisteredWorker(masterRef, masterWebUiUrl) =>
> logInfo("Successfully registered with master " + 
> masterRef.address.toSparkURL)
> registered = true
> changeMaster(masterRef, masterWebUiUrl)
> forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
>   override def run(): Unit = Utils.tryLogNonFatalError {
> self.send(SendHeartbeat)
>   }
> }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
> if (CLEANUP_ENABLED) {
>   logInfo(
> s"Worker cleanup enabled; old application directories will be 
> deleted in: $workDir")
>   forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
> override def run(): Unit = Utils.tryLogNonFatalError {
>   self.send(WorkDirCleanup)
> }
>   }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, 
> TimeUnit.MILLISECONDS)
> }
> {code} 
> log as follow
> {code}
> 2016-12-20 20:23:30,030 | Successfully registered with master 
> spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58)
> 2016-12-26 20:41:58,058 | Successfully registered with master 
> spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58)
> {code}
> if the worker Send RegisterWorker event multiply many times if the master 
> found the heartbeat of the worker expired, then it will submit task multiply 
> times






[jira] [Assigned] (SPARK-19001) Worker will submit multiply CleanWorkDir and SendHeartbeat task with each RegisterWorker response

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19001:


Assignee: Apache Spark

> Worker will submit multiply CleanWorkDir and SendHeartbeat task with each 
> RegisterWorker response
> -
>
> Key: SPARK-19001
> URL: https://issues.apache.org/jira/browse/SPARK-19001
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.1
>Reporter: liujianhui
>Assignee: Apache Spark
>Priority: Minor
>
> h2. scene
> worker will submit multiply task of SendHeartbeat and CleanWorkDir if the 
> worker register itself again, this code as follow
> {code}
> private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = 
> synchronized {
> msg match {
>   case RegisteredWorker(masterRef, masterWebUiUrl) =>
> logInfo("Successfully registered with master " + 
> masterRef.address.toSparkURL)
> registered = true
> changeMaster(masterRef, masterWebUiUrl)
> forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
>   override def run(): Unit = Utils.tryLogNonFatalError {
> self.send(SendHeartbeat)
>   }
> }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
> if (CLEANUP_ENABLED) {
>   logInfo(
> s"Worker cleanup enabled; old application directories will be 
> deleted in: $workDir")
>   forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
> override def run(): Unit = Utils.tryLogNonFatalError {
>   self.send(WorkDirCleanup)
> }
>   }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, 
> TimeUnit.MILLISECONDS)
> }
> {code} 
> log as follow
> {code}
> 2016-12-20 20:23:30,030 | Successfully registered with master 
> spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58)
> 2016-12-26 20:41:58,058 | Successfully registered with master 
> spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58)
> {code}
> if the worker Send RegisterWorker event multiply many times if the master 
> found the heartbeat of the worker expired, then it will submit task multiply 
> times






[jira] [Assigned] (SPARK-19002) Check pep8 against merge_spark_pr.py script

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19002:


Assignee: (was: Apache Spark)

> Check pep8 against merge_spark_pr.py script
> ---
>
> Key: SPARK-19002
> URL: https://issues.apache.org/jira/browse/SPARK-19002
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> We can check pep8 against merge_spark_pr.py script. There are already several 
> python scripts there.






[jira] [Assigned] (SPARK-19002) Check pep8 against merge_spark_pr.py script

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19002:


Assignee: Apache Spark

> Check pep8 against merge_spark_pr.py script
> ---
>
> Key: SPARK-19002
> URL: https://issues.apache.org/jira/browse/SPARK-19002
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> We can check pep8 against merge_spark_pr.py script. There are already several 
> python scripts there.






[jira] [Commented] (SPARK-19002) Check pep8 against merge_spark_pr.py script

2016-12-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778313#comment-15778313
 ] 

Apache Spark commented on SPARK-19002:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16405

> Check pep8 against merge_spark_pr.py script
> ---
>
> Key: SPARK-19002
> URL: https://issues.apache.org/jira/browse/SPARK-19002
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> We can check pep8 against merge_spark_pr.py script. There are already several 
> python scripts there.






[jira] [Created] (SPARK-19002) Check pep8 against merge_spark_pr.py script

2016-12-26 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-19002:


 Summary: Check pep8 against merge_spark_pr.py script
 Key: SPARK-19002
 URL: https://issues.apache.org/jira/browse/SPARK-19002
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Hyukjin Kwon
Priority: Trivial


We can check pep8 against the merge_spark_pr.py script. There are already 
several other Python scripts there.






[jira] [Created] (SPARK-19001) Worker will submit multiply CleanWorkDir and SendHeartbeat task with each RegisterWorker response

2016-12-26 Thread liujianhui (JIRA)
liujianhui created SPARK-19001:
--

 Summary: Worker will submit multiply CleanWorkDir and 
SendHeartbeat task with each RegisterWorker response
 Key: SPARK-19001
 URL: https://issues.apache.org/jira/browse/SPARK-19001
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.6.1
Reporter: liujianhui
Priority: Minor


h2. Scene
The worker will schedule the SendHeartbeat and CleanWorkDir tasks multiple 
times if it registers itself with the master again; the relevant code is as 
follows:
{code}
private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized {
  msg match {
    case RegisteredWorker(masterRef, masterWebUiUrl) =>
      logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
      registered = true
      changeMaster(masterRef, masterWebUiUrl)
      forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          self.send(SendHeartbeat)
        }
      }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
      if (CLEANUP_ENABLED) {
        logInfo(
          s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
        forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            self.send(WorkDirCleanup)
          }
        }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
      }
{code} 

log as follow
{code}
2016-12-20 20:23:30,030 | Successfully registered with master 
spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58)
2016-12-26 20:41:58,058 | Successfully registered with master 
spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58)
{code}

If the worker sends the RegisterWorker event multiple times because the master 
found its heartbeat expired, then these tasks will be submitted multiple times.
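
One way to avoid the duplication (editor's sketch only; the names are 
hypothetical and this is not the fix proposed in the pull request) is to keep 
the ScheduledFuture handles and cancel any existing schedule before 
re-scheduling:

{code}
import java.util.concurrent.{Executors, ScheduledFuture, TimeUnit}

// Sketch only: cancel a previously scheduled periodic task before scheduling it again,
// so a repeated RegisterWorkerResponse does not leave duplicate heartbeat tasks running.
object RescheduleGuardSketch {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  private var heartbeatTask: Option[ScheduledFuture[_]] = None  // hypothetical field

  def onRegistered(heartbeatMillis: Long): Unit = synchronized {
    heartbeatTask.foreach(_.cancel(false))  // drop the old schedule, if any
    heartbeatTask = Some(scheduler.scheduleAtFixedRate(
      new Runnable { override def run(): Unit = println("SendHeartbeat") },
      0, heartbeatMillis, TimeUnit.MILLISECONDS))
  }
}
{code}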






[jira] [Created] (SPARK-19000) Spark beeline: table was created at default database even though specifing a database name

2016-12-26 Thread Xiaochen Ouyang (JIRA)
Xiaochen Ouyang created SPARK-19000:
---

 Summary: Spark beeline: table was created  at default database  
even though specifing a database name
 Key: SPARK-19000
 URL: https://issues.apache.org/jira/browse/SPARK-19000
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
 Environment: spark2.0.0
Reporter: Xiaochen Ouyang


After running the following command:
${SPARK_HOME}/bin/beeline -u jdbc:hive2://192.168.156.5:10001/mydb -n mr -e 
"create table test_1(id int ,name string) row format delimited fields 
terminated by ',' stored as textfile;"
I found that the table "test_1" was created in the default database rather than 
in the mydb database that I specified.
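
A hedged workaround sketch (editor's note, not from the ticket): qualifying the 
database explicitly in the statement avoids relying on the database from the 
connection URL. The same fully qualified DDL can be issued through beeline:

{code}
// Sketch only: name the target database in the DDL itself (Spark with Hive support).
spark.sql(
  """CREATE TABLE mydb.test_1 (id INT, name STRING)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |STORED AS TEXTFILE""".stripMargin)
{code}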







[jira] [Commented] (SPARK-18969) PullOutNondeterministic should work for Aggregate operator

2016-12-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778112#comment-15778112
 ] 

Apache Spark commented on SPARK-18969:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/16404

> PullOutNondeterministic should work for Aggregate operator
> --
>
> Key: SPARK-18969
> URL: https://issues.apache.org/jira/browse/SPARK-18969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This test case should pass:
> {code}
> checkAnalysis(
>   r.groupBy(rnd)(rnd),
>   r.select(a, b, rnd).groupBy(rndref)(rndref)
> )
> {code}






[jira] [Assigned] (SPARK-18819) Double alignment on ARM71 platform

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18819:


Assignee: (was: Apache Spark)

> Double alignment on ARM71 platform
> --
>
> Key: SPARK-18819
> URL: https://issues.apache.org/jira/browse/SPARK-18819
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark
>Affects Versions: 2.0.2
> Environment: Ubuntu 14.04 LTS on ARM 7.1
>Reporter: Michael Kamprath
>Priority: Critical
>
> _Note - Updated the ticket title to be reflective of what was found to be the 
> underlying issue_
> When I create a data frame in PySpark with a small row count (less than 
> number executors), then write it to a parquet file, then load that parquet 
> file into a new data frame, and finally do any sort of read against the 
> loaded new data frame, Spark fails with an {{ExecutorLostFailure}}.
> Example code to replicate this issue:
> {code}
> from pyspark.sql.types import *
> rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')])
> my_schema = StructType([
> StructField("id", StringType(), True),
> StructField("value1", IntegerType(), True),
> StructField("value2", DoubleType(), True),
> StructField("name",StringType(), True)
> ])
> df = spark.createDataFrame( rdd, schema=my_schema)
> df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite')
> newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
> newdf.take(1)
> {code}
> The error I get when the {{take}} step runs is:
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>   1 newdf = 
> spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
> > 2 newdf.take(1)
> /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num)
> 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
> 347 """
> --> 348 return self.limit(num).collect()
> 349 
> 350 @since(1.3)
> /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self)
> 308 """
> 309 with SCCallSiteSync(self._sc) as css:
> --> 310 port = self._jdf.collectToPython()
> 311 return list(_load_from_socket(port, 
> BatchedSerializer(PickleSerializer(
> 312 
> /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134 
>1135 for temp_arg in temp_args:
> /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o54.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of 
> the running tasks) Reason: Remote RPC client disassociated. Likely due to 
> containers exceeding thresholds, or network issues. Check driver logs for 
> WARN messages.
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at 

[jira] [Assigned] (SPARK-18819) Double alignment on ARM71 platform

2016-12-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18819:


Assignee: Apache Spark

> Double alignment on ARM71 platform
> --
>
> Key: SPARK-18819
> URL: https://issues.apache.org/jira/browse/SPARK-18819
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark
>Affects Versions: 2.0.2
> Environment: Ubuntu 14.04 LTS on ARM 7.1
>Reporter: Michael Kamprath
>Assignee: Apache Spark
>Priority: Critical
>
> _Note - Updated the ticket title to be reflective of what was found to be the 
> underlying issue_
> When I create a data frame in PySpark with a small row count (less than 
> number executors), then write it to a parquet file, then load that parquet 
> file into a new data frame, and finally do any sort of read against the 
> loaded new data frame, Spark fails with an {{ExecutorLostFailure}}.
> Example code to replicate this issue:
> {code}
> from pyspark.sql.types import *
> rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')])
> my_schema = StructType([
> StructField("id", StringType(), True),
> StructField("value1", IntegerType(), True),
> StructField("value2", DoubleType(), True),
> StructField("name",StringType(), True)
> ])
> df = spark.createDataFrame( rdd, schema=my_schema)
> df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite')
> newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
> newdf.take(1)
> {code}
> The error I get when the {{take}} step runs is:
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>   1 newdf = 
> spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
> > 2 newdf.take(1)
> /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num)
> 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
> 347 """
> --> 348 return self.limit(num).collect()
> 349 
> 350 @since(1.3)
> /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self)
> 308 """
> 309 with SCCallSiteSync(self._sc) as css:
> --> 310 port = self._jdf.collectToPython()
> 311 return list(_load_from_socket(port, 
> BatchedSerializer(PickleSerializer(
> 312 
> /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134 
>1135 for temp_arg in temp_args:
> /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o54.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of 
> the running tasks) Reason: Remote RPC client disassociated. Likely due to 
> containers exceeding thresholds, or network issues. Check driver logs for 
> WARN messages.
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   

[jira] [Commented] (SPARK-18819) Double alignment on ARM71 platform

2016-12-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778107#comment-15778107
 ] 

Apache Spark commented on SPARK-18819:
--

User 'michaelkamprath' has created a pull request for this issue:
https://github.com/apache/spark/pull/16403
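
For context on the updated title: the sketch below is purely illustrative (it is 
not taken from the pull request; the offsets and buffer size are arbitrary) and 
shows why 8-byte alignment of doubles matters. On platforms such as ARMv7 that do 
not support unaligned 64-bit memory access, reading or writing a double through 
sun.misc.Unsafe at an unaligned address can fault or fall back on slow kernel 
fix-ups, while the same access at an 8-byte-aligned address is fine. Spark's 
off-heap row format reads doubles in a similar way, which is presumably how 
unaligned accesses surface as the executor failures seen in the reproduction 
above.

{code}
// Illustrative only, not the PR's code: aligned vs. unaligned double access via Unsafe.
object UnalignedDoubleDemo {
  def main(args: Array[String]): Unit = {
    // Obtain sun.misc.Unsafe reflectively, the same way Spark's Platform class does.
    val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    val unsafe = field.get(null).asInstanceOf[sun.misc.Unsafe]

    val base = unsafe.allocateMemory(32) // allocateMemory returns an address aligned for all types
    try {
      unsafe.putDouble(base, 3.14)        // 8-byte aligned: safe on every platform
      println(unsafe.getDouble(base))

      unsafe.putDouble(base + 3, 3.14)    // unaligned: on ARMv7 this may raise SIGBUS or rely on
      println(unsafe.getDouble(base + 3)) // slow kernel fix-ups, depending on configuration
    } finally {
      unsafe.freeMemory(base)
    }
  }
}
{code}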

> Double alignment on ARM71 platform
> --
>
> Key: SPARK-18819
> URL: https://issues.apache.org/jira/browse/SPARK-18819
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark
>Affects Versions: 2.0.2
> Environment: Ubuntu 14.04 LTS on ARM 7.1
>Reporter: Michael Kamprath
>Priority: Critical
>
> _Note - Updated the ticket title to reflect what was found to be the 
> underlying issue_
> When I create a DataFrame in PySpark with a small row count (fewer than the 
> number of executors), then write it to a Parquet file, load that Parquet 
> file into a new DataFrame, and finally do any sort of read against the 
> newly loaded DataFrame, Spark fails with an {{ExecutorLostFailure}}.
> Example code to replicate this issue:
> {code}
> from pyspark.sql.types import *
> rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')])
> my_schema = StructType([
> StructField("id", StringType(), True),
> StructField("value1", IntegerType(), True),
> StructField("value2", DoubleType(), True),
> StructField("name",StringType(), True)
> ])
> df = spark.createDataFrame( rdd, schema=my_schema)
> df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite')
> newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
> newdf.take(1)
> {code}
> The error I get when the {{take}} step runs is:
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>   1 newdf = 
> spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
> > 2 newdf.take(1)
> /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num)
> 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
> 347 """
> --> 348 return self.limit(num).collect()
> 349 
> 350 @since(1.3)
> /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self)
> 308 """
> 309 with SCCallSiteSync(self._sc) as css:
> --> 310 port = self._jdf.collectToPython()
> 311 return list(_load_from_socket(port, 
> BatchedSerializer(PickleSerializer(
> 312 
> /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134 
>1135 for temp_arg in temp_args:
> /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o54.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of 
> the running tasks) Reason: Remote RPC client disassociated. Likely due to 
> containers exceeding thresholds, or network issues. Check driver logs for 
> WARN messages.
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at 
> 

[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-26 Thread vishal agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778011#comment-15778011
 ] 

vishal agrawal commented on SPARK-18857:


We have built Spark from the 2.0.2 source code, reverting 
SparkExecuteStatementOperation.scala to its pre-SPARK-16563 version. This build 
works fine without causing any Thrift server issues.

> SparkSQL ThriftServer hangs while extracting huge data volumes in incremental 
> collect mode
> --
>
> Key: SPARK-18857
> URL: https://issues.apache.org/jira/browse/SPARK-18857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: vishal agrawal
> Attachments: GC-spark-1.6.3, GC-spark-2.0.2
>
>
> We are running a SQL query on our Spark cluster and extracting around 200 
> million records through the SparkSQL ThriftServer interface. The query works 
> fine on Spark 1.6.3; on Spark 2.0.2, however, the Thrift server hangs after 
> fetching data from a few partitions (we are using incremental collect mode 
> with 400 partitions). According to the documentation, the maximum memory used 
> by the Thrift server should be what is required by the biggest data 
> partition. But we observed that the Thrift server does not release the memory 
> of already-fetched partitions when GC occurs, even though it has moved on to 
> fetching the next partition. This is not the case with 1.6.3.
> On further investigation we found that SparkExecuteStatementOperation.scala 
> was modified by "[SPARK-16563][SQL] fix spark sql thrift server FetchResults 
> bug", where the result set iterator was duplicated to keep a reference to the 
> head of the result set:
> +  val (itra, itrb) = iter.duplicate
> +  iterHeader = itra
> +  iter = itrb
> We suspect that this duplication is what prevents the memory from being 
> reclaimed on GC. To confirm this, we created an iterator in our test class 
> and fetched the data once without duplicating it and a second time after 
> creating a duplicate. In the first case it ran fine and fetched the entire 
> data set, while in the second case the driver hung after fetching data from 
> a few partitions.
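
A standalone way to see the same retention behaviour outside Spark is sketched 
below (a minimal example with hypothetical names, not the test class mentioned 
above): Iterator.duplicate returns two copies backed by a shared buffer, and as 
long as the never-advanced copy (the "header" reference) lags behind, every 
element handed out by the other copy stays reachable and cannot be garbage 
collected.

{code}
// Standalone sketch (not Spark code) of the suspected retention pattern.
object DuplicateIteratorDemo {
  def main(args: Array[String]): Unit = {
    // Roughly 10 GB of transient data if every chunk had to be retained at once.
    def chunks: Iterator[Array[Byte]] =
      Iterator.range(0, 10 * 1000 * 1000).map(_ => new Array[Byte](1024))

    // Baseline: each chunk becomes garbage as soon as the iterator moves on,
    // so this completes even in a small heap.
    println(chunks.map(_.length.toLong).sum)

    // Suspected pattern: the header copy is never advanced, so the shared buffer
    // accumulates every chunk the main copy produces; with a modest heap this
    // fails with OutOfMemoryError long before the sum is printed.
    val (iterHeader, iter) = chunks.duplicate
    println(iter.map(_.length.toLong).sum + " headerHasNext=" + iterHeader.hasNext)
  }
}
{code}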


