date:20160717

[jira] [Commented] (SPARK-16508) Fix documentation warnings found by R CMD check

2016-07-17 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381794#comment-15381794
 ] 

Felix Cheung commented on SPARK-16508:
--

I will check this since SPARK-16507 has been merged

> Fix documentation warnings found by R CMD check
> ---
>
> Key: SPARK-16508
> URL: https://issues.apache.org/jira/browse/SPARK-16508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> A full list of warnings after the fixes in SPARK-16507 is at 
> https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0

2016-07-17 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381792#comment-15381792
 ] 

Felix Cheung commented on SPARK-14816:
--

could be a nice show case to have a website? 
I do spend more time on the programming guide though (like 
http://spark.apache.org/docs/latest/sparkr.html, 
http://spark.apache.org/docs/latest/mllib-guide.html)

> Update MLlib, GraphX, SparkR websites for 2.0
> -
>
> Key: SPARK-14816
> URL: https://issues.apache.org/jira/browse/SPARK-14816
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.
> For MLlib, make it clear that the DataFrame-based API is the primary one now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16319) Non-linear (DAG) pipelines need better explanation

2016-07-17 Thread Max Moroz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381788#comment-15381788
 ] 

Max Moroz commented on SPARK-16319:
---

[~srowen] I'd love to, but best as I understand, the entire mention of DAG 
should be removed. It seems to do nothing. To your point that it checks the DAG 
property, I couldn't find anything like this in the code. It seems the pipeline 
is just executed one step after another, completely ignoring the information 
about in/out columns.

I hope I'm wrong, so if anyone can correct me please lmk.

> Non-linear (DAG) pipelines need better explanation
> --
>
> Key: SPARK-16319
> URL: https://issues.apache.org/jira/browse/SPARK-16319
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>Priority: Minor
>
> There's a 
> [paragraph|http://spark.apache.org/docs/2.0.0-preview/ml-guide.html#details] 
> about non-linear pipeline in the ML docs, but it's not clear how DAG pipeline 
> differs from a linear pipeline, and in fact, it seems that a "DAG Pipeline" 
> results in the behavior identical to that of a regular linear pipeline (the 
> stages are simply applied in the order provided when the pipeline is 
> created). In addition, no checks of input and output columns seem to occur 
> when the pipeline.fit() or pipeline.transform() is called.
> It would be better to clarify in the docs and/or remove that paragraph.
> I'd be happy to write it up, but I have no idea what the intention of this 
> concept is at this point.
> [Additional reference on 
> SO|http://stackoverflow.com/questions/37541668/non-linear-dag-ml-pipelines-in-apache-spark]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16560) Spark-submit fails without an error

2016-07-17 Thread Chaitanya (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381778#comment-15381778
 ] 

Chaitanya commented on SPARK-16560:
---

I found a hint in the following paragraph:-
"You must be running in cluster mode. The Spark Master accepts client mode 
submissions on port 7077 and cluster mode submissions on port 6066. This is 
because standalone cluster mode uses a REST API to submit applications by 
default. If you submit to port 6066 instead the warning should go away."


> Spark-submit fails without an error
> ---
>
> Key: SPARK-16560
> URL: https://issues.apache.org/jira/browse/SPARK-16560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2
> Environment: Raspbian Jessie
>Reporter: Chaitanya
>
> I used the following command to run the spark java example of wordcount:-
> time spark-submit --deploy-mode cluster --master spark://192.168.0.7:7077 
> --class org.apache.spark.examples.JavaWordCount 
> /home/pi/Desktop/example/new/target/javaword.jar /books_500.txt 
> I have copied the same jar file into all nodes in the same location. (Copying 
> into HDFS didn't work for me.) When I run it, the following is the output:-
> Running Spark using the REST application submission protocol.
> 16/07/14 16:32:18 INFO rest.RestSubmissionClient: Submitting a request to 
> launch an application in spark://192.168.0.7:7077.
> 16/07/14 16:32:30 WARN rest.RestSubmissionClient: Unable to connect to server 
> spark://192.168.0.7:7077.
> Warning: Master endpoint spark://192.168.0.7:7077 was not a REST server. 
> Falling back to legacy submission gateway instead.
> 16/07/14 16:32:30 WARN util.Utils: Your hostname, master02 resolves to a 
> loopback address: 127.0.1.1; using 192.168.0.7 instead (on interface wlan0)
> 16/07/14 16:32:30 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 16/07/14 16:32:31 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> It just stops there, quits the job and waits for the next command on 
> terminal. I didn't understand this error without an error message. Help 
> needed please...!!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Closed] (SPARK-16560) Spark-submit fails without an error

2016-07-17 Thread Chaitanya (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaitanya closed SPARK-16560.
-
Resolution: Information Provided

> Spark-submit fails without an error
> ---
>
> Key: SPARK-16560
> URL: https://issues.apache.org/jira/browse/SPARK-16560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2
> Environment: Raspbian Jessie
>Reporter: Chaitanya
>
> I used the following command to run the spark java example of wordcount:-
> time spark-submit --deploy-mode cluster --master spark://192.168.0.7:7077 
> --class org.apache.spark.examples.JavaWordCount 
> /home/pi/Desktop/example/new/target/javaword.jar /books_500.txt 
> I have copied the same jar file into all nodes in the same location. (Copying 
> into HDFS didn't work for me.) When I run it, the following is the output:-
> Running Spark using the REST application submission protocol.
> 16/07/14 16:32:18 INFO rest.RestSubmissionClient: Submitting a request to 
> launch an application in spark://192.168.0.7:7077.
> 16/07/14 16:32:30 WARN rest.RestSubmissionClient: Unable to connect to server 
> spark://192.168.0.7:7077.
> Warning: Master endpoint spark://192.168.0.7:7077 was not a REST server. 
> Falling back to legacy submission gateway instead.
> 16/07/14 16:32:30 WARN util.Utils: Your hostname, master02 resolves to a 
> loopback address: 127.0.1.1; using 192.168.0.7 instead (on interface wlan0)
> 16/07/14 16:32:30 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 16/07/14 16:32:31 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> It just stops there, quits the job and waits for the next command on 
> terminal. I didn't understand this error without an error message. Help 
> needed please...!!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16595) Spark History server Rest Api gives Application not found error for yarn-cluster mode

2016-07-17 Thread Yesha Vora (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381760#comment-15381760
 ] 

Yesha Vora commented on SPARK-16595:


[~sowen], SPARK-15923 is referring to yarn-client mode.  I opened this jira 
because spark HS rest api threw  app not found error in  yarn-cluster mode too. 
Sorry for not explicitly mentioning earlier. Since this issue is different from 
SPARK-15923. Thus, reopening this jira. 

> Spark History server Rest Api gives Application not found error for 
> yarn-cluster mode
> -
>
> Key: SPARK-16595
> URL: https://issues.apache.org/jira/browse/SPARK-16595
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Scenario:
> * Start SparkPi application in Spark1 using yarn-cluster mode 
> (application_1468686376753_0041) 
> * After application finishes validate application exists in respective Spark 
> History server.
> {code}
> Error loading url 
> http://xx.xx.xx.xx:18080/api/v1/applications/application_1468686376753_0041/1/executors
> HTTP Code: 404
> HTTP Data: no such app: application_1468686376753_0041{code}
> {code:title=spark HS log}
> 16/07/16 15:55:29 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0049.inprogress
> 16/07/16 15:56:20 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0049
> 16/07/16 16:23:14 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0061.inprogress
> 16/07/16 16:24:14 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0061
> 16/07/16 17:42:32 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468690940553.inprogress
> 16/07/16 17:43:22 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468690940553
> 16/07/16 17:43:44 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468691017376.inprogress
> 16/07/16 17:44:34 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468691017376
> 16/07/16 18:53:10 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0041_1.inprogress
> 16/07/16 19:03:26 INFO PackagesResourceConfig: Scanning for root resource and 
> provider classes in the packages:
>   org.apache.spark.status.api.v1
> 16/07/16 19:03:35 INFO ScanningResourceConfig: Root resource classes found:
>   class org.apache.spark.status.api.v1.ApiRootResource
> 16/07/16 19:03:35 INFO ScanningResourceConfig: Provider classes found:
>   class org.apache.spark.status.api.v1.JacksonMessageWriter
> 16/07/16 19:03:35 INFO WebApplicationImpl: Initiating Jersey application, 
> version 'Jersey: 1.9 09/02/2011 11:17 AM'
> 16/07/16 19:03:36 INFO SecurityManager: Changing view acls to: spark
> 16/07/16 19:03:36 INFO SecurityManager: Changing modify acls to: spark
> 16/07/16 19:03:36 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(spark); users 
> with modify permissions: Set(spark)
> 16/07/16 19:03:36 INFO ApplicationCache: Failed to load application attempt 
> application_1468686376753_0041/Some(1)
> 16/07/16 19:04:21 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043.inprogress
> 16/07/16 19:12:02 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043
> 16/07/16 19:16:11 INFO SecurityManager: Changing view acls to: spark
> 16/07/16 19:16:11 INFO SecurityManager: Changing modify acls to: spark
> 16/07/16 19:16:11 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(spark); users 
> with modify permissions: Set(spark)
> 16/07/16 19:16:11 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043
> 16/07/16 19:16:22 INFO SecurityManager: Changing acls enabled to: false
> 16/07/16 19:16:22 INFO SecurityManager: Changing admin acls to:
> 16/07/16 19:16:22 INFO SecurityManager: Changing view acls to: hrt_qa{code}
> {code}
> hdfs@xxx:/var/log/spark$ hdfs dfs -ls /spark-history/
> Found 8 items
> -rwxrwx---   3 hrt_qa hadoop  28793 2016-07-16 15:56 
> /spark-history/application_1468678823755_0049
> -rwxrwx---   3 hrt_qa hadoop  28763 2016-07-16 16:24 
> /spark-history/application_1468678823755_0061
> -rwxrwx---   3 hrt_qa hadoop   58868885 2016-07-16 18:59 
> /spark-history/application_1468686376753_0041_1
> -r

[jira] [Resolved] (SPARK-16588) Deprecate monotonicallyIncreasingId in Scala

2016-07-17 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16588.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Deprecate monotonicallyIncreasingId in Scala
> 
>
> Key: SPARK-16588
> URL: https://issues.apache.org/jira/browse/SPARK-16588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Reynold Xin
>Priority: Trivial
> Fix For: 2.0.1, 2.1.0
>
>
> It seems the function {{monotonicallyIncreasingId}} was missed in 
> https://issues.apache.org/jira/browse/SPARK-10621 
> The registered name is 
> [{{monotonically_increasing_id}}|https://github.com/apache/spark/blob/56bd399a86c4e92be412d151200cb5e4a5f6a48a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L369]
>  but 
> [{{monotonicallyIncreasingId}}|https://github.com/apache/spark/blob/5f342049cce9102fb62b4de2d8d8fa691c2e8ac4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L981]
>  was still not deprecated and removed. 
> So, this was also missed in https://issues.apache.org/jira/browse/SPARK-12600 
> (removing deprecated APIs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16588) Deprecate monotonicallyIncreasingId in Scala

2016-07-17 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16588:

Summary: Deprecate monotonicallyIncreasingId in Scala  (was: Deprecate )

> Deprecate monotonicallyIncreasingId in Scala
> 
>
> Key: SPARK-16588
> URL: https://issues.apache.org/jira/browse/SPARK-16588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> It seems the function {{monotonicallyIncreasingId}} was missed in 
> https://issues.apache.org/jira/browse/SPARK-10621 
> The registered name is 
> [{{monotonically_increasing_id}}|https://github.com/apache/spark/blob/56bd399a86c4e92be412d151200cb5e4a5f6a48a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L369]
>  but 
> [{{monotonicallyIncreasingId}}|https://github.com/apache/spark/blob/5f342049cce9102fb62b4de2d8d8fa691c2e8ac4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L981]
>  was still not deprecated and removed. 
> So, this was also missed in https://issues.apache.org/jira/browse/SPARK-12600 
> (removing deprecated APIs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16588) Deprecate monotonicallyIncreasingId in Scala

2016-07-17 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-16588:
---

Assignee: Reynold Xin

> Deprecate monotonicallyIncreasingId in Scala
> 
>
> Key: SPARK-16588
> URL: https://issues.apache.org/jira/browse/SPARK-16588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Reynold Xin
>Priority: Trivial
>
> It seems the function {{monotonicallyIncreasingId}} was missed in 
> https://issues.apache.org/jira/browse/SPARK-10621 
> The registered name is 
> [{{monotonically_increasing_id}}|https://github.com/apache/spark/blob/56bd399a86c4e92be412d151200cb5e4a5f6a48a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L369]
>  but 
> [{{monotonicallyIncreasingId}}|https://github.com/apache/spark/blob/5f342049cce9102fb62b4de2d8d8fa691c2e8ac4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L981]
>  was still not deprecated and removed. 
> So, this was also missed in https://issues.apache.org/jira/browse/SPARK-12600 
> (removing deprecated APIs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16588) Deprecate

2016-07-17 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16588:

Summary: Deprecate   (was: Missed API fix for a function name mismatched 
between FunctionRegistry and functions.scala)

> Deprecate 
> --
>
> Key: SPARK-16588
> URL: https://issues.apache.org/jira/browse/SPARK-16588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> It seems the function {{monotonicallyIncreasingId}} was missed in 
> https://issues.apache.org/jira/browse/SPARK-10621 
> The registered name is 
> [{{monotonically_increasing_id}}|https://github.com/apache/spark/blob/56bd399a86c4e92be412d151200cb5e4a5f6a48a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L369]
>  but 
> [{{monotonicallyIncreasingId}}|https://github.com/apache/spark/blob/5f342049cce9102fb62b4de2d8d8fa691c2e8ac4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L981]
>  was still not deprecated and removed. 
> So, this was also missed in https://issues.apache.org/jira/browse/SPARK-12600 
> (removing deprecated APIs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Reopened] (SPARK-16595) Spark History server Rest Api gives Application not found error for yarn-cluster mode

2016-07-17 Thread Yesha Vora (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesha Vora reopened SPARK-16595:


> Spark History server Rest Api gives Application not found error for 
> yarn-cluster mode
> -
>
> Key: SPARK-16595
> URL: https://issues.apache.org/jira/browse/SPARK-16595
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Scenario:
> * Start SparkPi application in Spark1 using yarn-cluster mode 
> (application_1468686376753_0041) 
> * After application finishes validate application exists in respective Spark 
> History server.
> {code}
> Error loading url 
> http://xx.xx.xx.xx:18080/api/v1/applications/application_1468686376753_0041/1/executors
> HTTP Code: 404
> HTTP Data: no such app: application_1468686376753_0041{code}
> {code:title=spark HS log}
> 16/07/16 15:55:29 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0049.inprogress
> 16/07/16 15:56:20 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0049
> 16/07/16 16:23:14 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0061.inprogress
> 16/07/16 16:24:14 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0061
> 16/07/16 17:42:32 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468690940553.inprogress
> 16/07/16 17:43:22 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468690940553
> 16/07/16 17:43:44 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468691017376.inprogress
> 16/07/16 17:44:34 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468691017376
> 16/07/16 18:53:10 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0041_1.inprogress
> 16/07/16 19:03:26 INFO PackagesResourceConfig: Scanning for root resource and 
> provider classes in the packages:
>   org.apache.spark.status.api.v1
> 16/07/16 19:03:35 INFO ScanningResourceConfig: Root resource classes found:
>   class org.apache.spark.status.api.v1.ApiRootResource
> 16/07/16 19:03:35 INFO ScanningResourceConfig: Provider classes found:
>   class org.apache.spark.status.api.v1.JacksonMessageWriter
> 16/07/16 19:03:35 INFO WebApplicationImpl: Initiating Jersey application, 
> version 'Jersey: 1.9 09/02/2011 11:17 AM'
> 16/07/16 19:03:36 INFO SecurityManager: Changing view acls to: spark
> 16/07/16 19:03:36 INFO SecurityManager: Changing modify acls to: spark
> 16/07/16 19:03:36 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(spark); users 
> with modify permissions: Set(spark)
> 16/07/16 19:03:36 INFO ApplicationCache: Failed to load application attempt 
> application_1468686376753_0041/Some(1)
> 16/07/16 19:04:21 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043.inprogress
> 16/07/16 19:12:02 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043
> 16/07/16 19:16:11 INFO SecurityManager: Changing view acls to: spark
> 16/07/16 19:16:11 INFO SecurityManager: Changing modify acls to: spark
> 16/07/16 19:16:11 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(spark); users 
> with modify permissions: Set(spark)
> 16/07/16 19:16:11 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043
> 16/07/16 19:16:22 INFO SecurityManager: Changing acls enabled to: false
> 16/07/16 19:16:22 INFO SecurityManager: Changing admin acls to:
> 16/07/16 19:16:22 INFO SecurityManager: Changing view acls to: hrt_qa{code}
> {code}
> hdfs@xxx:/var/log/spark$ hdfs dfs -ls /spark-history/
> Found 8 items
> -rwxrwx---   3 hrt_qa hadoop  28793 2016-07-16 15:56 
> /spark-history/application_1468678823755_0049
> -rwxrwx---   3 hrt_qa hadoop  28763 2016-07-16 16:24 
> /spark-history/application_1468678823755_0061
> -rwxrwx---   3 hrt_qa hadoop   58868885 2016-07-16 18:59 
> /spark-history/application_1468686376753_0041_1
> -rwxrwx---   3 hrt_qa hadoop   58841982 2016-07-16 19:11 
> /spark-history/application_1468686376753_0043
> -rwxrwx---   3 hive   hadoop   5823 2016-07-16 11:38 
> /spark-history/local-1468666932940
> -rwxrwx---   3 hive   hadoop   5757 2016-07-16 22:44 
> /spark-history/local-1468669677840.inprogress
> -rwxrwx---   3 hrt_qa had

[jira] [Updated] (SPARK-16595) Spark History server Rest Api gives Application not found error

2016-07-17 Thread Yesha Vora (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesha Vora updated SPARK-16595:
---
Description: 
Scenario:
* Start SparkPi application in Spark1 using yarn-cluster mode 
(application_1468686376753_0041) 
* After application finishes validate application exists in respective Spark 
History server.

{code}
Error loading url 
http://xx.xx.xx.xx:18080/api/v1/applications/application_1468686376753_0041/1/executors
HTTP Code: 404
HTTP Data: no such app: application_1468686376753_0041{code}

{code:title=spark HS log}
16/07/16 15:55:29 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0049.inprogress
16/07/16 15:56:20 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0049
16/07/16 16:23:14 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0061.inprogress
16/07/16 16:24:14 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0061
16/07/16 17:42:32 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/local-1468690940553.inprogress
16/07/16 17:43:22 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/local-1468690940553
16/07/16 17:43:44 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/local-1468691017376.inprogress
16/07/16 17:44:34 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/local-1468691017376
16/07/16 18:53:10 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0041_1.inprogress
16/07/16 19:03:26 INFO PackagesResourceConfig: Scanning for root resource and 
provider classes in the packages:
  org.apache.spark.status.api.v1
16/07/16 19:03:35 INFO ScanningResourceConfig: Root resource classes found:
  class org.apache.spark.status.api.v1.ApiRootResource
16/07/16 19:03:35 INFO ScanningResourceConfig: Provider classes found:
  class org.apache.spark.status.api.v1.JacksonMessageWriter
16/07/16 19:03:35 INFO WebApplicationImpl: Initiating Jersey application, 
version 'Jersey: 1.9 09/02/2011 11:17 AM'
16/07/16 19:03:36 INFO SecurityManager: Changing view acls to: spark
16/07/16 19:03:36 INFO SecurityManager: Changing modify acls to: spark
16/07/16 19:03:36 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(spark); users with 
modify permissions: Set(spark)
16/07/16 19:03:36 INFO ApplicationCache: Failed to load application attempt 
application_1468686376753_0041/Some(1)
16/07/16 19:04:21 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043.inprogress
16/07/16 19:12:02 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043
16/07/16 19:16:11 INFO SecurityManager: Changing view acls to: spark
16/07/16 19:16:11 INFO SecurityManager: Changing modify acls to: spark
16/07/16 19:16:11 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(spark); users with 
modify permissions: Set(spark)
16/07/16 19:16:11 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043
16/07/16 19:16:22 INFO SecurityManager: Changing acls enabled to: false
16/07/16 19:16:22 INFO SecurityManager: Changing admin acls to:
16/07/16 19:16:22 INFO SecurityManager: Changing view acls to: hrt_qa{code}

{code}
hdfs@xxx:/var/log/spark$ hdfs dfs -ls /spark-history/
Found 8 items
-rwxrwx---   3 hrt_qa hadoop  28793 2016-07-16 15:56 
/spark-history/application_1468678823755_0049
-rwxrwx---   3 hrt_qa hadoop  28763 2016-07-16 16:24 
/spark-history/application_1468678823755_0061
-rwxrwx---   3 hrt_qa hadoop   58868885 2016-07-16 18:59 
/spark-history/application_1468686376753_0041_1
-rwxrwx---   3 hrt_qa hadoop   58841982 2016-07-16 19:11 
/spark-history/application_1468686376753_0043
-rwxrwx---   3 hive   hadoop   5823 2016-07-16 11:38 
/spark-history/local-1468666932940
-rwxrwx---   3 hive   hadoop   5757 2016-07-16 22:44 
/spark-history/local-1468669677840.inprogress
-rwxrwx---   3 hrt_qa hadoop 484113 2016-07-16 17:43 
/spark-history/local-1468690940553
-rwxrwx---   3 hrt_qa hadoop  57747 2016-07-16 17:44 
/spark-history/local-1468691017376
hdfs@xxx:/var/log/spark$ hdfs dfs -ls 
/spark-history/application_1468686376753_0041_1
-rwxrwx---   3 hrt_qa hadoop   58868885 2016-07-16 18:59 
/spark-history/application_1468686376753_0041_1{code}

  was:
Scenario:
* Start SparkPi application in Spark1 (application_1468686376753_0041) and 
* After application finishes validate application exists in respective Spark 
History se

[jira] [Updated] (SPARK-16595) Spark History server Rest Api gives Application not found error for yarn-cluster mode

2016-07-17 Thread Yesha Vora (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesha Vora updated SPARK-16595:
---
Summary: Spark History server Rest Api gives Application not found error 
for yarn-cluster mode  (was: Spark History server Rest Api gives Application 
not found error)

> Spark History server Rest Api gives Application not found error for 
> yarn-cluster mode
> -
>
> Key: SPARK-16595
> URL: https://issues.apache.org/jira/browse/SPARK-16595
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Scenario:
> * Start SparkPi application in Spark1 using yarn-cluster mode 
> (application_1468686376753_0041) 
> * After application finishes validate application exists in respective Spark 
> History server.
> {code}
> Error loading url 
> http://xx.xx.xx.xx:18080/api/v1/applications/application_1468686376753_0041/1/executors
> HTTP Code: 404
> HTTP Data: no such app: application_1468686376753_0041{code}
> {code:title=spark HS log}
> 16/07/16 15:55:29 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0049.inprogress
> 16/07/16 15:56:20 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0049
> 16/07/16 16:23:14 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0061.inprogress
> 16/07/16 16:24:14 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0061
> 16/07/16 17:42:32 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468690940553.inprogress
> 16/07/16 17:43:22 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468690940553
> 16/07/16 17:43:44 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468691017376.inprogress
> 16/07/16 17:44:34 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468691017376
> 16/07/16 18:53:10 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0041_1.inprogress
> 16/07/16 19:03:26 INFO PackagesResourceConfig: Scanning for root resource and 
> provider classes in the packages:
>   org.apache.spark.status.api.v1
> 16/07/16 19:03:35 INFO ScanningResourceConfig: Root resource classes found:
>   class org.apache.spark.status.api.v1.ApiRootResource
> 16/07/16 19:03:35 INFO ScanningResourceConfig: Provider classes found:
>   class org.apache.spark.status.api.v1.JacksonMessageWriter
> 16/07/16 19:03:35 INFO WebApplicationImpl: Initiating Jersey application, 
> version 'Jersey: 1.9 09/02/2011 11:17 AM'
> 16/07/16 19:03:36 INFO SecurityManager: Changing view acls to: spark
> 16/07/16 19:03:36 INFO SecurityManager: Changing modify acls to: spark
> 16/07/16 19:03:36 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(spark); users 
> with modify permissions: Set(spark)
> 16/07/16 19:03:36 INFO ApplicationCache: Failed to load application attempt 
> application_1468686376753_0041/Some(1)
> 16/07/16 19:04:21 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043.inprogress
> 16/07/16 19:12:02 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043
> 16/07/16 19:16:11 INFO SecurityManager: Changing view acls to: spark
> 16/07/16 19:16:11 INFO SecurityManager: Changing modify acls to: spark
> 16/07/16 19:16:11 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(spark); users 
> with modify permissions: Set(spark)
> 16/07/16 19:16:11 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043
> 16/07/16 19:16:22 INFO SecurityManager: Changing acls enabled to: false
> 16/07/16 19:16:22 INFO SecurityManager: Changing admin acls to:
> 16/07/16 19:16:22 INFO SecurityManager: Changing view acls to: hrt_qa{code}
> {code}
> hdfs@xxx:/var/log/spark$ hdfs dfs -ls /spark-history/
> Found 8 items
> -rwxrwx---   3 hrt_qa hadoop  28793 2016-07-16 15:56 
> /spark-history/application_1468678823755_0049
> -rwxrwx---   3 hrt_qa hadoop  28763 2016-07-16 16:24 
> /spark-history/application_1468678823755_0061
> -rwxrwx---   3 hrt_qa hadoop   58868885 2016-07-16 18:59 
> /spark-history/application_1468686376753_0041_1
> -rwxrwx---   3 hrt_qa hadoop   58841982 2016-07-16 19:11 
> /spark-history/application_1468686376753_0043
> -rwxrwx---   3 hive   hadoop   5823 2016-07-16 11:38 
>

[jira] [Commented] (SPARK-16321) Pyspark 2.0 performance drop vs pyspark 1.6

2016-07-17 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381755#comment-15381755
 ] 

Maciej Bryński commented on SPARK-16321:


Yep. 
I'll try.

> Pyspark 2.0 performance drop vs pyspark 1.6
> ---
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16321) Pyspark 2.0 performance drop vs pyspark 1.6

2016-07-17 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381749#comment-15381749
 ] 

Reynold Xin commented on SPARK-16321:
-

Thanks - that's a great find. Can you take a look at what objects are being 
created? (You can do that with any allocation profiler, e.g. visualvm has one).


> Pyspark 2.0 performance drop vs pyspark 1.6
> ---
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16321) Pyspark 2.0 performance drop vs pyspark 1.6

2016-07-17 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381747#comment-15381747
 ] 

Maciej Bryński commented on SPARK-16321:


I did some more investigation.

I started to test different GC. 
During initial test I was using G1GC. 
After changing to Parallel GC there is almost no difference in Spark 1.6 but 
time for Spark 2.0 drop down to 3.0 minutes.
I'm thinking there is a problem with creating too many objects in JVM. 
(compared to Spark 1.6)

> Pyspark 2.0 performance drop vs pyspark 1.6
> ---
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16597) DataFrame DateType is written as an int(Days since epoch) by csv writer

2016-07-17 Thread Dean Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381719#comment-15381719
 ] 

Dean Chen commented on SPARK-16597:
---

Yes, closing as a dupe.

> DataFrame DateType is written as an int(Days since epoch) by csv writer
> ---
>
> Key: SPARK-16597
> URL: https://issues.apache.org/jira/browse/SPARK-16597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dean Chen
>  Labels: csv
>
> import java.sql.Date
> case class DateClass(date: java.sql.Date)
> val df = spark.createDataFrame(Seq(DateClass(new Date(1468774636000L
> df.write.csv("test.csv")
> file content is 16999, days since epoch instead of 7/17/16



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Closed] (SPARK-16597) DataFrame DateType is written as an int(Days since epoch) by csv writer

2016-07-17 Thread Dean Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dean Chen closed SPARK-16597.
-
Resolution: Duplicate

> DataFrame DateType is written as an int(Days since epoch) by csv writer
> ---
>
> Key: SPARK-16597
> URL: https://issues.apache.org/jira/browse/SPARK-16597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dean Chen
>  Labels: csv
>
> import java.sql.Date
> case class DateClass(date: java.sql.Date)
> val df = spark.createDataFrame(Seq(DateClass(new Date(1468774636000L
> df.write.csv("test.csv")
> file content is 16999, days since epoch instead of 7/17/16



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16301) Analyzer rule for resolving using joins should respect case sensitivity setting

2016-07-17 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381705#comment-15381705
 ] 

Shivaram Venkataraman commented on SPARK-16301:
---

[~yhuai] [~davies] The PR looks to have been merged, so we can resolve this 
issue ? 

> Analyzer rule for resolving using joins should respect case sensitivity 
> setting
> ---
>
> Key: SPARK-16301
> URL: https://issues.apache.org/jira/browse/SPARK-16301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> Quick repro: Passes on Spark 1.6.x, but fails on 2.0
> {code}
> case class MyColumn(userId: Int, field: String)
> val ds = Seq(MyColumn(1, "a")).toDF
> ds.join(ds, Seq("userid"))
> {code}
> {code}
> stacktrace:
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:313)
>   at scala.None$.get(Option.scala:311)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$88.apply(Analyzer.scala:1844)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$88.apply(Analyzer.scala:1844)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16597) DataFrame DateType is written as an int(Days since epoch) by csv writer

2016-07-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381696#comment-15381696
 ] 

Hyukjin Kwon commented on SPARK-16597:
--

I guess this is a duplicated of SPARK-16216.

> DataFrame DateType is written as an int(Days since epoch) by csv writer
> ---
>
> Key: SPARK-16597
> URL: https://issues.apache.org/jira/browse/SPARK-16597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dean Chen
>  Labels: csv
>
> import java.sql.Date
> case class DateClass(date: java.sql.Date)
> val df = spark.createDataFrame(Seq(DateClass(new Date(1468774636000L
> df.write.csv("test.csv")
> file content is 16999, days since epoch instead of 7/17/16



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16598) Added a test case for verifying the table identifier parsing

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16598:


Assignee: (was: Apache Spark)

> Added a test case for verifying the table identifier parsing
> 
>
> Key: SPARK-16598
> URL: https://issues.apache.org/jira/browse/SPARK-16598
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> So far, the test cases of TableIdentifierParserSuite do not cover the quoted 
> cases. We should add one for avoiding regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16598) Added a test case for verifying the table identifier parsing

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16598:


Assignee: Apache Spark

> Added a test case for verifying the table identifier parsing
> 
>
> Key: SPARK-16598
> URL: https://issues.apache.org/jira/browse/SPARK-16598
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> So far, the test cases of TableIdentifierParserSuite do not cover the quoted 
> cases. We should add one for avoiding regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16598) Added a test case for verifying the table identifier parsing

2016-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381667#comment-15381667
 ] 

Apache Spark commented on SPARK-16598:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14244

> Added a test case for verifying the table identifier parsing
> 
>
> Key: SPARK-16598
> URL: https://issues.apache.org/jira/browse/SPARK-16598
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> So far, the test cases of TableIdentifierParserSuite do not cover the quoted 
> cases. We should add one for avoiding regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16598) Added a test case for verifying the table identifier parsing

2016-07-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-16598:

Summary: Added a test case for verifying the table identifier parsing  
(was: Added a test case for verifying the table identifier parsing.)

> Added a test case for verifying the table identifier parsing
> 
>
> Key: SPARK-16598
> URL: https://issues.apache.org/jira/browse/SPARK-16598
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> So far, the test cases of TableIdentifierParserSuite do not cover the quoted 
> cases. We should add one for avoiding regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-16598) Added a test case for verifying the table identifier parsing.

2016-07-17 Thread Xiao Li (JIRA)

Xiao Li created SPARK-16598:
---

 Summary: Added a test case for verifying the table identifier 
parsing.
 Key: SPARK-16598
 URL: https://issues.apache.org/jira/browse/SPARK-16598
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


So far, the test cases of TableIdentifierParserSuite do not cover the quoted 
cases. We should add one for avoiding regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-16597) DataFrame DateType is written as an int(Days since epoch) by csv writer

2016-07-17 Thread Dean Chen (JIRA)

Dean Chen created SPARK-16597:
-

 Summary: DataFrame DateType is written as an int(Days since epoch) 
by csv writer
 Key: SPARK-16597
 URL: https://issues.apache.org/jira/browse/SPARK-16597
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Dean Chen


import java.sql.Date
case class DateClass(date: java.sql.Date)
val df = spark.createDataFrame(Seq(DateClass(new Date(1468774636000L
df.write.csv("test.csv")

file content is 16999, days since epoch instead of 7/17/16




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-15810) Aggregator doesn't play nice with Option

2016-07-17 Thread koert kuipers (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381637#comment-15381637
 ] 

koert kuipers edited comment on SPARK-15810 at 7/18/16 2:28 AM:


i believe the issue with null as the zero shows up in both scala and java api, 
correct?
i think it makes sense to have a separate issue for it, also under SPARK-16390


was (Author: koert):
i believe the issue null shows up in both scala and java api, correct?
i think it makes sense to have a separate issue for it, also under SPARK-16390

> Aggregator doesn't play nice with Option
> 
>
> Key: SPARK-15810
> URL: https://issues.apache.org/jira/browse/SPARK-15810
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT
>Reporter: koert kuipers
>
> {code}
> val ds1 = List(("a", 1), ("a", 2), ("a", 3)).toDS
> val ds2 = ds1.map{ case (k, v) => (k, if (v > 1) Some(v) else None) }
> val ds3 = ds2.groupByKey(_._1).agg(new Aggregator[(String, Option[Int]), 
> Option[Int], Option[Int]]{
>   def zero: Option[Int] = None
>   def reduce(b: Option[Int], a: (String, Option[Int])): Option[Int] = 
> b.map(bv => a._2.map(av => bv + av).getOrElse(bv)).orElse(a._2)
>   def merge(b1: Option[Int], b2: Option[Int]): Option[Int] = b1.map(b1v => 
> b2.map(b2v => b1v + b2v).getOrElse(b1v)).orElse(b2)
>   def finish(reduction: Option[Int]): Option[Int] = reduction
>   def bufferEncoder: Encoder[Option[Int]] = implicitly[Encoder[Option[Int]]]
>   def outputEncoder: Encoder[Option[Int]] = implicitly[Encoder[Option[Int]]]
> }.toColumn)
> ds3.printSchema
> ds3.show
> {code}
> i get as output a somewhat odd looking schema, and after that the program 
> just hangs pinning one cpu at 100%. the data never shows.
> output:
> {noformat}
> root
>  |-- value: string (nullable = true)
>  |-- $anon$1(scala.Tuple2): struct (nullable = true)
>  ||-- value: integer (nullable = true)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15810) Aggregator doesn't play nice with Option

2016-07-17 Thread koert kuipers (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381637#comment-15381637
 ] 

koert kuipers commented on SPARK-15810:
---

i believe the issue null shows up in both scala and java api, correct?
i think it makes sense to have a separate issue for it, also under SPARK-16390

> Aggregator doesn't play nice with Option
> 
>
> Key: SPARK-15810
> URL: https://issues.apache.org/jira/browse/SPARK-15810
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT
>Reporter: koert kuipers
>
> {code}
> val ds1 = List(("a", 1), ("a", 2), ("a", 3)).toDS
> val ds2 = ds1.map{ case (k, v) => (k, if (v > 1) Some(v) else None) }
> val ds3 = ds2.groupByKey(_._1).agg(new Aggregator[(String, Option[Int]), 
> Option[Int], Option[Int]]{
>   def zero: Option[Int] = None
>   def reduce(b: Option[Int], a: (String, Option[Int])): Option[Int] = 
> b.map(bv => a._2.map(av => bv + av).getOrElse(bv)).orElse(a._2)
>   def merge(b1: Option[Int], b2: Option[Int]): Option[Int] = b1.map(b1v => 
> b2.map(b2v => b1v + b2v).getOrElse(b1v)).orElse(b2)
>   def finish(reduction: Option[Int]): Option[Int] = reduction
>   def bufferEncoder: Encoder[Option[Int]] = implicitly[Encoder[Option[Int]]]
>   def outputEncoder: Encoder[Option[Int]] = implicitly[Encoder[Option[Int]]]
> }.toColumn)
> ds3.printSchema
> ds3.show
> {code}
> i get as output a somewhat odd looking schema, and after that the program 
> just hangs pinning one cpu at 100%. the data never shows.
> output:
> {noformat}
> root
>  |-- value: string (nullable = true)
>  |-- $anon$1(scala.Tuple2): struct (nullable = true)
>  ||-- value: integer (nullable = true)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15810) Aggregator doesn't play nice with Option

2016-07-17 Thread koert kuipers (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381633#comment-15381633
 ] 

koert kuipers commented on SPARK-15810:
---

ok thats an improvement, because i got the same odd schema but no result 
(program hangs)

> Aggregator doesn't play nice with Option
> 
>
> Key: SPARK-15810
> URL: https://issues.apache.org/jira/browse/SPARK-15810
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT
>Reporter: koert kuipers
>
> {code}
> val ds1 = List(("a", 1), ("a", 2), ("a", 3)).toDS
> val ds2 = ds1.map{ case (k, v) => (k, if (v > 1) Some(v) else None) }
> val ds3 = ds2.groupByKey(_._1).agg(new Aggregator[(String, Option[Int]), 
> Option[Int], Option[Int]]{
>   def zero: Option[Int] = None
>   def reduce(b: Option[Int], a: (String, Option[Int])): Option[Int] = 
> b.map(bv => a._2.map(av => bv + av).getOrElse(bv)).orElse(a._2)
>   def merge(b1: Option[Int], b2: Option[Int]): Option[Int] = b1.map(b1v => 
> b2.map(b2v => b1v + b2v).getOrElse(b1v)).orElse(b2)
>   def finish(reduction: Option[Int]): Option[Int] = reduction
>   def bufferEncoder: Encoder[Option[Int]] = implicitly[Encoder[Option[Int]]]
>   def outputEncoder: Encoder[Option[Int]] = implicitly[Encoder[Option[Int]]]
> }.toColumn)
> ds3.printSchema
> ds3.show
> {code}
> i get as output a somewhat odd looking schema, and after that the program 
> just hangs pinning one cpu at 100%. the data never shows.
> output:
> {noformat}
> root
>  |-- value: string (nullable = true)
>  |-- $anon$1(scala.Tuple2): struct (nullable = true)
>  ||-- value: integer (nullable = true)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16027) Fix SparkR session unit test

2016-07-17 Thread Shivaram Venkataraman (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-16027:
--
Assignee: Felix Cheung  (was: Apache Spark)

> Fix SparkR session unit test
> 
>
> Key: SPARK-16027
> URL: https://issues.apache.org/jira/browse/SPARK-16027
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
> Fix For: 2.1.0
>
>
> As described in https://github.com/apache/spark/pull/13635/files, the test 
> titled "repeatedly starting and stopping SparkR" does not seem to work 
> consistently with the new sparkR.session code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-16027) Fix SparkR session unit test

2016-07-17 Thread Shivaram Venkataraman (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-16027.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14177
[https://github.com/apache/spark/pull/14177]

> Fix SparkR session unit test
> 
>
> Key: SPARK-16027
> URL: https://issues.apache.org/jira/browse/SPARK-16027
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
> Fix For: 2.1.0
>
>
> As described in https://github.com/apache/spark/pull/13635/files, the test 
> titled "repeatedly starting and stopping SparkR" does not seem to work 
> consistently with the new sparkR.session code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16593) Provide a pre-fetch mechanism to accelerate shuffle stage.

2016-07-17 Thread Biao Ma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Ma updated SPARK-16593:

Summary: Provide a pre-fetch mechanism to accelerate shuffle stage.  (was: 
a)

> Provide a pre-fetch mechanism to accelerate shuffle stage.
> --
>
> Key: SPARK-16593
> URL: https://issues.apache.org/jira/browse/SPARK-16593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Biao Ma
>Priority: Minor
>  Labels: features
>
> Currently, the `NettyBlockRpcServer` will reading data through BlockManager, 
> while the block is not cached in memory, the data should be read from DISK 
> first, then into MEM. I wonder if we implement a mechanism add a message 
> contains the blockIds that the same as the openBlock message but one loop 
> ahead, then the `NettyBlockRpcServer ` will load the block ready for transfer 
> to  the reduce side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16593) a

2016-07-17 Thread Biao Ma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Ma updated SPARK-16593:

Summary: a  (was: Provide a pre-fetch mechanism to accelerate  shuffle 
stage.)

> a
> -
>
> Key: SPARK-16593
> URL: https://issues.apache.org/jira/browse/SPARK-16593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Biao Ma
>Priority: Minor
>  Labels: features
>
> Currently, the `NettyBlockRpcServer` will reading data through BlockManager, 
> while the block is not cached in memory, the data should be read from DISK 
> first, then into MEM. I wonder if we implement a mechanism add a message 
> contains the blockIds that the same as the openBlock message but one loop 
> ahead, then the `NettyBlockRpcServer ` will load the block ready for transfer 
> to  the reduce side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12420) Have a built-in CSV data source implementation

2016-07-17 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12420:

Assignee: (was: Hyukjin Kwon)

> Have a built-in CSV data source implementation
> --
>
> Key: SPARK-12420
> URL: https://issues.apache.org/jira/browse/SPARK-12420
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 2.0.0
>
> Attachments: Built-in CSV datasource in Spark.pdf
>
>
> CSV is the most common data format in the "small data" world. It is often the 
> first format people want to try when they see Spark on a single node. Making 
> this built-in for the most common source can provide a better experience for 
> first-time users.
> We should consider inlining https://github.com/databricks/spark-csv



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-16533) Spark application not handling preemption messages

2016-07-17 Thread Emaad Manzoor (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381616#comment-15381616
 ] 

Emaad Manzoor edited comment on SPARK-16533 at 7/18/16 1:39 AM:


I had the same issue running on EC2 with single-core (m3.medium) nodes.

I was able to resolve it using the workaround mentioned in this issue: 
https://issues.apache.org/jira/browse/SPARK-13906

In {{spark-defaults.conf}} set {{spark.rpc.netty.dispatcher.numThreads 2}}.


was (Author: emaadmanzoor):
I had the same issue running on EC2 with single-core nodes.

I was able to resolve it using the workaround mentioned in this issue: 
https://issues.apache.org/jira/browse/SPARK-13906

> Spark application not handling preemption messages
> --
>
> Key: SPARK-16533
> URL: https://issues.apache.org/jira/browse/SPARK-16533
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Input/Output, Optimizer, Scheduler, Spark Submit, 
> YARN
>Affects Versions: 1.6.0
> Environment: Yarn version: Hadoop 2.7.1-amzn-0
> AWS EMR Cluster running:
> 1 x r3.8xlarge (Master)
> 52 x r3.8xlarge (Core)
> Spark version : 1.6.0
> Scala version: 2.10.5
> Java version: 1.8.0_51
> Input size: ~10 tb
> Input coming from S3
> Queue Configuration:
> Dynamic allocation: enabled
> Preemption: enabled
> Q1: 70% capacity with max of 100%
> Q2: 30% capacity with max of 100%
> Job Configuration:
> Driver memory = 10g
> Executor cores = 6
> Executor memory = 10g
> Deploy mode = cluster
> Master = yarn
> maxResultSize = 4g
> Shuffle manager = hash
>Reporter: Lucas Winkelmann
>
> Here is the scenario:
> I launch job 1 into Q1 and allow it to grow to 100% cluster utilization.
> I wait between 15-30 mins ( for this job to complete with 100% of the cluster 
> available takes about 1hr so job 1 is between 25-50% complete). Note that if 
> I wait less time then the issue sometimes does not occur, it appears to be 
> only after the job 1 is at least 25% complete.
> I launch job 2 into Q2 and preemption occurs on the Q1 shrinking the job to 
> allow 70% of cluster utilization.
> At this point job 1 basically halts progress while job 2 continues to execute 
> as normal and finishes. Job 2 either:
> - Fails its attempt and restarts. By the time this attempt fails the other 
> job is already complete meaning the second attempt has full cluster 
> availability and finishes.
> - The job remains at its current progress and simply does not finish ( I have 
> waited ~6 hrs until finally killing the application ).
>  
> Looking into the error log there is this constant error message:
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RemoveExecutor(454,Container container_1468422920649_0001_01_000594 on host: 
> ip-NUMBERS.ec2.internal was preempted.)] in X attempts
>  
> My observations have led me to believe that the application master does not 
> know about this container being killed and continuously asks the container to 
> remove the executor until eventually failing the attempt or continue trying 
> to remove the executor.
>  
> I have done much digging online for anyone else experiencing this issue but 
> have come up with nothing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16533) Spark application not handling preemption messages

2016-07-17 Thread Emaad Manzoor (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381616#comment-15381616
 ] 

Emaad Manzoor commented on SPARK-16533:
---

I had the same issue running on EC2 with single-core nodes.

I was able to resolve it using the workaround mentioned in this issue: 
https://issues.apache.org/jira/browse/SPARK-13906

> Spark application not handling preemption messages
> --
>
> Key: SPARK-16533
> URL: https://issues.apache.org/jira/browse/SPARK-16533
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Input/Output, Optimizer, Scheduler, Spark Submit, 
> YARN
>Affects Versions: 1.6.0
> Environment: Yarn version: Hadoop 2.7.1-amzn-0
> AWS EMR Cluster running:
> 1 x r3.8xlarge (Master)
> 52 x r3.8xlarge (Core)
> Spark version : 1.6.0
> Scala version: 2.10.5
> Java version: 1.8.0_51
> Input size: ~10 tb
> Input coming from S3
> Queue Configuration:
> Dynamic allocation: enabled
> Preemption: enabled
> Q1: 70% capacity with max of 100%
> Q2: 30% capacity with max of 100%
> Job Configuration:
> Driver memory = 10g
> Executor cores = 6
> Executor memory = 10g
> Deploy mode = cluster
> Master = yarn
> maxResultSize = 4g
> Shuffle manager = hash
>Reporter: Lucas Winkelmann
>
> Here is the scenario:
> I launch job 1 into Q1 and allow it to grow to 100% cluster utilization.
> I wait between 15-30 mins ( for this job to complete with 100% of the cluster 
> available takes about 1hr so job 1 is between 25-50% complete). Note that if 
> I wait less time then the issue sometimes does not occur, it appears to be 
> only after the job 1 is at least 25% complete.
> I launch job 2 into Q2 and preemption occurs on the Q1 shrinking the job to 
> allow 70% of cluster utilization.
> At this point job 1 basically halts progress while job 2 continues to execute 
> as normal and finishes. Job 2 either:
> - Fails its attempt and restarts. By the time this attempt fails the other 
> job is already complete meaning the second attempt has full cluster 
> availability and finishes.
> - The job remains at its current progress and simply does not finish ( I have 
> waited ~6 hrs until finally killing the application ).
>  
> Looking into the error log there is this constant error message:
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RemoveExecutor(454,Container container_1468422920649_0001_01_000594 on host: 
> ip-NUMBERS.ec2.internal was preempted.)] in X attempts
>  
> My observations have led me to believe that the application master does not 
> know about this container being killed and continuously asks the container to 
> remove the executor until eventually failing the attempt or continue trying 
> to remove the executor.
>  
> I have done much digging online for anyone else experiencing this issue but 
> have come up with nothing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-10683) Source code missing for SparkR test JAR

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-10683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10683:


Assignee: (was: Apache Spark)

> Source code missing for SparkR test JAR
> ---
>
> Key: SPARK-10683
> URL: https://issues.apache.org/jira/browse/SPARK-10683
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Terry Moschou
>Priority: Minor
>
> A compiled JAR, located at 
> {code}
> /R/pkg/inst/test_support/sparktestjar_2.10-1.0.jar
> {code}
> has no corresponding source code



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16510) Move SparkR test JAR into Spark, include its source code

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16510:


Assignee: Apache Spark  (was: Shivaram Venkataraman)

> Move SparkR test JAR into Spark, include its source code
> 
>
> Key: SPARK-16510
> URL: https://issues.apache.org/jira/browse/SPARK-16510
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> One of the `NOTE`s in the R CMD check is that we currently include a test JAR 
> file in SparkR which is a binary only artifact. I think we can take two steps 
> to address this
> (a) I think we should include the source code for this in say core/src/test/ 
> or something like that. As far as I know the JAR file just needs to have a 
> single method. 
> (b) We should move the JAR file out of the SparkR test support and into some 
> other location in Spark. The trouble is that its tricky to run the test with 
> CRAN mode then. We could either disable the test for CRAN or download the JAR 
> from an external URL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16510) Move SparkR test JAR into Spark, include its source code

2016-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381585#comment-15381585
 ] 

Apache Spark commented on SPARK-16510:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/14243

> Move SparkR test JAR into Spark, include its source code
> 
>
> Key: SPARK-16510
> URL: https://issues.apache.org/jira/browse/SPARK-16510
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>
> One of the `NOTE`s in the R CMD check is that we currently include a test JAR 
> file in SparkR which is a binary only artifact. I think we can take two steps 
> to address this
> (a) I think we should include the source code for this in say core/src/test/ 
> or something like that. As far as I know the JAR file just needs to have a 
> single method. 
> (b) We should move the JAR file out of the SparkR test support and into some 
> other location in Spark. The trouble is that its tricky to run the test with 
> CRAN mode then. We could either disable the test for CRAN or download the JAR 
> from an external URL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-3728) RandomForest: Learn models too large to store in memory

2016-07-17 Thread Xusen Yin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381583#comment-15381583
 ] 

Xusen Yin edited comment on SPARK-3728 at 7/17/16 11:46 PM:


Not now. Because I thought the BFS style could reach the best parallelism, 
while the DFS may hurt the parallel ability. And IMHO the BFS style training is 
not the root cause of out-of-memory during the training phase of RandomForest. 
Do you have any suggestions on this?


was (Author: yinxusen):
Not now. Because I thought the BFS style could reach the best parallelism, 
while the DFS may harm the parallel ability. And IMHO the BFS style training is 
not the root cause of out-of-memory during the training phase of RandomForest. 
Do you have any suggestions on this?

> RandomForest: Learn models too large to store in memory
> ---
>
> Key: SPARK-3728
> URL: https://issues.apache.org/jira/browse/SPARK-3728
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Proposal: Write trees to disk as they are learned.
> RandomForest currently uses a FIFO queue, which means training all trees at 
> once via breadth-first search.  Using a FILO queue would encourage the code 
> to finish one tree before moving on to new ones.  This would allow the code 
> to write trees to disk as they are learned.
> Note: It would also be possible to write nodes to disk as they are learned 
> using a FIFO queue, once the example--node mapping is cached [JIRA].  The 
> [Sequoia Forest package]() does this.  However, it could be useful to learn 
> trees progressively, so that future functionality such as early stopping 
> (training fewer trees than expected) could be supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-10683) Source code missing for SparkR test JAR

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-10683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10683:


Assignee: Apache Spark

> Source code missing for SparkR test JAR
> ---
>
> Key: SPARK-10683
> URL: https://issues.apache.org/jira/browse/SPARK-10683
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Terry Moschou
>Assignee: Apache Spark
>Priority: Minor
>
> A compiled JAR, located at 
> {code}
> /R/pkg/inst/test_support/sparktestjar_2.10-1.0.jar
> {code}
> has no corresponding source code



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-10683) Source code missing for SparkR test JAR

2016-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381584#comment-15381584
 ] 

Apache Spark commented on SPARK-10683:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/14243

> Source code missing for SparkR test JAR
> ---
>
> Key: SPARK-10683
> URL: https://issues.apache.org/jira/browse/SPARK-10683
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Terry Moschou
>Priority: Minor
>
> A compiled JAR, located at 
> {code}
> /R/pkg/inst/test_support/sparktestjar_2.10-1.0.jar
> {code}
> has no corresponding source code



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16510) Move SparkR test JAR into Spark, include its source code

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16510:


Assignee: Shivaram Venkataraman  (was: Apache Spark)

> Move SparkR test JAR into Spark, include its source code
> 
>
> Key: SPARK-16510
> URL: https://issues.apache.org/jira/browse/SPARK-16510
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>
> One of the `NOTE`s in the R CMD check is that we currently include a test JAR 
> file in SparkR which is a binary only artifact. I think we can take two steps 
> to address this
> (a) I think we should include the source code for this in say core/src/test/ 
> or something like that. As far as I know the JAR file just needs to have a 
> single method. 
> (b) We should move the JAR file out of the SparkR test support and into some 
> other location in Spark. The trouble is that its tricky to run the test with 
> CRAN mode then. We could either disable the test for CRAN or download the JAR 
> from an external URL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3728) RandomForest: Learn models too large to store in memory

2016-07-17 Thread Xusen Yin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381583#comment-15381583
 ] 

Xusen Yin commented on SPARK-3728:
--

Not now. Because I thought the BFS style could reach the best parallelism, 
while the DFS may harm the parallel ability. And IMHO the BFS style training is 
not the root cause of out-of-memory during the training phase of RandomForest. 
Do you have any suggestions on this?

> RandomForest: Learn models too large to store in memory
> ---
>
> Key: SPARK-3728
> URL: https://issues.apache.org/jira/browse/SPARK-3728
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Proposal: Write trees to disk as they are learned.
> RandomForest currently uses a FIFO queue, which means training all trees at 
> once via breadth-first search.  Using a FILO queue would encourage the code 
> to finish one tree before moving on to new ones.  This would allow the code 
> to write trees to disk as they are learned.
> Note: It would also be possible to write nodes to disk as they are learned 
> using a FIFO queue, once the example--node mapping is cached [JIRA].  The 
> [Sequoia Forest package]() does this.  However, it could be useful to learn 
> trees progressively, so that future functionality such as early stopping 
> (training fewer trees than expected) could be supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16510) Move SparkR test JAR into Spark, include its source code

2016-07-17 Thread Shivaram Venkataraman (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-16510:
-

Assignee: Shivaram Venkataraman

> Move SparkR test JAR into Spark, include its source code
> 
>
> Key: SPARK-16510
> URL: https://issues.apache.org/jira/browse/SPARK-16510
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>
> One of the `NOTE`s in the R CMD check is that we currently include a test JAR 
> file in SparkR which is a binary only artifact. I think we can take two steps 
> to address this
> (a) I think we should include the source code for this in say core/src/test/ 
> or something like that. As far as I know the JAR file just needs to have a 
> single method. 
> (b) We should move the JAR file out of the SparkR test support and into some 
> other location in Spark. The trouble is that its tricky to run the test with 
> CRAN mode then. We could either disable the test for CRAN or download the JAR 
> from an external URL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16510) Move SparkR test JAR into Spark, include its source code

2016-07-17 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381580#comment-15381580
 ] 

Shivaram Venkataraman commented on SPARK-16510:
---

I actually found a better way to do this by moving the test into our Scala test 
suite. Will send a PR soon

> Move SparkR test JAR into Spark, include its source code
> 
>
> Key: SPARK-16510
> URL: https://issues.apache.org/jira/browse/SPARK-16510
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>
> One of the `NOTE`s in the R CMD check is that we currently include a test JAR 
> file in SparkR which is a binary only artifact. I think we can take two steps 
> to address this
> (a) I think we should include the source code for this in say core/src/test/ 
> or something like that. As far as I know the JAR file just needs to have a 
> single method. 
> (b) We should move the JAR file out of the SparkR test support and into some 
> other location in Spark. The trouble is that its tricky to run the test with 
> CRAN mode then. We could either disable the test for CRAN or download the JAR 
> from an external URL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16589) Chained cartesian produces incorrect number of records

2016-07-17 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381557#comment-15381557
 ] 

Dongjoon Hyun commented on SPARK-16589:
---

Oh, Indeed, there is a bug of PySpark.
Could you make a PR for this?

> Chained cartesian produces incorrect number of records
> --
>
> Key: SPARK-16589
> URL: https://issues.apache.org/jira/browse/SPARK-16589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> Chaining cartesian calls in PySpark results in the number of records lower 
> than expected. It can be reproduced as follows:
> {code}
> rdd = sc.parallelize(range(10), 1)
> rdd.cartesian(rdd).cartesian(rdd).count()
> ## 355
> {code}
> It looks like it is related to serialization. If we reserialize after initial 
> cartesian:
> {code}
> rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), 
> 1)).cartesian(rdd).count()
> {code}
> or insert identity map:
> {code}
> rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count()
> {code}
> it yields correct results.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16596) Refactor DataSourceScanExec to do partition discovery at execution instead of planning time

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16596:


Assignee: Apache Spark

> Refactor DataSourceScanExec to do partition discovery at execution instead of 
> planning time
> ---
>
> Key: SPARK-16596
> URL: https://issues.apache.org/jira/browse/SPARK-16596
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Partition discovery is rather expensive, so we should do it at execution time 
> instead of during physical planning. Right now there is not much benefit 
> since ListingFileCatalog will read scan for all partitions at planning time 
> anyways, but this can be optimized in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16596) Refactor DataSourceScanExec to do partition discovery at execution instead of planning time

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16596:


Assignee: (was: Apache Spark)

> Refactor DataSourceScanExec to do partition discovery at execution instead of 
> planning time
> ---
>
> Key: SPARK-16596
> URL: https://issues.apache.org/jira/browse/SPARK-16596
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>Priority: Minor
>
> Partition discovery is rather expensive, so we should do it at execution time 
> instead of during physical planning. Right now there is not much benefit 
> since ListingFileCatalog will read scan for all partitions at planning time 
> anyways, but this can be optimized in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16596) Refactor DataSourceScanExec to do partition discovery at execution instead of planning time

2016-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381553#comment-15381553
 ] 

Apache Spark commented on SPARK-16596:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/14241

> Refactor DataSourceScanExec to do partition discovery at execution instead of 
> planning time
> ---
>
> Key: SPARK-16596
> URL: https://issues.apache.org/jira/browse/SPARK-16596
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>Priority: Minor
>
> Partition discovery is rather expensive, so we should do it at execution time 
> instead of during physical planning. Right now there is not much benefit 
> since ListingFileCatalog will read scan for all partitions at planning time 
> anyways, but this can be optimized in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-16596) Refactor DataSourceScanExec to do partition discovery at execution instead of planning time

2016-07-17 Thread Eric Liang (JIRA)

Eric Liang created SPARK-16596:
--

 Summary: Refactor DataSourceScanExec to do partition discovery at 
execution instead of planning time
 Key: SPARK-16596
 URL: https://issues.apache.org/jira/browse/SPARK-16596
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Eric Liang
Priority: Minor


Partition discovery is rather expensive, so we should do it at execution time 
instead of during physical planning. Right now there is not much benefit since 
ListingFileCatalog will read scan for all partitions at planning time anyways, 
but this can be optimized in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16576) Move plan SQL generation code from SQLBuilder into logical operators

2016-07-17 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381525#comment-15381525
 ] 

Dongjoon Hyun commented on SPARK-16576:
---

Oh, I see. I could access from the Spark shell, so I was confused. Thank you!

> Move plan SQL generation code from SQLBuilder into logical operators
> 
>
> Key: SPARK-16576
> URL: https://issues.apache.org/jira/browse/SPARK-16576
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> We currently handle all SQL query generation in a single class (SQLBuilder). 
> This has many disadvantages:
> 1. It is not extensible, i.e. it is not possible to introduce a new logical 
> operator, even just for experimentation purpose, without modifying Spark.
> 2. It is very fragile. When we introduce a new logical operator, it is very 
> likely that we forget to update SQLBuilder and then the use of that new 
> logical operator would fail view definition.
> We should move the SQL definition part into logical operators themselves, so 
> this becomes more robust and scalable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16576) Move plan SQL generation code from SQLBuilder into logical operators

2016-07-17 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381518#comment-15381518
 ] 

Reynold Xin commented on SPARK-16576:
-

Everything in catalyst module is private.




> Move plan SQL generation code from SQLBuilder into logical operators
> 
>
> Key: SPARK-16576
> URL: https://issues.apache.org/jira/browse/SPARK-16576
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> We currently handle all SQL query generation in a single class (SQLBuilder). 
> This has many disadvantages:
> 1. It is not extensible, i.e. it is not possible to introduce a new logical 
> operator, even just for experimentation purpose, without modifying Spark.
> 2. It is very fragile. When we introduce a new logical operator, it is very 
> likely that we forget to update SQLBuilder and then the use of that new 
> logical operator would fail view definition.
> We should move the SQL definition part into logical operators themselves, so 
> this becomes more robust and scalable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16576) Move plan SQL generation code from SQLBuilder into logical operators

2016-07-17 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381516#comment-15381516
 ] 

Dongjoon Hyun commented on SPARK-16576:
---

After moving the SQL generation codes into Logical Operators, `SQLBuilder` is 
needed to be a deprecated class for a while since it was a public class. Or, is 
there no need to keep `SQLBuilder`?

> Move plan SQL generation code from SQLBuilder into logical operators
> 
>
> Key: SPARK-16576
> URL: https://issues.apache.org/jira/browse/SPARK-16576
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> We currently handle all SQL query generation in a single class (SQLBuilder). 
> This has many disadvantages:
> 1. It is not extensible, i.e. it is not possible to introduce a new logical 
> operator, even just for experimentation purpose, without modifying Spark.
> 2. It is very fragile. When we introduce a new logical operator, it is very 
> likely that we forget to update SQLBuilder and then the use of that new 
> logical operator would fail view definition.
> We should move the SQL definition part into logical operators themselves, so 
> this becomes more robust and scalable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0

2016-07-17 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381506#comment-15381506
 ] 

Shivaram Venkataraman commented on SPARK-14816:
---

We didn't create a website for SparkR as we didn't have one for PySpark as 
well. I don't know if its worth adding one
[~rxin] [~felixcheung] [~mengxr] Any thoughts on this ?

> Update MLlib, GraphX, SparkR websites for 2.0
> -
>
> Key: SPARK-14816
> URL: https://issues.apache.org/jira/browse/SPARK-14816
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.
> For MLlib, make it clear that the DataFrame-based API is the primary one now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-16595) Spark History server Rest Api gives Application not found error

2016-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16595.
---
Resolution: Duplicate

Um, you already reported this. Are you not following this existing thread?

> Spark History server Rest Api gives Application not found error
> ---
>
> Key: SPARK-16595
> URL: https://issues.apache.org/jira/browse/SPARK-16595
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Scenario:
> * Start SparkPi application in Spark1 (application_1468686376753_0041) and 
> * After application finishes validate application exists in respective Spark 
> History server.
> {code}
> Error loading url 
> http://xx.xx.xx.xx:18080/api/v1/applications/application_1468686376753_0041/1/executors
> HTTP Code: 404
> HTTP Data: no such app: application_1468686376753_0041{code}
> {code:title=spark HS log}
> 16/07/16 15:55:29 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0049.inprogress
> 16/07/16 15:56:20 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0049
> 16/07/16 16:23:14 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0061.inprogress
> 16/07/16 16:24:14 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0061
> 16/07/16 17:42:32 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468690940553.inprogress
> 16/07/16 17:43:22 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468690940553
> 16/07/16 17:43:44 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468691017376.inprogress
> 16/07/16 17:44:34 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/local-1468691017376
> 16/07/16 18:53:10 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0041_1.inprogress
> 16/07/16 19:03:26 INFO PackagesResourceConfig: Scanning for root resource and 
> provider classes in the packages:
>   org.apache.spark.status.api.v1
> 16/07/16 19:03:35 INFO ScanningResourceConfig: Root resource classes found:
>   class org.apache.spark.status.api.v1.ApiRootResource
> 16/07/16 19:03:35 INFO ScanningResourceConfig: Provider classes found:
>   class org.apache.spark.status.api.v1.JacksonMessageWriter
> 16/07/16 19:03:35 INFO WebApplicationImpl: Initiating Jersey application, 
> version 'Jersey: 1.9 09/02/2011 11:17 AM'
> 16/07/16 19:03:36 INFO SecurityManager: Changing view acls to: spark
> 16/07/16 19:03:36 INFO SecurityManager: Changing modify acls to: spark
> 16/07/16 19:03:36 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(spark); users 
> with modify permissions: Set(spark)
> 16/07/16 19:03:36 INFO ApplicationCache: Failed to load application attempt 
> application_1468686376753_0041/Some(1)
> 16/07/16 19:04:21 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043.inprogress
> 16/07/16 19:12:02 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043
> 16/07/16 19:16:11 INFO SecurityManager: Changing view acls to: spark
> 16/07/16 19:16:11 INFO SecurityManager: Changing modify acls to: spark
> 16/07/16 19:16:11 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(spark); users 
> with modify permissions: Set(spark)
> 16/07/16 19:16:11 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043
> 16/07/16 19:16:22 INFO SecurityManager: Changing acls enabled to: false
> 16/07/16 19:16:22 INFO SecurityManager: Changing admin acls to:
> 16/07/16 19:16:22 INFO SecurityManager: Changing view acls to: hrt_qa{code}
> {code}
> hdfs@xxx:/var/log/spark$ hdfs dfs -ls /spark-history/
> Found 8 items
> -rwxrwx---   3 hrt_qa hadoop  28793 2016-07-16 15:56 
> /spark-history/application_1468678823755_0049
> -rwxrwx---   3 hrt_qa hadoop  28763 2016-07-16 16:24 
> /spark-history/application_1468678823755_0061
> -rwxrwx---   3 hrt_qa hadoop   58868885 2016-07-16 18:59 
> /spark-history/application_1468686376753_0041_1
> -rwxrwx---   3 hrt_qa hadoop   58841982 2016-07-16 19:11 
> /spark-history/application_1468686376753_0043
> -rwxrwx---   3 hive   hadoop   5823 2016-07-16 11:38 
> /spark-history/local-1468666932940
> -rwxrwx---   3 hive   hadoop   5757 2016-07-16 22:44 
> /spark-history/local-1468669677840.inprogre

[jira] [Created] (SPARK-16595) Spark History server Rest Api gives Application not found error

2016-07-17 Thread Yesha Vora (JIRA)

Yesha Vora created SPARK-16595:
--

 Summary: Spark History server Rest Api gives Application not found 
error
 Key: SPARK-16595
 URL: https://issues.apache.org/jira/browse/SPARK-16595
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Yesha Vora


Scenario:
* Start SparkPi application in Spark1 (application_1468686376753_0041) and 
* After application finishes validate application exists in respective Spark 
History server.

{code}
Error loading url 
http://xx.xx.xx.xx:18080/api/v1/applications/application_1468686376753_0041/1/executors
HTTP Code: 404
HTTP Data: no such app: application_1468686376753_0041{code}

{code:title=spark HS log}
16/07/16 15:55:29 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0049.inprogress
16/07/16 15:56:20 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0049
16/07/16 16:23:14 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0061.inprogress
16/07/16 16:24:14 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468678823755_0061
16/07/16 17:42:32 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/local-1468690940553.inprogress
16/07/16 17:43:22 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/local-1468690940553
16/07/16 17:43:44 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/local-1468691017376.inprogress
16/07/16 17:44:34 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/local-1468691017376
16/07/16 18:53:10 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0041_1.inprogress
16/07/16 19:03:26 INFO PackagesResourceConfig: Scanning for root resource and 
provider classes in the packages:
  org.apache.spark.status.api.v1
16/07/16 19:03:35 INFO ScanningResourceConfig: Root resource classes found:
  class org.apache.spark.status.api.v1.ApiRootResource
16/07/16 19:03:35 INFO ScanningResourceConfig: Provider classes found:
  class org.apache.spark.status.api.v1.JacksonMessageWriter
16/07/16 19:03:35 INFO WebApplicationImpl: Initiating Jersey application, 
version 'Jersey: 1.9 09/02/2011 11:17 AM'
16/07/16 19:03:36 INFO SecurityManager: Changing view acls to: spark
16/07/16 19:03:36 INFO SecurityManager: Changing modify acls to: spark
16/07/16 19:03:36 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(spark); users with 
modify permissions: Set(spark)
16/07/16 19:03:36 INFO ApplicationCache: Failed to load application attempt 
application_1468686376753_0041/Some(1)
16/07/16 19:04:21 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043.inprogress
16/07/16 19:12:02 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043
16/07/16 19:16:11 INFO SecurityManager: Changing view acls to: spark
16/07/16 19:16:11 INFO SecurityManager: Changing modify acls to: spark
16/07/16 19:16:11 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(spark); users with 
modify permissions: Set(spark)
16/07/16 19:16:11 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx.xx.xx.xx:8020/spark-history/application_1468686376753_0043
16/07/16 19:16:22 INFO SecurityManager: Changing acls enabled to: false
16/07/16 19:16:22 INFO SecurityManager: Changing admin acls to:
16/07/16 19:16:22 INFO SecurityManager: Changing view acls to: hrt_qa{code}

{code}
hdfs@xxx:/var/log/spark$ hdfs dfs -ls /spark-history/
Found 8 items
-rwxrwx---   3 hrt_qa hadoop  28793 2016-07-16 15:56 
/spark-history/application_1468678823755_0049
-rwxrwx---   3 hrt_qa hadoop  28763 2016-07-16 16:24 
/spark-history/application_1468678823755_0061
-rwxrwx---   3 hrt_qa hadoop   58868885 2016-07-16 18:59 
/spark-history/application_1468686376753_0041_1
-rwxrwx---   3 hrt_qa hadoop   58841982 2016-07-16 19:11 
/spark-history/application_1468686376753_0043
-rwxrwx---   3 hive   hadoop   5823 2016-07-16 11:38 
/spark-history/local-1468666932940
-rwxrwx---   3 hive   hadoop   5757 2016-07-16 22:44 
/spark-history/local-1468669677840.inprogress
-rwxrwx---   3 hrt_qa hadoop 484113 2016-07-16 17:43 
/spark-history/local-1468690940553
-rwxrwx---   3 hrt_qa hadoop  57747 2016-07-16 17:44 
/spark-history/local-1468691017376
hdfs@xxx:/var/log/spark$ hdfs dfs -ls 
/spark-history/application_1468686376753_0041_1
-rwxrwx---   3 hrt_qa hadoop   58868885 2016-07-16 18:59 
/spark-history/application_1468686376753_0041_1{code}



--
This message was sent by Atlas

[jira] [Commented] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0

2016-07-17 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381438#comment-15381438
 ] 

Joseph K. Bradley commented on SPARK-14816:
---

I'd say the updates for [http://spark.apache.org/mllib/] are:
* Ease of Use: "MLlib fits into Spark's APIs and interoperates with NumPy in 
Python (starting in Spark 0.9)." --> change to "MLlib fits into Spark's APIs 
and interoperates with NumPy in Python (as of Spark 0.9) and R (as of Spark 
1.5)."
* Algorithms list: Change to a list of categories, not specific algs
* Calling MLlib in Python code snippet: Change to:
{code}
data = spark.read.format("libsvm").load("hdfs://...")
model = new KMeans().setK(10).fit(data)
{code}

If this sounds good, I can make the change.

[~shivaram] SparkR does not really have a website.  Should we add one?

> Update MLlib, GraphX, SparkR websites for 2.0
> -
>
> Key: SPARK-14816
> URL: https://issues.apache.org/jira/browse/SPARK-14816
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.
> For MLlib, make it clear that the DataFrame-based API is the primary one now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16594) Physical Plan Differences when Table Scan Having Duplicate Columns

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16594:


Assignee: (was: Apache Spark)

> Physical Plan Differences when Table Scan Having Duplicate Columns
> --
>
> Key: SPARK-16594
> URL: https://issues.apache.org/jira/browse/SPARK-16594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, we keep two implementations for planning scans over data sources. 
> There is one difference between two implementation when deciding whether a 
> `Project` is needed or not. 
> - Data Source Table Scan: 
> https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L322-L395
>   - `SELECT b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**
> - Hive Table Scan and In-memory Table Scan:
> https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L71-L99
>  
>   - `SELECT b, b FROM oneToTenPruned`: **_No `ProjectExec` is added._** 
> **Note.** When alias is being used, we will always add `ProjectExec` in all 
> the scan types.
>   - `SELECT b as alias_b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**
> Because selecting the same column twice without adding `alias` is very weird, 
> no clue why the code needs to behave differently here. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16594) Physical Plan Differences when Table Scan Having Duplicate Columns

2016-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381418#comment-15381418
 ] 

Apache Spark commented on SPARK-16594:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14240

> Physical Plan Differences when Table Scan Having Duplicate Columns
> --
>
> Key: SPARK-16594
> URL: https://issues.apache.org/jira/browse/SPARK-16594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, we keep two implementations for planning scans over data sources. 
> There is one difference between two implementation when deciding whether a 
> `Project` is needed or not. 
> - Data Source Table Scan: 
> https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L322-L395
>   - `SELECT b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**
> - Hive Table Scan and In-memory Table Scan:
> https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L71-L99
>  
>   - `SELECT b, b FROM oneToTenPruned`: **_No `ProjectExec` is added._** 
> **Note.** When alias is being used, we will always add `ProjectExec` in all 
> the scan types.
>   - `SELECT b as alias_b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**
> Because selecting the same column twice without adding `alias` is very weird, 
> no clue why the code needs to behave differently here. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16594) Physical Plan Differences when Table Scan Having Duplicate Columns

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16594:


Assignee: Apache Spark

> Physical Plan Differences when Table Scan Having Duplicate Columns
> --
>
> Key: SPARK-16594
> URL: https://issues.apache.org/jira/browse/SPARK-16594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Currently, we keep two implementations for planning scans over data sources. 
> There is one difference between two implementation when deciding whether a 
> `Project` is needed or not. 
> - Data Source Table Scan: 
> https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L322-L395
>   - `SELECT b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**
> - Hive Table Scan and In-memory Table Scan:
> https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L71-L99
>  
>   - `SELECT b, b FROM oneToTenPruned`: **_No `ProjectExec` is added._** 
> **Note.** When alias is being used, we will always add `ProjectExec` in all 
> the scan types.
>   - `SELECT b as alias_b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**
> Because selecting the same column twice without adding `alias` is very weird, 
> no clue why the code needs to behave differently here. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16594) Physical Plan Differences when Table Scan Having Duplicate Columns

2016-07-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-16594:

Description: 
Currently, we keep two implementations for planning scans over data sources. 
There is one difference between two implementation when deciding whether a 
`Project` is needed or not. 

- Data Source Table Scan: 
https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L322-L395

  - `SELECT b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**

- Hive Table Scan and In-memory Table Scan:
https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L71-L99
 

  - `SELECT b, b FROM oneToTenPruned`: **_No `ProjectExec` is added._** 

**Note.** When alias is being used, we will always add `ProjectExec` in all the 
scan types.

  - `SELECT b as alias_b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**

Because selecting the same column twice without adding `alias` is very weird, 
no clue why the code needs to behave differently here. 


  was:
Currently, we keep two implementations for planning scans over data sources. 
There is one difference between two implementation when deciding whether a 
`Project` is needed or not. 

- Data Source Table Scan: 
https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L322-L395

  - `SELECT b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**

- Hive Table Scan and In-memory Table Scan:
https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L71-L99
 

  - `SELECT b, b FROM oneToTenPruned`: **_No `ProjectExec` is added._** 

**Note.** When alias is being used, we will always add `ProjectExec` in all the 
scan types.

  - `SELECT b as alias_b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**

Because selecting the same column twice without adding `alias` is very weird, 
no clue why the code needs to behave differently here. This PR is to remove the 
differences. 



> Physical Plan Differences when Table Scan Having Duplicate Columns
> --
>
> Key: SPARK-16594
> URL: https://issues.apache.org/jira/browse/SPARK-16594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, we keep two implementations for planning scans over data sources. 
> There is one difference between two implementation when deciding whether a 
> `Project` is needed or not. 
> - Data Source Table Scan: 
> https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L322-L395
>   - `SELECT b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**
> - Hive Table Scan and In-memory Table Scan:
> https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L71-L99
>  
>   - `SELECT b, b FROM oneToTenPruned`: **_No `ProjectExec` is added._** 
> **Note.** When alias is being used, we will always add `ProjectExec` in all 
> the scan types.
>   - `SELECT b as alias_b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**
> Because selecting the same column twice without adding `alias` is very weird, 
> no clue why the code needs to behave differently here. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-16594) Physical Plan Differences when Table Scan Having Duplicate Columns

2016-07-17 Thread Xiao Li (JIRA)

Xiao Li created SPARK-16594:
---

 Summary: Physical Plan Differences when Table Scan Having 
Duplicate Columns
 Key: SPARK-16594
 URL: https://issues.apache.org/jira/browse/SPARK-16594
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Currently, we keep two implementations for planning scans over data sources. 
There is one difference between two implementation when deciding whether a 
`Project` is needed or not. 

- Data Source Table Scan: 
https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L322-L395

  - `SELECT b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**

- Hive Table Scan and In-memory Table Scan:
https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L71-L99
 

  - `SELECT b, b FROM oneToTenPruned`: **_No `ProjectExec` is added._** 

**Note.** When alias is being used, we will always add `ProjectExec` in all the 
scan types.

  - `SELECT b as alias_b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**

Because selecting the same column twice without adding `alias` is very weird, 
no clue why the code needs to behave differently here. This PR is to remove the 
differences. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16593) Provide a pre-fetch mechanism to accelerate shuffle stage.

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16593:


Assignee: Apache Spark

> Provide a pre-fetch mechanism to accelerate  shuffle stage.
> ---
>
> Key: SPARK-16593
> URL: https://issues.apache.org/jira/browse/SPARK-16593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Biao Ma
>Assignee: Apache Spark
>Priority: Minor
>  Labels: features
>
> Currently, the `NettyBlockRpcServer` will reading data through BlockManager, 
> while the block is not cached in memory, the data should be read from DISK 
> first, then into MEM. I wonder if we implement a mechanism add a message 
> contains the blockIds that the same as the openBlock message but one loop 
> ahead, then the `NettyBlockRpcServer ` will load the block ready for transfer 
> to  the reduce side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16593) Provide a pre-fetch mechanism to accelerate shuffle stage.

2016-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381391#comment-15381391
 ] 

Apache Spark commented on SPARK-16593:
--

User 'f7753' has created a pull request for this issue:
https://github.com/apache/spark/pull/14239

> Provide a pre-fetch mechanism to accelerate  shuffle stage.
> ---
>
> Key: SPARK-16593
> URL: https://issues.apache.org/jira/browse/SPARK-16593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Biao Ma
>Priority: Minor
>  Labels: features
>
> Currently, the `NettyBlockRpcServer` will reading data through BlockManager, 
> while the block is not cached in memory, the data should be read from DISK 
> first, then into MEM. I wonder if we implement a mechanism add a message 
> contains the blockIds that the same as the openBlock message but one loop 
> ahead, then the `NettyBlockRpcServer ` will load the block ready for transfer 
> to  the reduce side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16593) Provide a pre-fetch mechanism to accelerate shuffle stage.

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16593:


Assignee: (was: Apache Spark)

> Provide a pre-fetch mechanism to accelerate  shuffle stage.
> ---
>
> Key: SPARK-16593
> URL: https://issues.apache.org/jira/browse/SPARK-16593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Biao Ma
>Priority: Minor
>  Labels: features
>
> Currently, the `NettyBlockRpcServer` will reading data through BlockManager, 
> while the block is not cached in memory, the data should be read from DISK 
> first, then into MEM. I wonder if we implement a mechanism add a message 
> contains the blockIds that the same as the openBlock message but one loop 
> ahead, then the `NettyBlockRpcServer ` will load the block ready for transfer 
> to  the reduce side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-16593) Provide a pre-fetch mechanism to accelerate shuffle stage.

2016-07-17 Thread Biao Ma (JIRA)

Biao Ma created SPARK-16593:
---

 Summary: Provide a pre-fetch mechanism to accelerate  shuffle 
stage.
 Key: SPARK-16593
 URL: https://issues.apache.org/jira/browse/SPARK-16593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Biao Ma
Priority: Minor


Currently, the `NettyBlockRpcServer` will reading data through BlockManager, 
while the block is not cached in memory, the data should be read from DISK 
first, then into MEM. I wonder if we implement a mechanism add a message 
contains the blockIds that the same as the openBlock message but one loop 
ahead, then the `NettyBlockRpcServer ` will load the block ready for transfer 
to  the reduce side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381335#comment-15381335
 ] 

Apache Spark commented on SPARK-16283:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14237

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16283) Implement percentile_approx SQL function

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16283:


Assignee: (was: Apache Spark)

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16283) Implement percentile_approx SQL function

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16283:


Assignee: Apache Spark

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-07-17 Thread Rahul Palamuttam (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381236#comment-15381236
 ] 

Rahul Palamuttam commented on SPARK-13634:
--

Understood and thank you for explaining.
I agree that it is pretty implicit that you can't serialize context-like 
objects, but it's a little strange when the object gets pulled in without the 
user even writing code that explicitly does so (in the shell). I agree with 
your latter point as well, and will take that into consideration. It could just 
be too specific to the use case.


> Assigning spark context to variable results in serialization error
> --
>
> Key: SPARK-13634
> URL: https://issues.apache.org/jira/browse/SPARK-13634
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Rahul Palamuttam
>Priority: Minor
>
> The following lines of code cause a task serialization error when executed in 
> the spark-shell. 
> Note that the error does not occur when submitting the code as a batch job - 
> via spark-submit.
> val temp = 10
> val newSC = sc
> val new RDD = newSC.parallelize(0 to 100).map(p => p + temp)
> For some reason when temp is being pulled in to the referencing environment 
> of the closure, so is the SparkContext. 
> We originally hit this issue in the SciSpark project, when referencing a 
> string variable inside of a lambda expression in RDD.map(...)
> Any insight into how this could be resolved would be appreciated.
> While the above code is trivial, SciSpark uses a wrapper around the 
> SparkContext to read from various file formats. We want to keep this class 
> structure and also use it in notebook and shell environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Issue Comment Deleted] (SPARK-16592) Improving ml.Logistic Regression on speed and scalability

2016-07-17 Thread yuhao yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-16592:
---
Comment: was deleted

(was: sparse data support)

> Improving ml.Logistic Regression on speed and scalability
> -
>
> Key: SPARK-16592
> URL: https://issues.apache.org/jira/browse/SPARK-16592
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> With the spreading application of Apache Spark* logistic regression, we've 
> seen more and more requirements come up about improving the speed and 
> scalability. Many suggestions and discussions have been evolving in the 
> developer and user communities.  While it may be difficult to find an 
> optimization for all the cases, understanding the various scenarios and 
> approaches will be important. 
> As discussed with [~josephkb], this JIRA is created for discussion and 
> collecting efforts on the optimization work of LR (logistic regression). All 
> the ongoing related JIRA will be linked here, as well as new ideas and 
> possibilities. 
> Users are encouraged to share their experiences/expectations on LR and track 
> the development status from the community. Developers can leverage the JIRA 
> to browse existing efforts, make communication and introduce 
> research/development resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16592) Improving ml.Logistic Regression on speed and scalability

2016-07-17 Thread yuhao yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381225#comment-15381225
 ] 

yuhao yang commented on SPARK-16592:


sparse data support

> Improving ml.Logistic Regression on speed and scalability
> -
>
> Key: SPARK-16592
> URL: https://issues.apache.org/jira/browse/SPARK-16592
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> With the spreading application of Apache Spark* logistic regression, we've 
> seen more and more requirements come up about improving the speed and 
> scalability. Many suggestions and discussions have been evolving in the 
> developer and user communities.  While it may be difficult to find an 
> optimization for all the cases, understanding the various scenarios and 
> approaches will be important. 
> As discussed with [~josephkb], this JIRA is created for discussion and 
> collecting efforts on the optimization work of LR (logistic regression). All 
> the ongoing related JIRA will be linked here, as well as new ideas and 
> possibilities. 
> Users are encouraged to share their experiences/expectations on LR and track 
> the development status from the community. Developers can leverage the JIRA 
> to browse existing efforts, make communication and introduce 
> research/development resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16588) Missed API fix for a function name mismatched between FunctionRegistry and functions.scala

2016-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381224#comment-15381224
 ] 

Apache Spark commented on SPARK-16588:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14236

> Missed API fix for a function name mismatched between FunctionRegistry and 
> functions.scala
> --
>
> Key: SPARK-16588
> URL: https://issues.apache.org/jira/browse/SPARK-16588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> It seems the function {{monotonicallyIncreasingId}} was missed in 
> https://issues.apache.org/jira/browse/SPARK-10621 
> The registered name is 
> [{{monotonically_increasing_id}}|https://github.com/apache/spark/blob/56bd399a86c4e92be412d151200cb5e4a5f6a48a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L369]
>  but 
> [{{monotonicallyIncreasingId}}|https://github.com/apache/spark/blob/5f342049cce9102fb62b4de2d8d8fa691c2e8ac4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L981]
>  was still not deprecated and removed. 
> So, this was also missed in https://issues.apache.org/jira/browse/SPARK-12600 
> (removing deprecated APIs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16588) Missed API fix for a function name mismatched between FunctionRegistry and functions.scala

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16588:


Assignee: (was: Apache Spark)

> Missed API fix for a function name mismatched between FunctionRegistry and 
> functions.scala
> --
>
> Key: SPARK-16588
> URL: https://issues.apache.org/jira/browse/SPARK-16588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> It seems the function {{monotonicallyIncreasingId}} was missed in 
> https://issues.apache.org/jira/browse/SPARK-10621 
> The registered name is 
> [{{monotonically_increasing_id}}|https://github.com/apache/spark/blob/56bd399a86c4e92be412d151200cb5e4a5f6a48a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L369]
>  but 
> [{{monotonicallyIncreasingId}}|https://github.com/apache/spark/blob/5f342049cce9102fb62b4de2d8d8fa691c2e8ac4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L981]
>  was still not deprecated and removed. 
> So, this was also missed in https://issues.apache.org/jira/browse/SPARK-12600 
> (removing deprecated APIs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-16588) Missed API fix for a function name mismatched between FunctionRegistry and functions.scala

2016-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16588:


Assignee: Apache Spark

> Missed API fix for a function name mismatched between FunctionRegistry and 
> functions.scala
> --
>
> Key: SPARK-16588
> URL: https://issues.apache.org/jira/browse/SPARK-16588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> It seems the function {{monotonicallyIncreasingId}} was missed in 
> https://issues.apache.org/jira/browse/SPARK-10621 
> The registered name is 
> [{{monotonically_increasing_id}}|https://github.com/apache/spark/blob/56bd399a86c4e92be412d151200cb5e4a5f6a48a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L369]
>  but 
> [{{monotonicallyIncreasingId}}|https://github.com/apache/spark/blob/5f342049cce9102fb62b4de2d8d8fa691c2e8ac4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L981]
>  was still not deprecated and removed. 
> So, this was also missed in https://issues.apache.org/jira/browse/SPARK-12600 
> (removing deprecated APIs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-16592) Improving ml.Logistic Regression on speed and scalability

2016-07-17 Thread yuhao yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381222#comment-15381222
 ] 

yuhao yang edited comment on SPARK-16592 at 7/17/16 8:19 AM:
-

Placeholder for lists of primary requirements and ongoing efforts:




was (Author: yuhaoyan):
Placeholder for list of primary ongoing efforts:



> Improving ml.Logistic Regression on speed and scalability
> -
>
> Key: SPARK-16592
> URL: https://issues.apache.org/jira/browse/SPARK-16592
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> With the spreading application of Apache Spark* logistic regression, we've 
> seen more and more requirements come up about improving the speed and 
> scalability. Many suggestions and discussions have been evolving in the 
> developer and user communities.  While it may be difficult to find an 
> optimization for all the cases, understanding the various scenarios and 
> approaches will be important. 
> As discussed with [~josephkb], this JIRA is created for discussion and 
> collecting efforts on the optimization work of LR (logistic regression). All 
> the ongoing related JIRA will be linked here, as well as new ideas and 
> possibilities. 
> Users are encouraged to share their experiences/expectations on LR and track 
> the development status from the community. Developers can leverage the JIRA 
> to browse existing efforts, make communication and introduce 
> research/development resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16592) Improving ml.Logistic Regression on speed and scalability

2016-07-17 Thread yuhao yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381222#comment-15381222
 ] 

yuhao yang commented on SPARK-16592:


Placeholder for list of primary ongoing efforts:



> Improving ml.Logistic Regression on speed and scalability
> -
>
> Key: SPARK-16592
> URL: https://issues.apache.org/jira/browse/SPARK-16592
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> With the spreading application of Apache Spark* logistic regression, we've 
> seen more and more requirements come up about improving the speed and 
> scalability. Many suggestions and discussions have been evolving in the 
> developer and user communities.  While it may be difficult to find an 
> optimization for all the cases, understanding the various scenarios and 
> approaches will be important. 
> As discussed with [~josephkb], this JIRA is created for discussion and 
> collecting efforts on the optimization work of LR (logistic regression). All 
> the ongoing related JIRA will be linked here, as well as new ideas and 
> possibilities. 
> Users are encouraged to share their experiences/expectations on LR and track 
> the development status from the community. Developers can leverage the JIRA 
> to browse existing efforts, make communication and introduce 
> research/development resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-16592) Improving ml.Logistic Regression on speed and scalability

2016-07-17 Thread yuhao yang (JIRA)

yuhao yang created SPARK-16592:
--

 Summary: Improving ml.Logistic Regression on speed and scalability
 Key: SPARK-16592
 URL: https://issues.apache.org/jira/browse/SPARK-16592
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: yuhao yang


With the spreading application of Apache Spark* logistic regression, we've seen 
more and more requirements come up about improving the speed and scalability. 
Many suggestions and discussions have been evolving in the developer and user 
communities.  While it may be difficult to find an optimization for all the 
cases, understanding the various scenarios and approaches will be important. 

As discussed with [~josephkb], this JIRA is created for discussion and 
collecting efforts on the optimization work of LR (logistic regression). All 
the ongoing related JIRA will be linked here, as well as new ideas and 
possibilities. 

Users are encouraged to share their experiences/expectations on LR and track 
the development status from the community. Developers can leverage the JIRA to 
browse existing efforts, make communication and introduce research/development 
resources.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Reopened] (SPARK-16466) names() function allows creation of column name containing "-". filter() function subsequently fails

2016-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-16466:
---

Minor: I prefer using "Fixed" when there's a resolution to point to. Here we're 
not sure what if any change resolved something here

> names() function allows creation of column name containing "-".  filter() 
> function subsequently fails
> -
>
> Key: SPARK-16466
> URL: https://issues.apache.org/jira/browse/SPARK-16466
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Databricks.com
>Reporter: Neil Dewar
>Priority: Minor
> Fix For: 1.6.2
>
>
> If I assign names to a DataFrame using the names() function, it allows the 
> introduction of "-" characters that caused the filter() function to 
> subsequently fail.  I am unclear if other special characters cause similar 
> problems.
> Example:
> sdfCar <- createDataFrame(sqlContext, mtcars)
> names(sdfCar) <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", 
> "am", "gear", "carb-count") # note: carb renamed to carb-count
> sdfCar3 <- filter(sdfCar, carb-count==4)
> Above fails with error: failure: identifier expected carb-count==4.  This 
> logic appears to be assuming that the "-" in the column name is a minus sign.
> I am unsure if the problem here is that "-" is illegal in a column name, or 
> if the filter function should be able to handle "-" in a column name, but one 
> or the other must be wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-16466) names() function allows creation of column name containing "-". filter() function subsequently fails

2016-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16466.
---
Resolution: Not A Problem

> names() function allows creation of column name containing "-".  filter() 
> function subsequently fails
> -
>
> Key: SPARK-16466
> URL: https://issues.apache.org/jira/browse/SPARK-16466
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Databricks.com
>Reporter: Neil Dewar
>Priority: Minor
> Fix For: 1.6.2
>
>
> If I assign names to a DataFrame using the names() function, it allows the 
> introduction of "-" characters that caused the filter() function to 
> subsequently fail.  I am unclear if other special characters cause similar 
> problems.
> Example:
> sdfCar <- createDataFrame(sqlContext, mtcars)
> names(sdfCar) <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", 
> "am", "gear", "carb-count") # note: carb renamed to carb-count
> sdfCar3 <- filter(sdfCar, carb-count==4)
> Above fails with error: failure: identifier expected carb-count==4.  This 
> logic appears to be assuming that the "-" in the column name is a minus sign.
> I am unsure if the problem here is that "-" is illegal in a column name, or 
> if the filter function should be able to handle "-" in a column name, but one 
> or the other must be wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12420) Have a built-in CSV data source implementation

2016-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12420:
--
Assignee: Hossein Falaki

> Have a built-in CSV data source implementation
> --
>
> Key: SPARK-12420
> URL: https://issues.apache.org/jira/browse/SPARK-12420
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Hossein Falaki
> Fix For: 2.0.0
>
> Attachments: Built-in CSV datasource in Spark.pdf
>
>
> CSV is the most common data format in the "small data" world. It is often the 
> first format people want to try when they see Spark on a single node. Making 
> this built-in for the most common source can provide a better experience for 
> first-time users.
> We should consider inlining https://github.com/databricks/spark-csv



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12420) Have a built-in CSV data source implementation

2016-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12420:
--
Assignee: Hyukjin Kwon  (was: Hossein Falaki)

> Have a built-in CSV data source implementation
> --
>
> Key: SPARK-12420
> URL: https://issues.apache.org/jira/browse/SPARK-12420
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Hyukjin Kwon
> Fix For: 2.0.0
>
> Attachments: Built-in CSV datasource in Spark.pdf
>
>
> CSV is the most common data format in the "small data" world. It is often the 
> first format people want to try when they see Spark on a single node. Making 
> this built-in for the most common source can provide a better experience for 
> first-time users.
> We should consider inlining https://github.com/databricks/spark-csv



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12436) If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType

2016-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12436:
--
Target Version/s:   (was: 2.0.0)

> If all values of a JSON field is null, JSON's inferSchema should return 
> NullType instead of StringType
> --
>
> Key: SPARK-12436
> URL: https://issues.apache.org/jira/browse/SPARK-12436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: starter
>
> Right now, JSON's inferSchema will return {{StringType}} for a field that 
> always has null values or an {{ArrayType(StringType)}}  for a field that 
> always has empty array values. Although this behavior makes writing JSON data 
> to other data sources easy (i.e. when writing data, we do not need to remove 
> those {{NullType}} or {{ArrayType(NullType)}} columns), it makes downstream 
> application hard to reason about the actual schema of the data and thus makes 
> schema merging hard. We should allow JSON's inferSchema returns {{NullType}} 
> and {{ArrayType(NullType)}}. Also, we need to make sure that when we write 
> data out, we should remove those {{NullType}} or {{ArrayType(NullType)}} 
> columns first. 
> Besides  {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same 
> thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields). 
> To finish this work, we need to finish the following sub-tasks:
> * Allow JSON's inferSchema returns {{NullType}} and {{ArrayType(NullType)}}.
> * Determine whether we need to add the operation of removing {{NullType}} and 
> {{ArrayType(NullType)}} columns from the data that will be write out for all 
> data sources (i.e. data sources based our data source API and Hive tables). 
> Or, we should just add this operation for certain data sources (e.g. 
> Parquet). For example, we may not need this operation for Hive because Hive 
> has VoidObjectInspector.
> * Implement the change and get it merged to Spark master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12373) Type coercion rule of dividing two decimal values may choose an intermediate precision that does not have enough number of digits at the left of decimal point

2016-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12373:
--
Target Version/s:   (was: 2.0.0)

> Type coercion rule of dividing two decimal values may choose an intermediate 
> precision that does not have enough number of digits at the left of decimal 
> point 
> ---
>
> Key: SPARK-12373
> URL: https://issues.apache.org/jira/browse/SPARK-12373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like the {{widerDecimalType}} at 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala#L432
>  can produce something like {{(38, 38)}} when we have have two operand types 
> {{Decimal(38, 0)}} and {{Decimal(38, 38)}}. We should take a look at if there 
> is more reasonable way to handle precision/scale.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12437) Reserved words (like table) throws error when writing a data frame to JDBC

2016-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12437:
--
Target Version/s:   (was: 2.0.0)

> Reserved words (like table) throws error when writing a data frame to JDBC
> --
>
> Key: SPARK-12437
> URL: https://issues.apache.org/jira/browse/SPARK-12437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>  Labels: starter
>
> From: A Spark user
> If you have a DataFrame column name that contains a SQL reserved word, it 
> will not write to a JDBC source.  This is somewhat similar to an error found 
> in the redshift adapter:
> https://github.com/databricks/spark-redshift/issues/80
> I have produced this on a MySQL (AWS Aurora) database
> Steps to reproduce:
> {code}
> val connectionProperties = new java.util.Properties()
> sqlContext.table("diamonds").write.jdbc(jdbcUrl, "diamonds", 
> connectionProperties)
> {code}
> Exception:
> {code}
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'table DOUBLE PRECISION , price 
> INTEGER , x DOUBLE PRECISION , y DOUBLE PRECISION' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
>   at com.mysql.jdbc.Util.getInstance(Util.java:386)
>   at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1054)
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4237)
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4169)
>   at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2617)
>   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2778)
>   at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2825)
>   at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2156)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2459)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2376)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2360)
>   at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:275)
> {code}
> You can workaround this by renaming the column on the dataframe before 
> writing, but ideally we should be able to do something like encapsulate the 
> name in quotes which is allowed.  Example: 
> {code}
> CREATE TABLE `test_table_column` (
>   `id` int(11) DEFAULT NULL,
>   `table` varchar(100) DEFAULT NULL
> ) 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16585) Update inner fields of complex types in dataframes

2016-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16585:
--
   Labels:   (was: build features)
 Priority: Minor  (was: Blocker)
Fix Version/s: (was: 1.6.0)

See https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
first. Don't set Blocker; you can't target released versions; ask questions on 
user@.

> Update inner fields of complex types in dataframes
> --
>
> Key: SPARK-16585
> URL: https://issues.apache.org/jira/browse/SPARK-16585
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 1.6.0
> Environment: spark 1.6.0
> scala 2.11
> hive 0.13
>Reporter: Naveen
>Priority: Minor
>
> Using dataframe.withColumn(,udf($colname)) for inner fields in 
> struct/complex datatype, results in a new dataframe with the a new column 
> appended to it. "colname" in the above argument is given as fullname with dot 
> notation to access the struct/complex fields. 
> For eg: hive table has columns: (id int, address struct buildname:string, stname:string>>, line2:string>) 
> I need to update the inner field 'buildname'. I can select the inner field 
> through dataframe as : df.select($"address.line1.buildname"), however when I 
> use df.withColumn("address.line1.buildname", 
> toUpperCaseUDF($"address.line1.buildname")), it is resulting in a new 
> dataframe with new column: "address.line1.buildname" appended, with 
> toUpperCaseUDF values from inner field buildname.
> How can I update the inner fields of the complex data types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16591) HadoopFsRelation will list , cache all parquet file paths

2016-07-17 Thread cen yuhai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-16591:
--
Description: 
HadoopFsRelation has a fileStatusCache which list all paths and then cache all 
filestatus no matter whether you specify partition columns  or not.  It may 
cause OOM when reading parquet table.

In HiveMetastoreCatalog file, spark will convert MetaStoreRelation to 
ParquetRelation by calling convertToParquetRelation method.
It will call metastoreRelation.getHiveQlPartitions() to request hive metastore 
service for all partitions without filters. And then pass all partition paths 
to ParquetRelation's paths member.

In FileStatusCache's refresh method, it will list all paths : "val files = 
listLeafFiles(paths)"

  was:
HadoopFsRelation has a fileStatusCache which list all paths and then cache all 
filestatus no matter whether you specify partition columns  or not.  It will 
cause OOM when reading parquet table.

In HiveMetastoreCatalog file, spark will convert MetaStoreRelation to 
ParquetRelation by calling convertToParquetRelation method.
It will call metastoreRelation.getHiveQlPartitions() to request hive metastore 
service for all partitions without filters. And then pass all partition paths 
to ParquetRelation's paths member.

In FileStatusCache's refresh method, it will list all paths : "val files = 
listLeafFiles(paths)"


> HadoopFsRelation will list , cache all parquet file paths
> -
>
> Key: SPARK-16591
> URL: https://issues.apache.org/jira/browse/SPARK-16591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: cen yuhai
>
> HadoopFsRelation has a fileStatusCache which list all paths and then cache 
> all filestatus no matter whether you specify partition columns  or not.  It 
> may cause OOM when reading parquet table.
> In HiveMetastoreCatalog file, spark will convert MetaStoreRelation to 
> ParquetRelation by calling convertToParquetRelation method.
> It will call metastoreRelation.getHiveQlPartitions() to request hive 
> metastore service for all partitions without filters. And then pass all 
> partition paths to ParquetRelation's paths member.
> In FileStatusCache's refresh method, it will list all paths : "val files = 
> listLeafFiles(paths)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-16585) Update inner fields of complex types in dataframes

2016-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16585.
---
Resolution: Invalid

> Update inner fields of complex types in dataframes
> --
>
> Key: SPARK-16585
> URL: https://issues.apache.org/jira/browse/SPARK-16585
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 1.6.0
> Environment: spark 1.6.0
> scala 2.11
> hive 0.13
>Reporter: Naveen
>Priority: Minor
>
> Using dataframe.withColumn(,udf($colname)) for inner fields in 
> struct/complex datatype, results in a new dataframe with the a new column 
> appended to it. "colname" in the above argument is given as fullname with dot 
> notation to access the struct/complex fields. 
> For eg: hive table has columns: (id int, address struct buildname:string, stname:string>>, line2:string>) 
> I need to update the inner field 'buildname'. I can select the inner field 
> through dataframe as : df.select($"address.line1.buildname"), however when I 
> use df.withColumn("address.line1.buildname", 
> toUpperCaseUDF($"address.line1.buildname")), it is resulting in a new 
> dataframe with new column: "address.line1.buildname" appended, with 
> toUpperCaseUDF values from inner field buildname.
> How can I update the inner fields of the complex data types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-16570) Not able to access table's data after ALTER TABLE RENAME in Spark 1.6.2

2016-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16570.
---
  Resolution: Duplicate
   Fix Version/s: (was: 1.6.2)
Target Version/s:   (was: 1.6.2)

> Not able to access table's data after ALTER TABLE RENAME in Spark 1.6.2
> ---
>
> Key: SPARK-16570
> URL: https://issues.apache.org/jira/browse/SPARK-16570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
> Environment: Ubuntu 14.04, hadoop 2.7.1
>Reporter: Ales Cervenka
>
> Spark 1.6.1 and 1.6.2 is not able to read data from a table after the table 
> has been renamed. This can be reproduced with the following actions in the 
> spark-shell:
> sqlContext.sql("SELECT 1 as col1, 2 as 
> col2").write.format("parquet").mode("overwrite").saveAsTable("mytesttable")
> sqlContext.sql("SELECT * FROM mytesttable").show()
> +++
> |col1|col2|
> +++
> |   1|   2|
> +++
> sqlContext.sql("ALTER TABLE mytesttable RENAME TO mytesttable_withnewname")
> sqlContext.sql("SELECT * FROM mytesttable_withnewname").show()
> +++
> |col1|col2|
> +++
> +++
> I believe the issue is related to SPARK-14920 and SPARK-15635 - Spark stores 
> (and later retrieves) a location of a table to a "path" SerDe property, which 
> is not modified in HiveMetaStore's alter_table. And as the actual directory 
> on HDFS is renamed, the "path" doesn't point to a correct location after the 
> update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16591) HadoopFsRelation will list , cache all parquet file paths

2016-07-17 Thread cen yuhai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-16591:
--
Shepherd: Cheng Lian  (was: lianwenbo)

> HadoopFsRelation will list , cache all parquet file paths
> -
>
> Key: SPARK-16591
> URL: https://issues.apache.org/jira/browse/SPARK-16591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: cen yuhai
>
> HadoopFsRelation has a fileStatusCache which list all paths and then cache 
> all filestatus no matter whether you specify partition columns  or not.  It 
> will cause OOM when reading parquet table.
> In HiveMetastoreCatalog file, spark will convert MetaStoreRelation to 
> ParquetRelation by calling convertToParquetRelation method.
> It will call metastoreRelation.getHiveQlPartitions() to request hive 
> metastore service for all partitions without filters. And then pass all 
> partition paths to ParquetRelation's paths member.
> In FileStatusCache's refresh method, it will list all paths : "val files = 
> listLeafFiles(paths)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16591) HadoopFsRelation will list , cache all parquet file paths

2016-07-17 Thread cen yuhai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-16591:
--
Shepherd: lianwenbo

> HadoopFsRelation will list , cache all parquet file paths
> -
>
> Key: SPARK-16591
> URL: https://issues.apache.org/jira/browse/SPARK-16591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: cen yuhai
>
> HadoopFsRelation has a fileStatusCache which list all paths and then cache 
> all filestatus no matter whether you specify partition columns  or not.  It 
> will cause OOM when reading parquet table.
> In HiveMetastoreCatalog file, spark will convert MetaStoreRelation to 
> ParquetRelation by calling convertToParquetRelation method.
> It will call metastoreRelation.getHiveQlPartitions() to request hive 
> metastore service for all partitions without filters. And then pass all 
> partition paths to ParquetRelation's paths member.
> In FileStatusCache's refresh method, it will list all paths : "val files = 
> listLeafFiles(paths)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-16591) HadoopFsRelation will list , cache all parquet file paths

2016-07-17 Thread cen yuhai (JIRA)

cen yuhai created SPARK-16591:
-

 Summary: HadoopFsRelation will list , cache all parquet file paths
 Key: SPARK-16591
 URL: https://issues.apache.org/jira/browse/SPARK-16591
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2, 1.6.1, 1.6.0
Reporter: cen yuhai


HadoopFsRelation has a fileStatusCache which list all paths and then cache all 
filestatus no matter whether you specify partition columns  or not.  It will 
cause OOM when reading parquet table.

In HiveMetastoreCatalog file, spark will convert MetaStoreRelation to 
ParquetRelation by calling convertToParquetRelation method.
It will call metastoreRelation.getHiveQlPartitions() to request hive metastore 
service for all partitions without filters. And then pass all partition paths 
to ParquetRelation's paths member.

In FileStatusCache's refresh method, it will list all paths : "val files = 
listLeafFiles(paths)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-07-17 Thread Rahul Palamuttam (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381168#comment-15381168
 ] 

Rahul Palamuttam edited comment on SPARK-13634 at 7/17/16 7:41 AM:
---

[~Kai Chen], thank you. I apologize for not responding sooner. This does 
resolve our issue as a workaround.
 
As a little background :
We utilize a wrapper class for the SparkContext, and while I set the 
SparkContext variable inside the class to transient it didn't resolve our issue.
Instead attaching @transient tag to an instance of the wrapper class resolved 
the issue. 
Before :
val SciSc = new SciSparkContext(sc)
After
@transient val SciSc = new SciSparkContext(sc)

We utilize the wrapper class SciSparkContext to delegate to functions like 
BinaryFiles to read file formats like netcdf while abstracting the extra 
details to actually read it in that format.

[~srowen] and [~chrismattmann] - thank you for allowing the JIRA to be 
re-opened.
I would like to resolve the issue, but first I did wanted to point out that I 
didn't see much or any documentation on this issue. 
I was looking at the quick start here : 
http://spark.apache.org/docs/latest/quick-start.html#interactive-analysis-with-the-spark-shell
(I may have just missed it else where).
The spark-repll as a mode of interacting with spark seems to be becoming more 
common - especially with notebook projects like zeppelin (which we are using).
I do think that this is worth pointing out and mentioning - even if it is 
really an issue with scala.
If we are in agreement, I would like to change this JIRA to a documentation 
JIRA and submit the patch.

I'll also respond sooner next time.





was (Author: rahul palamuttam):
Kai Chen, thank you. I apologize for not responding sooner. This does resolve 
our issue. 
As a little background :
We utilize a wrapper class for the SparkContext, and while I set the 
SparkContext variable inside the class to transient it didn't resolve our issue.
Instead attaching @transient tag to an instance of the wrapper class resolved 
the issue. 
Before :
val SciSc = new SciSparkContext(sc)
After
@transient SciSc = new SciSparkContext(sc)
We utilize the wrapper class SciSparkContext to delegate to functions like 
BinaryFiles to read file formats like netcdf while abstracting the extra 
details to actually read it in that format.

Sean Owen and Chris A. Mattmann - thank you for allowing the JIRA to be 
re-opened.
I would like to resolve the issue, but first I did wanted to point out that I 
didn't see much or any documentation on this issue. 
I was looking at the quick start here : 
http://spark.apache.org/docs/latest/quick-start.html#interactive-analysis-with-the-spark-shell
(I may have just missed it else where).
The spark-shell as a mode of interacting with spark seems to be becoming more 
common - especially with notebook projects like zeppelin (which we are using).
I do think that this is worth pointing out and mentioning - even if it is 
really an issue with scala.
If we are in agreement, I would like to change this JIRA to a documentation 
JIRA and submit the patch (I've never submitted a doc patch and it would be a 
nice experience for me).

I'll also respond sooner next time.




> Assigning spark context to variable results in serialization error
> --
>
> Key: SPARK-13634
> URL: https://issues.apache.org/jira/browse/SPARK-13634
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Rahul Palamuttam
>Priority: Minor
>
> The following lines of code cause a task serialization error when executed in 
> the spark-shell. 
> Note that the error does not occur when submitting the code as a batch job - 
> via spark-submit.
> val temp = 10
> val newSC = sc
> val new RDD = newSC.parallelize(0 to 100).map(p => p + temp)
> For some reason when temp is being pulled in to the referencing environment 
> of the closure, so is the SparkContext. 
> We originally hit this issue in the SciSpark project, when referencing a 
> string variable inside of a lambda expression in RDD.map(...)
> Any insight into how this could be resolved would be appreciated.
> While the above code is trivial, SciSpark uses a wrapper around the 
> SparkContext to read from various file formats. We want to keep this class 
> structure and also use it in notebook and shell environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-15393) Writing empty Dataframes doesn't save any _metadata files

2016-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15393.
---
  Resolution: Duplicate
Target Version/s:   (was: 2.0.0)

> Writing empty Dataframes doesn't save any _metadata files
> -
>
> Key: SPARK-15393
> URL: https://issues.apache.org/jira/browse/SPARK-15393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Priority: Critical
>
> Writing empty dataframes is broken on latest master.
> It omits the metadata and sometimes throws the following exception (when 
> saving as parquet):
> {code}
> 8-May-2016 22:37:14 WARNING: 
> org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary 
> file for file:/some/test/file
> java.lang.NullPointerException
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:220)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:144)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:234)
> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:626)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> It only saves an _SUCCESS file (which is also incorrect behaviour, because it 
> raised an exception).
> This means that loading it again will result in the following error:
> {code}
> Unable to infer schema for ParquetFormat at /some/test/file. It must be 
> specified manually;'
> {code}
> It looks like this problem was introduced in 
> https://github.com/apache/spark/pull/12855 (SPARK-10216).
> After reverting those changes I could save the empty dataframe as parquet and 
> load it again without Spark complaining or throwing any exceptions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

--

[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-07-17 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381169#comment-15381169
 ] 

Sean Owen commented on SPARK-13634:
---

Go ahead, though in general I think it's pretty implicit that you can't 
serialize context-like objects anywhere. This may in fact be just a hack, and 
you need to redesign your code so that objects that are sent around do not 
capture a context object to begin with. Your use case is not normal shell 
usage; you're writing a custom framework. You can suggest doc changes (in a 
PR); just consider what is quite specific to your usage vs what is likely 
widely applicable enough to go in the docs.

> Assigning spark context to variable results in serialization error
> --
>
> Key: SPARK-13634
> URL: https://issues.apache.org/jira/browse/SPARK-13634
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Rahul Palamuttam
>Priority: Minor
>
> The following lines of code cause a task serialization error when executed in 
> the spark-shell. 
> Note that the error does not occur when submitting the code as a batch job - 
> via spark-submit.
> val temp = 10
> val newSC = sc
> val new RDD = newSC.parallelize(0 to 100).map(p => p + temp)
> For some reason when temp is being pulled in to the referencing environment 
> of the closure, so is the SparkContext. 
> We originally hit this issue in the SciSpark project, when referencing a 
> string variable inside of a lambda expression in RDD.map(...)
> Any insight into how this could be resolved would be appreciated.
> While the above code is trivial, SciSpark uses a wrapper around the 
> SparkContext to read from various file formats. We want to keep this class 
> structure and also use it in notebook and shell environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-07-17 Thread Rahul Palamuttam (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381168#comment-15381168
 ] 

Rahul Palamuttam commented on SPARK-13634:
--

Kai Chen, thank you. I apologize for not responding sooner. This does resolve 
our issue. 
As a little background :
We utilize a wrapper class for the SparkContext, and while I set the 
SparkContext variable inside the class to transient it didn't resolve our issue.
Instead attaching @transient tag to an instance of the wrapper class resolved 
the issue. 
Before :
val SciSc = new SciSparkContext(sc)
After
@transient SciSc = new SciSparkContext(sc)
We utilize the wrapper class SciSparkContext to delegate to functions like 
BinaryFiles to read file formats like netcdf while abstracting the extra 
details to actually read it in that format.

Sean Owen and Chris A. Mattmann - thank you for allowing the JIRA to be 
re-opened.
I would like to resolve the issue, but first I did wanted to point out that I 
didn't see much or any documentation on this issue. 
I was looking at the quick start here : 
http://spark.apache.org/docs/latest/quick-start.html#interactive-analysis-with-the-spark-shell
(I may have just missed it else where).
The spark-shell as a mode of interacting with spark seems to be becoming more 
common - especially with notebook projects like zeppelin (which we are using).
I do think that this is worth pointing out and mentioning - even if it is 
really an issue with scala.
If we are in agreement, I would like to change this JIRA to a documentation 
JIRA and submit the patch (I've never submitted a doc patch and it would be a 
nice experience for me).

I'll also respond sooner next time.




> Assigning spark context to variable results in serialization error
> --
>
> Key: SPARK-13634
> URL: https://issues.apache.org/jira/browse/SPARK-13634
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Rahul Palamuttam
>Priority: Minor
>
> The following lines of code cause a task serialization error when executed in 
> the spark-shell. 
> Note that the error does not occur when submitting the code as a batch job - 
> via spark-submit.
> val temp = 10
> val newSC = sc
> val new RDD = newSC.parallelize(0 to 100).map(p => p + temp)
> For some reason when temp is being pulled in to the referencing environment 
> of the closure, so is the SparkContext. 
> We originally hit this issue in the SciSpark project, when referencing a 
> string variable inside of a lambda expression in RDD.map(...)
> Any insight into how this could be resolved would be appreciated.
> While the above code is trivial, SciSpark uses a wrapper around the 
> SparkContext to read from various file formats. We want to keep this class 
> structure and also use it in notebook and shell environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-07-17 Thread Rahul Palamuttam (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177093#comment-15177093
 ] 

Rahul Palamuttam edited comment on SPARK-13634 at 7/17/16 7:24 AM:
---

[~chrismattmann]


was (Author: rahul palamuttam):
[~chrismattmann]

> Assigning spark context to variable results in serialization error
> --
>
> Key: SPARK-13634
> URL: https://issues.apache.org/jira/browse/SPARK-13634
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Rahul Palamuttam
>Priority: Minor
>
> The following lines of code cause a task serialization error when executed in 
> the spark-shell. 
> Note that the error does not occur when submitting the code as a batch job - 
> via spark-submit.
> val temp = 10
> val newSC = sc
> val new RDD = newSC.parallelize(0 to 100).map(p => p + temp)
> For some reason when temp is being pulled in to the referencing environment 
> of the closure, so is the SparkContext. 
> We originally hit this issue in the SciSpark project, when referencing a 
> string variable inside of a lambda expression in RDD.map(...)
> Any insight into how this could be resolved would be appreciated.
> While the above code is trivial, SciSpark uses a wrapper around the 
> SparkContext to read from various file formats. We want to keep this class 
> structure and also use it in notebook and shell environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12261) pyspark crash for large dataset

2016-07-17 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381167#comment-15381167
 ] 

Sean Owen commented on SPARK-12261:
---

We need logs showing the actual error. If this is local mode, the executor 
output is in the same log, but, this snippet doesn't show anything but 'job 
failed'. If there's really nothing here then it's something to do with the 
python process exiting or crashing. Not as sure how to get that output.

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text(over 100mb) file via textFile in pyspark, when 
> i ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then i ran the same code for a small text file, this time .take() worked fine.
> How can i solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support

2016-07-17 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381164#comment-15381164
 ] 

Dongjoon Hyun commented on SPARK-16452:
---

Could you review the PR again?

> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

1 2 >

1 - 100 of 102 matches

Mail list logo