[jira] [Commented] (SPARK-18877) Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-16 Thread Navya Krishnappa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756554#comment-15756554
 ] 

Navya Krishnappa commented on SPARK-18877:
--

Thank you for replying, [~dongjoon]. Can you help me understand whether the 
above-mentioned PR will resolve the issue described below?

I have another issue with respect to the decimal scale. When I try to read the 
below-mentioned csv source file and create a parquet file from it, a 
java.lang.IllegalArgumentException: Invalid DECIMAL scale: -9 exception is 
thrown.


The source file content is:
Row(column name)
9.03E+12
1.19E+11

Refer to the code below, which reads the csv file and writes a parquet file:

// Read the csv file
Dataset<Row> dataset = getSqlContext().read()
.option(DAWBConstant.HEADER, "true")
.option(DAWBConstant.PARSER_LIB, "commons")
.option(DAWBConstant.INFER_SCHEMA, "true")
.option(DAWBConstant.DELIMITER, ",")
.option(DAWBConstant.QUOTE, "\"")
.option(DAWBConstant.ESCAPE, "
")
.option(DAWBConstant.MODE, Mode.PERMISSIVE)
.csv(sourceFile);

// create a parquet file
dataset.write().parquet("//path.parquet");


Stack trace:

Caused by: java.lang.IllegalArgumentException: Invalid DECIMAL scale: -9
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
at 
org.apache.parquet.schema.Types$PrimitiveBuilder.decimalMetadata(Types.java:410)
at 
org.apache.parquet.schema.Types$PrimitiveBuilder.build(Types.java:324)
at 
org.apache.parquet.schema.Types$PrimitiveBuilder.build(Types.java:250)
at org.apache.parquet.schema.Types$Builder.named(Types.java:228)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:412)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convert(ParquetSchemaConverter.scala:313)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:85)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetFileFormat.scala:562)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
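
As a possible workaround, here is a minimal Scala sketch (not from the original 
report; the column name, paths, and the choice of DoubleType are assumptions): 
supplying an explicit schema avoids inferring a decimal type with a negative 
scale for values such as 9.03E+12.
{code}
// Sketch of a workaround: read the column with an explicit type instead of
// relying on inferSchema, so the Parquet writer never sees DECIMAL(p, -9).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val spark = SparkSession.builder().appName("decimal-scale-workaround").getOrCreate()

// Column name and paths are placeholders mirroring the example above.
val schema = StructType(Seq(StructField("column name", DoubleType, nullable = true)))

val df = spark.read
  .option("header", "true")
  .schema(schema)          // bypass schema inference
  .csv("/path/to/source.csv")

df.write.parquet("/path/to/output.parquet")
{code}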


> Unable to read given csv data. Exception: java.lang.IllegalArgumentException: 
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When reading below mentioned csv data, even though the maximum decimal 
> precision is 38, following exception is 

[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-16 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756480#comment-15756480
 ] 

zhengruifeng commented on SPARK-18813:
--

+1 for prediction on a single instance and setting an initial model. We 
urgently need these features.
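
For context, a minimal Scala sketch (the fitted {{model}} and feature values 
are placeholders) of what single-instance prediction currently requires with 
the DataFrame-based API; the request above is essentially for a direct 
predict-on-one-Vector call without this per-row DataFrame overhead.
{code}
// Today: predicting one instance means wrapping a single feature vector in a
// one-row DataFrame and calling transform(). `model` is assumed to be an
// already-fitted spark.ml model (e.g. a LogisticRegressionModel).
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("single-instance-sketch").getOrCreate()
import spark.implicits._

val single = Seq(Tuple1(Vectors.dense(0.1, 0.2, 0.3))).toDF("features")
val prediction = model.transform(single).select("prediction").head.getDouble(0)
{code}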

> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> * Vote for & watch issues which are important to you.
> ** MLlib, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC]
> ** SparkR, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC]
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.  _These 
> meanings have been updated in this proposal for the 2.2 process._
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | 1 | next release | Blocker | *must* | *must* | *must* |
> | 2 | next release | Critical | *must* | yes, unless small | *best effort* |
> | 3 | next release | Major | *must* | optional | *best effort* |
> | 4 | next release | Minor | optional | no | maybe |
> | 5 | next release | Trivial | optional | no | maybe |
> | 6 | (empty) | (any) | yes | no | maybe |
> | 7 | (empty) | (any) | no | no | maybe |
> The *Category* in the table above has the following meaning:
> 1. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.
> 2-3. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.  The issue may slip to the 
> next release if development is slower than expected.
> 4-5. A committer has promised interest in this issue.  Contributions *will* 
> receive attention.  The issue may slip to another release.
> 6. A committer has promised interest in this issue and should respond, but no 
> promises are made about priorities or releases.
> 7. This issue is open for discussion, but it needs a committer to promise 
> interest to proceed.
> h1. Instructions
> h2. For contributors
> Getting started
> * Please read http://spark.apache.org/contributing.html carefully. Code 
> style, documentation, and unit tests are important.
> * If you are a first-time contributor, please always start with a small 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a larger feature.
> Coordinating on JIRA
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start work. This is to avoid duplicate work. For small patches, you do 
> not need to get the JIRA assigned to you to begin work.
> * For medium/large features or features with dependencies, please get 
> assigned first before coding and keep the ETA updated on the 

[jira] [Assigned] (SPARK-18910) Can't use UDF that jar file in hdfs

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18910:


Assignee: Apache Spark

> Can't use UDF that jar file in hdfs
> ---
>
> Key: SPARK-18910
> URL: https://issues.apache.org/jira/browse/SPARK-18910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Hong Shen
>Assignee: Apache Spark
>
> When I create a UDF that jar file in hdfs, I can't use the UDF. 
> {code}
> spark-sql> create function trans_array as 'com.test.udf.TransArray'  using 
> jar 
> 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';
> spark-sql> describe function trans_array;
> Function: test_db.trans_array
> Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
> Usage: N/A.
> Time taken: 0.127 seconds, Fetched 3 row(s)
> spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) 
> from test_spark limit 10;
> Error in query: Undefined function: 'trans_array'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'test_db'.; line 1 pos 7
> {code}
> The reason is when 
> org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource,
>  the uri.toURL throw exception with " failed unknown protocol: hdfs"
> {code}
>   def addJar(path: String): Unit = {
> sparkSession.sparkContext.addJar(path)
> val uri = new Path(path).toUri
> val jarURL = if (uri.getScheme == null) {
>   // `path` is a local file path without a URL scheme
>   new File(path).toURI.toURL
> } else {
>   // `path` is a URL with a scheme
>   {color:red}uri.toURL{color}
> }
> jarClassLoader.addURL(jarURL)
> Thread.currentThread().setContextClassLoader(jarClassLoader)
>   }
> {code}
> I think we should setURLStreamHandlerFactory method on URL with an instance 
> of FsUrlStreamHandlerFactory, just like:
> {code}
> static {
>   URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
> }
> {code}
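
For illustration, a minimal Scala sketch of the registration suggested above, 
assuming hadoop-common's org.apache.hadoop.fs.FsUrlStreamHandlerFactory is on 
the classpath; since the factory may be set at most once per JVM, the call is 
guarded.
{code}
// Sketch only: teach java.net.URL about hdfs:// (and other Hadoop filesystem)
// schemes by registering Hadoop's URL stream handler factory exactly once.
import java.net.URL
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

object HdfsUrlSupport {
  @volatile private var registered = false

  def register(): Unit = synchronized {
    if (!registered) {
      // URL.setURLStreamHandlerFactory may be called at most once per JVM.
      URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
      registered = true
    }
  }
}
{code}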






[jira] [Assigned] (SPARK-18910) Can't use UDF that jar file in hdfs

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18910:


Assignee: (was: Apache Spark)

> Can't use UDF that jar file in hdfs
> ---
>
> Key: SPARK-18910
> URL: https://issues.apache.org/jira/browse/SPARK-18910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Hong Shen
>
> When I create a UDF that jar file in hdfs, I can't use the UDF. 
> {code}
> spark-sql> create function trans_array as 'com.test.udf.TransArray'  using 
> jar 
> 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';
> spark-sql> describe function trans_array;
> Function: test_db.trans_array
> Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
> Usage: N/A.
> Time taken: 0.127 seconds, Fetched 3 row(s)
> spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) 
> from test_spark limit 10;
> Error in query: Undefined function: 'trans_array'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'test_db'.; line 1 pos 7
> {code}
> The reason is when 
> org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource,
>  the uri.toURL throw exception with " failed unknown protocol: hdfs"
> {code}
>   def addJar(path: String): Unit = {
> sparkSession.sparkContext.addJar(path)
> val uri = new Path(path).toUri
> val jarURL = if (uri.getScheme == null) {
>   // `path` is a local file path without a URL scheme
>   new File(path).toURI.toURL
> } else {
>   // `path` is a URL with a scheme
>   {color:red}uri.toURL{color}
> }
> jarClassLoader.addURL(jarURL)
> Thread.currentThread().setContextClassLoader(jarClassLoader)
>   }
> {code}
> I think we should setURLStreamHandlerFactory method on URL with an instance 
> of FsUrlStreamHandlerFactory, just like:
> {code}
> static {
>   URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
> }
> {code}






[jira] [Commented] (SPARK-18910) Can't use UDF that jar file in hdfs

2016-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756433#comment-15756433
 ] 

Apache Spark commented on SPARK-18910:
--

User 'shenh062326' has created a pull request for this issue:
https://github.com/apache/spark/pull/16324

> Can't use UDF that jar file in hdfs
> ---
>
> Key: SPARK-18910
> URL: https://issues.apache.org/jira/browse/SPARK-18910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Hong Shen
>
> When I create a UDF that jar file in hdfs, I can't use the UDF. 
> {code}
> spark-sql> create function trans_array as 'com.test.udf.TransArray'  using 
> jar 
> 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';
> spark-sql> describe function trans_array;
> Function: test_db.trans_array
> Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
> Usage: N/A.
> Time taken: 0.127 seconds, Fetched 3 row(s)
> spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) 
> from test_spark limit 10;
> Error in query: Undefined function: 'trans_array'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'test_db'.; line 1 pos 7
> {code}
> The reason is when 
> org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource,
>  the uri.toURL throw exception with " failed unknown protocol: hdfs"
> {code}
>   def addJar(path: String): Unit = {
> sparkSession.sparkContext.addJar(path)
> val uri = new Path(path).toUri
> val jarURL = if (uri.getScheme == null) {
>   // `path` is a local file path without a URL scheme
>   new File(path).toURI.toURL
> } else {
>   // `path` is a URL with a scheme
>   {color:red}uri.toURL{color}
> }
> jarClassLoader.addURL(jarURL)
> Thread.currentThread().setContextClassLoader(jarClassLoader)
>   }
> {code}
> I think we should setURLStreamHandlerFactory method on URL with an instance 
> of FsUrlStreamHandlerFactory, just like:
> {code}
> static {
>   URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
> }
> {code}






[jira] [Closed] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals

2016-12-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-17997.
---
Resolution: Later

> Aggregation function for counting distinct values for multiple intervals
> 
>
> Key: SPARK-17997
> URL: https://issues.apache.org/jira/browse/SPARK-17997
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> This is for computing ndv's for bins in equi-height histograms. A bin 
> consists of two endpoints which form an interval of values and the ndv in 
> that interval. For computing histogram statistics, after getting the 
> endpoints, we need an agg function to count distinct values in each interval.
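
For illustration, a rough Scala sketch (not the proposed aggregate function 
itself; {{df}}, the value column {{v}}, and the endpoints are placeholders) of 
counting distinct values per interval by bucketing first and then aggregating.
{code}
// Sketch: assign each value to a bin defined by assumed endpoints, then count
// distinct values (ndv) per bin. A dedicated aggregate would do this in one pass.
import org.apache.spark.sql.functions._

val binned = df.withColumn("bin",
  when(col("v") <= 10.0, 0)       // (-inf, 10]
    .when(col("v") <= 20.0, 1)    // (10, 20]
    .otherwise(2))                // (20, +inf)

val ndvPerBin = binned.groupBy("bin").agg(countDistinct("v").as("ndv"))
{code}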






[jira] [Updated] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals

2016-12-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17997:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-16026

> Aggregation function for counting distinct values for multiple intervals
> 
>
> Key: SPARK-17997
> URL: https://issues.apache.org/jira/browse/SPARK-17997
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> This is for computing ndv's for bins in equi-height histograms. A bin 
> consists of two endpoints which form an interval of values and the ndv in 
> that interval. For computing histogram statistics, after getting the 
> endpoints, we need an agg function to count distinct values in each interval.






[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-16 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756403#comment-15756403
 ] 

Shivaram Venkataraman commented on SPARK-18817:
---

[~felixcheung] What do you think are the downsides of disabling Hive by 
default? https://github.com/apache/spark/pull/16290 should fix the warehouse 
dir issue, and if we disable Hive by default then derby.log and metastore_db 
should not be created.
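
For reference, a minimal Scala sketch of what this amounts to on the JVM side, 
using the existing spark.sql.catalogImplementation setting (the SparkR plumbing 
itself is not shown): with the in-memory catalog no Hive client is started, so 
derby.log and metastore_db are not created.
{code}
// Sketch: build a session against the in-memory catalog instead of Hive.
// Without the Hive catalog, no embedded Derby metastore (derby.log,
// metastore_db) is created in the working directory.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("no-hive-sketch")
  .config("spark.sql.catalogImplementation", "in-memory")
  .getOrCreate()
{code}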

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.






[jira] [Assigned] (SPARK-18911) Decouple Statistics and CatalogTable

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18911:


Assignee: (was: Apache Spark)

> Decouple Statistics and CatalogTable
> 
>
> Key: SPARK-18911
> URL: https://issues.apache.org/jira/browse/SPARK-18911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Zhenhua Wang
>
> Statistics in LogicalPlan should use attributes to refer to columns rather 
> than column names, because two columns from two relations can have the same 
> column name. But CatalogTable doesn't have the concepts of attribute or 
> broadcast hint in Statistics. Therefore, putting Statistics in CatalogTable 
> is confusing. We need to define a different statistic structure in 
> CatalogTable, which is only responsible for interacting with metastore, and 
> is converted to statistics in LogicalPlan when it is used.






[jira] [Commented] (SPARK-18911) Decouple Statistics and CatalogTable

2016-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756389#comment-15756389
 ] 

Apache Spark commented on SPARK-18911:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/16323

> Decouple Statistics and CatalogTable
> 
>
> Key: SPARK-18911
> URL: https://issues.apache.org/jira/browse/SPARK-18911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Zhenhua Wang
>
> Statistics in LogicalPlan should use attributes to refer to columns rather 
> than column names, because two columns from two relations can have the same 
> column name. But CatalogTable doesn't have the concepts of attribute or 
> broadcast hint in Statistics. Therefore, putting Statistics in CatalogTable 
> is confusing. We need to define a different statistic structure in 
> CatalogTable, which is only responsible for interacting with metastore, and 
> is converted to statistics in LogicalPlan when it is used.






[jira] [Updated] (SPARK-18910) Can't use UDF that jar file in hdfs

2016-12-16 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-18910:
--
Description: 
When I create a UDF whose jar file is in HDFS, I can't use the UDF.
{code}
spark-sql> create function trans_array as 'com.test.udf.TransArray'  using jar 
'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';

spark-sql> describe function trans_array;
Function: test_db.trans_array
Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
Usage: N/A.
Time taken: 0.127 seconds, Fetched 3 row(s)

spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from 
test_spark limit 10;
Error in query: Undefined function: 'trans_array'. This function is neither a 
registered temporary function nor a permanent function registered in the 
database 'test_db'.; line 1 pos 7
{code}

The reason is that in 
org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, 
uri.toURL throws an exception: "failed unknown protocol: hdfs"
{code}
  def addJar(path: String): Unit = {
sparkSession.sparkContext.addJar(path)

val uri = new Path(path).toUri
val jarURL = if (uri.getScheme == null) {
  // `path` is a local file path without a URL scheme
  new File(path).toURI.toURL
} else {
  // `path` is a URL with a scheme
  {color:red}uri.toURL{color}
}
jarClassLoader.addURL(jarURL)
Thread.currentThread().setContextClassLoader(jarClassLoader)
  }
{code}

I think we should call the setURLStreamHandlerFactory method on URL with an 
instance of FsUrlStreamHandlerFactory, like this:
{code}
static {
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}
{code}


  was:
When I create a UDF that jar file in hdfs, I can't use the UDF. 
{code}
spark-sql> create function trans_array as 'com.test.udf.TransArray'  using jar 
'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';

spark-sql> describe function trans_array;
Function: test_db.trans_array
Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
Usage: N/A.
Time taken: 0.127 seconds, Fetched 3 row(s)

spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from 
test_spark limit 10;
Error in query: Undefined function: 'trans_array'. This function is neither a 
registered temporary function nor a permanent function registered in the 
database 'test_db'.; line 1 pos 7
{code}

The reason is when 
org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, 
the uri.toURL throw exception with " failed unknown protocol: hdfs"
{code}
  def addJar(path: String): Unit = {
sparkSession.sparkContext.addJar(path)

val uri = new Path(path).toUri
val jarURL = if (uri.getScheme == null) {
  // `path` is a local file path without a URL scheme
  new File(path).toURI.toURL
} else {
  // `path` is a URL with a scheme
  {color:red}uri.toURL{color}
}
jarClassLoader.addURL(jarURL)
Thread.currentThread().setContextClassLoader(jarClassLoader)
  }
{code}

I think we should setURLStreamHandlerFactory method on URL with an instance of 
FsUrlStreamHandlerFactory, just like:
{code}
static {
// This method can be called at most once in a given JVM.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}
{code}



> Can't use UDF that jar file in hdfs
> ---
>
> Key: SPARK-18910
> URL: https://issues.apache.org/jira/browse/SPARK-18910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Hong Shen
>
> When I create a UDF that jar file in hdfs, I can't use the UDF. 
> {code}
> spark-sql> create function trans_array as 'com.test.udf.TransArray'  using 
> jar 
> 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';
> spark-sql> describe function trans_array;
> Function: test_db.trans_array
> Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
> Usage: N/A.
> Time taken: 0.127 seconds, Fetched 3 row(s)
> spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) 
> from test_spark limit 10;
> Error in query: Undefined function: 'trans_array'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'test_db'.; line 1 pos 7
> {code}
> The reason is when 
> org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource,
>  the uri.toURL throw exception with " failed unknown protocol: hdfs"
> {code}
>   def addJar(path: String): Unit = {
> sparkSession.sparkContext.addJar(path)
> val uri = new Path(path).toUri
> val jarURL = if (uri.getScheme == null) {
>   // `path` is a local file path without a URL scheme
>   new File(path).toURI.toURL

[jira] [Updated] (SPARK-18910) Can't use UDF that jar file in hdfs

2016-12-16 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-18910:
--
Description: 
When I create a UDF whose jar file is in HDFS, I can't use the UDF.
{code}
spark-sql> create function trans_array as 'com.test.udf.TransArray'  using jar 
'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';

spark-sql> describe function trans_array;
Function: test_db.trans_array
Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
Usage: N/A.
Time taken: 0.127 seconds, Fetched 3 row(s)

spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from 
test_spark limit 10;
Error in query: Undefined function: 'trans_array'. This function is neither a 
registered temporary function nor a permanent function registered in the 
database 'test_db'.; line 1 pos 7
{code}

The reason is that in 
org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, 
uri.toURL throws an exception: "failed unknown protocol: hdfs"
{code}
  def addJar(path: String): Unit = {
sparkSession.sparkContext.addJar(path)

val uri = new Path(path).toUri
val jarURL = if (uri.getScheme == null) {
  // `path` is a local file path without a URL scheme
  new File(path).toURI.toURL
} else {
  // `path` is a URL with a scheme
  {color:red}uri.toURL{color}
}
jarClassLoader.addURL(jarURL)
Thread.currentThread().setContextClassLoader(jarClassLoader)
  }
{code}

I think we should call the setURLStreamHandlerFactory method on URL with an 
instance of FsUrlStreamHandlerFactory, like this:
{code}
static {
// This method can be called at most once in a given JVM.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}
{code}


  was:
When I create a UDF that jar file in hdfs, I can't use the UDF. 

spark-sql> create function trans_array as 'com.test.udf.TransArray'  using jar 
'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';

spark-sql> describe function trans_array;
Function: test_db.trans_array
Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
Usage: N/A.
Time taken: 0.127 seconds, Fetched 3 row(s)

spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from 
test_spark limit 10;
Error in query: Undefined function: 'trans_array'. This function is neither a 
registered temporary function nor a permanent function registered in the 
database 'test_db'.; line 1 pos 7


The reason is when 
org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, 
the uri.toURL throw exception with " failed unknown protocol: hdfs"

  def addJar(path: String): Unit = {
sparkSession.sparkContext.addJar(path)

val uri = new Path(path).toUri
val jarURL = if (uri.getScheme == null) {
  // `path` is a local file path without a URL scheme
  new File(path).toURI.toURL
} else {
  // `path` is a URL with a scheme
  {color:red}uri.toURL{color}
}
jarClassLoader.addURL(jarURL)
Thread.currentThread().setContextClassLoader(jarClassLoader)
  }


I think we should setURLStreamHandlerFactory method on URL with an instance of 
FsUrlStreamHandlerFactory, just like:

static {
// This method can be called at most once in a given JVM.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}




> Can't use UDF that jar file in hdfs
> ---
>
> Key: SPARK-18910
> URL: https://issues.apache.org/jira/browse/SPARK-18910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Hong Shen
>
> When I create a UDF that jar file in hdfs, I can't use the UDF. 
> {code}
> spark-sql> create function trans_array as 'com.test.udf.TransArray'  using 
> jar 
> 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';
> spark-sql> describe function trans_array;
> Function: test_db.trans_array
> Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
> Usage: N/A.
> Time taken: 0.127 seconds, Fetched 3 row(s)
> spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) 
> from test_spark limit 10;
> Error in query: Undefined function: 'trans_array'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'test_db'.; line 1 pos 7
> {code}
> The reason is when 
> org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource,
>  the uri.toURL throw exception with " failed unknown protocol: hdfs"
> {code}
>   def addJar(path: String): Unit = {
> sparkSession.sparkContext.addJar(path)
> val uri = new Path(path).toUri
> val jarURL = if (uri.getScheme == null) {
>   // `path` is a local file path without a URL scheme
>

[jira] [Assigned] (SPARK-18911) Decouple Statistics and CatalogTable

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18911:


Assignee: Apache Spark

> Decouple Statistics and CatalogTable
> 
>
> Key: SPARK-18911
> URL: https://issues.apache.org/jira/browse/SPARK-18911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> Statistics in LogicalPlan should use attributes to refer to columns rather 
> than column names, because two columns from two relations can have the same 
> column name. But CatalogTable doesn't have the concepts of attribute or 
> broadcast hint in Statistics. Therefore, putting Statistics in CatalogTable 
> is confusing. We need to define a different statistic structure in 
> CatalogTable, which is only responsible for interacting with metastore, and 
> is converted to statistics in LogicalPlan when it is used.






[jira] [Updated] (SPARK-18910) Can't use UDF that jar file in hdfs

2016-12-16 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-18910:
--
Description: 
When I create a UDF whose jar file is in HDFS, I can't use the UDF.

spark-sql> create function trans_array as 'com.test.udf.TransArray'  using jar 
'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';

spark-sql> describe function trans_array;
Function: test_db.trans_array
Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
Usage: N/A.
Time taken: 0.127 seconds, Fetched 3 row(s)

spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from 
test_spark limit 10;
Error in query: Undefined function: 'trans_array'. This function is neither a 
registered temporary function nor a permanent function registered in the 
database 'test_db'.; line 1 pos 7


The reason is that in 
org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, 
uri.toURL throws an exception: "failed unknown protocol: hdfs"

  def addJar(path: String): Unit = {
sparkSession.sparkContext.addJar(path)

val uri = new Path(path).toUri
val jarURL = if (uri.getScheme == null) {
  // `path` is a local file path without a URL scheme
  new File(path).toURI.toURL
} else {
  // `path` is a URL with a scheme
  {color:red}uri.toURL{color}
}
jarClassLoader.addURL(jarURL)
Thread.currentThread().setContextClassLoader(jarClassLoader)
  }


I think we should call the setURLStreamHandlerFactory method on URL with an 
instance of FsUrlStreamHandlerFactory, like this:

static {
// This method can be called at most once in a given JVM.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}



  was:
When I create a UDF that jar file in hdfs, I can't use the UDF. 

spark-sql> create function trans_array as 'com.test.udf.TransArray'  using jar 
'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';

spark-sql> describe function trans_array;
Function: test_db.trans_array
Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
Usage: N/A.
Time taken: 0.127 seconds, Fetched 3 row(s)

spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from 
test_spark limit 10;
Error in query: Undefined function: 'trans_array'. This function is neither a 
registered temporary function nor a permanent function registered in the 
database 'test_db'.; line 1 pos 7


The reason is when 
org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, 
the uri.toURL throw exception with " failed unknown protocol: hdfs"

  def addJar(path: String): Unit = {
sparkSession.sparkContext.addJar(path)

val uri = new Path(path).toUri
val jarURL = if (uri.getScheme == null) {
  // `path` is a local file path without a URL scheme
  new File(path).toURI.toURL
} else {
  // `path` is a URL with a scheme
  uri.toURL
}
jarClassLoader.addURL(jarURL)
Thread.currentThread().setContextClassLoader(jarClassLoader)
  }


I think we should setURLStreamHandlerFactory method on URL with an instance of 
FsUrlStreamHandlerFactory, just like:

static {
// This method can be called at most once in a given JVM.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}




> Can't use UDF that jar file in hdfs
> ---
>
> Key: SPARK-18910
> URL: https://issues.apache.org/jira/browse/SPARK-18910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Hong Shen
>
> When I create a UDF that jar file in hdfs, I can't use the UDF. 
> 
> spark-sql> create function trans_array as 'com.test.udf.TransArray'  using 
> jar 
> 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';
> spark-sql> describe function trans_array;
> Function: test_db.trans_array
> Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
> Usage: N/A.
> Time taken: 0.127 seconds, Fetched 3 row(s)
> spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) 
> from test_spark limit 10;
> Error in query: Undefined function: 'trans_array'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'test_db'.; line 1 pos 7
> 
> The reason is when 
> org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource,
>  the uri.toURL throw exception with " failed unknown protocol: hdfs"
> 
>   def addJar(path: String): Unit = {
> sparkSession.sparkContext.addJar(path)
> val uri = new Path(path).toUri
> val jarURL = if (uri.getScheme == null) {
>   // `path` is a local file path without a URL scheme
>   new File(path).toURI.toURL
> } else {
>   // `path` is a URL 

[jira] [Commented] (SPARK-18910) Can't use UDF that jar file in hdfs

2016-12-16 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756385#comment-15756385
 ] 

Hong Shen commented on SPARK-18910:
---

Should I add a pull request to resolve this problem?

> Can't use UDF that jar file in hdfs
> ---
>
> Key: SPARK-18910
> URL: https://issues.apache.org/jira/browse/SPARK-18910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Hong Shen
>
> When I create a UDF that jar file in hdfs, I can't use the UDF. 
> 
> spark-sql> create function trans_array as 'com.test.udf.TransArray'  using 
> jar 
> 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';
> spark-sql> describe function trans_array;
> Function: test_db.trans_array
> Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
> Usage: N/A.
> Time taken: 0.127 seconds, Fetched 3 row(s)
> spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) 
> from test_spark limit 10;
> Error in query: Undefined function: 'trans_array'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'test_db'.; line 1 pos 7
> 
> The reason is when 
> org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource,
>  the uri.toURL throw exception with " failed unknown protocol: hdfs"
> 
>   def addJar(path: String): Unit = {
> sparkSession.sparkContext.addJar(path)
> val uri = new Path(path).toUri
> val jarURL = if (uri.getScheme == null) {
>   // `path` is a local file path without a URL scheme
>   new File(path).toURI.toURL
> } else {
>   // `path` is a URL with a scheme
>   uri.toURL
> }
> jarClassLoader.addURL(jarURL)
> Thread.currentThread().setContextClassLoader(jarClassLoader)
>   }
> 
> I think we should setURLStreamHandlerFactory method on URL with an instance 
> of FsUrlStreamHandlerFactory, just like:
> 
> static {
>   // This method can be called at most once in a given JVM.
>   URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
> }
> 






[jira] [Resolved] (SPARK-18895) Fix resource-closing-related and path-related test failures in identified ones on Windows

2016-12-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-18895.
---
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/16305

> Fix resource-closing-related and path-related test failures in identified 
> ones on Windows
> -
>
> Key: SPARK-18895
> URL: https://issues.apache.org/jira/browse/SPARK-18895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> There are several tests failing due to resource-closing-related and 
> path-related  problems on Windows as below.
> - {{RPackageUtilsSuite}}:
> {code}
> - build an R package from a jar end to end *** FAILED *** (1 second, 625 
> milliseconds)
>   java.io.IOException: Unable to delete file: 
> C:\projects\spark\target\tmp\1481729427517-0\a\dep2\d\dep2-d.jar
>   at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
>   at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
> - faulty R package shows documentation *** FAILED *** (359 milliseconds)
>   java.io.IOException: Unable to delete file: 
> C:\projects\spark\target\tmp\1481729428970-0\dep1-c.jar
>   at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
>   at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
> - SparkR zipping works properly *** FAILED *** (47 milliseconds)
>   java.util.regex.PatternSyntaxException: Unknown character property name {r} 
> near index 4
> C:\projects\spark\target\tmp\1481729429282-0
> ^
>   at java.util.regex.Pattern.error(Pattern.java:1955)
>   at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2781)
> {code}
> - {{InputOutputMetricsSuite}}:
> {code}
> - input metrics for old hadoop with coalesce *** FAILED *** (240 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics with cache and coalesce *** FAILED *** (109 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics for new Hadoop API with coalesce *** FAILED *** (0 
> milliseconds)
>   java.lang.IllegalArgumentException: Wrong FS: 
> file://C:\projects\spark\target\tmp\spark-9366ec94-dac7-4a5c-a74b-3e7594a692ab\test\InputOutputMetricsSuite.txt,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)
> - input metrics when reading text file *** FAILED *** (110 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records read - simple *** FAILED *** (125 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records read - more stages *** FAILED *** (110 
> milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records - New Hadoop API *** FAILED *** (16 milliseconds)
>   java.lang.IllegalArgumentException: Wrong FS: 
> file://C:\projects\spark\target\tmp\spark-3f10a1a4-7820-4772-b821-25fd7523bf6f\test\InputOutputMetricsSuite.txt,
>  expected: file:///
>   at 

[jira] [Updated] (SPARK-18910) Can't use UDF that jar file in hdfs

2016-12-16 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-18910:
--
Description: 
When I create a UDF whose jar file is in HDFS, I can't use the UDF.

spark-sql> create function trans_array as 'com.test.udf.TransArray'  using jar 
'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';

spark-sql> describe function trans_array;
Function: test_db.trans_array
Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
Usage: N/A.
Time taken: 0.127 seconds, Fetched 3 row(s)

spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from 
test_spark limit 10;
Error in query: Undefined function: 'trans_array'. This function is neither a 
registered temporary function nor a permanent function registered in the 
database 'test_db'.; line 1 pos 7


The reason is that in 
org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, 
uri.toURL throws an exception: "failed unknown protocol: hdfs"

  def addJar(path: String): Unit = {
sparkSession.sparkContext.addJar(path)

val uri = new Path(path).toUri
val jarURL = if (uri.getScheme == null) {
  // `path` is a local file path without a URL scheme
  new File(path).toURI.toURL
} else {
  // `path` is a URL with a scheme
  uri.toURL
}
jarClassLoader.addURL(jarURL)
Thread.currentThread().setContextClassLoader(jarClassLoader)
  }


I think we should call the setURLStreamHandlerFactory method on URL with an 
instance of FsUrlStreamHandlerFactory, like this:

static {
// This method can be called at most once in a given JVM.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}



  was:
When I create a UDF that jar

spark-sql> create function trans_array as 'com.test.udf.TransArray'  using jar 
'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';

spark-sql> describe function trans_array;
Function: test_db.trans_array
Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
Usage: N/A.
Time taken: 0.127 seconds, Fetched 3 row(s)

spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from 
test_spark limit 10;
Error in query: Undefined function: 'trans_array'. This function is neither a 
registered temporary function nor a permanent function registered in the 
database 'test_db'.; line 1 pos 7




> Can't use UDF that jar file in hdfs
> ---
>
> Key: SPARK-18910
> URL: https://issues.apache.org/jira/browse/SPARK-18910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Hong Shen
>
> When I create a UDF that jar file in hdfs, I can't use the UDF. 
> 
> spark-sql> create function trans_array as 'com.test.udf.TransArray'  using 
> jar 
> 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';
> spark-sql> describe function trans_array;
> Function: test_db.trans_array
> Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
> Usage: N/A.
> Time taken: 0.127 seconds, Fetched 3 row(s)
> spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) 
> from test_spark limit 10;
> Error in query: Undefined function: 'trans_array'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'test_db'.; line 1 pos 7
> 
> The reason is when 
> org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource,
>  the uri.toURL throw exception with " failed unknown protocol: hdfs"
> 
>   def addJar(path: String): Unit = {
> sparkSession.sparkContext.addJar(path)
> val uri = new Path(path).toUri
> val jarURL = if (uri.getScheme == null) {
>   // `path` is a local file path without a URL scheme
>   new File(path).toURI.toURL
> } else {
>   // `path` is a URL with a scheme
>   uri.toURL
> }
> jarClassLoader.addURL(jarURL)
> Thread.currentThread().setContextClassLoader(jarClassLoader)
>   }
> 
> I think we should setURLStreamHandlerFactory method on URL with an instance 
> of FsUrlStreamHandlerFactory, just like:
> 
> static {
>   // This method can be called at most once in a given JVM.
>   URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
> }
> 






[jira] [Commented] (SPARK-18903) uiWebUrl is not accessible to SparkR

2016-12-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756379#comment-15756379
 ] 

Felix Cheung commented on SPARK-18903:
--

This sounds like a reasonable ask; I'll take a look.
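
For reference, the JVM-side value SparkR would presumably surface (a one-line 
Scala sketch, assuming an active {{spark}} session):
{code}
// SparkContext already exposes the web UI address; SparkR would need to
// forward this Option[String] to the R side.
val uiUrl: Option[String] = spark.sparkContext.uiWebUrl  // e.g. Some("http://driver-host:4040")
{code}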

> uiWebUrl is not accessible to SparkR
> 
>
> Key: SPARK-18903
> URL: https://issues.apache.org/jira/browse/SPARK-18903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Web UI
>Affects Versions: 2.0.2
>Reporter: Diogo Munaro Vieira
>Priority: Minor
>
> Like https://issues.apache.org/jira/browse/SPARK-17437 uiWebUrl is not 
> accessible to SparkR context






[jira] [Created] (SPARK-18911) Decouple Statistics and CatalogTable

2016-12-16 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-18911:


 Summary: Decouple Statistics and CatalogTable
 Key: SPARK-18911
 URL: https://issues.apache.org/jira/browse/SPARK-18911
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Zhenhua Wang


Statistics in LogicalPlan should use attributes to refer to columns rather than 
column names, because two columns from two relations can have the same column 
name. But CatalogTable doesn't have the concepts of attribute or broadcast hint 
in Statistics. Therefore, putting Statistics in CatalogTable is confusing. We 
need to define a different statistic structure in CatalogTable, which is only 
responsible for interacting with metastore, and is converted to statistics in 
LogicalPlan when it is used.
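
For illustration, one possible shape of such a split (a hedged Scala sketch, 
not the actual design; the class and field names are made up): a 
metastore-facing statistics holder keyed by column name, converted to 
attribute-based statistics only when the plan is resolved.
{code}
// Sketch: catalog-level statistics know only column names and plain numbers;
// they are turned into attribute-keyed plan statistics at resolution time.
import org.apache.spark.sql.catalyst.expressions.AttributeReference

case class SketchColumnStat(distinctCount: BigInt, nullCount: BigInt)

// Stored in CatalogTable; no attributes, no broadcast hint.
case class SketchCatalogStatistics(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    colStats: Map[String, SketchColumnStat] = Map.empty) {

  // Resolve column names against the relation's output attributes to produce
  // the attribute-keyed map a LogicalPlan's Statistics would use.
  def toPlanColumnStats(
      output: Seq[AttributeReference]): Map[AttributeReference, SketchColumnStat] =
    output.flatMap(a => colStats.get(a.name).map(stat => a -> stat)).toMap
}
{code}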






[jira] [Updated] (SPARK-18903) uiWebUrl is not accessible to SparkR

2016-12-16 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18903:
-
Component/s: (was: Java API)

> uiWebUrl is not accessible to SparkR
> 
>
> Key: SPARK-18903
> URL: https://issues.apache.org/jira/browse/SPARK-18903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Web UI
>Affects Versions: 2.0.2
>Reporter: Diogo Munaro Vieira
>Priority: Minor
>
> Like https://issues.apache.org/jira/browse/SPARK-17437 uiWebUrl is not 
> accessible to SparkR context






[jira] [Resolved] (SPARK-18902) Include Apache License in R source Package

2016-12-16 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18902.
--
Resolution: Not A Problem

> Include Apache License in R source Package
> --
>
> Key: SPARK-18902
> URL: https://issues.apache.org/jira/browse/SPARK-18902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Shivaram Venkataraman
>
> Per [~srowen]'s email on the dev mailing list
> {quote}
> I don't see an Apache license / notice for the Pyspark or SparkR artifacts. 
> It would be good practice to include this in a convenience binary. I'm not 
> sure if it's strictly mandatory, but something to adjust in any event. I 
> think that's all there is to do for SparkR
> {quote}






[jira] [Commented] (SPARK-18902) Include Apache License in R source Package

2016-12-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756369#comment-15756369
 ] 

Felix Cheung commented on SPARK-18902:
--

We have the license in the DESCRIPTION file, as required for an R package:
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Licensing

Closing this issue - thanks for validating!

> Include Apache License in R source Package
> --
>
> Key: SPARK-18902
> URL: https://issues.apache.org/jira/browse/SPARK-18902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Shivaram Venkataraman
>
> Per [~srowen]'s email on the dev mailing list
> {quote}
> I don't see an Apache license / notice for the Pyspark or SparkR artifacts. 
> It would be good practice to include this in a convenience binary. I'm not 
> sure if it's strictly mandatory, but something to adjust in any event. I 
> think that's all there is to do for SparkR
> {quote}






[jira] [Closed] (SPARK-18902) Include Apache License in R source Package

2016-12-16 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung closed SPARK-18902.

Assignee: Felix Cheung

> Include Apache License in R source Package
> --
>
> Key: SPARK-18902
> URL: https://issues.apache.org/jira/browse/SPARK-18902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
>
> Per [~srowen]'s email on the dev mailing list
> {quote}
> I don't see an Apache license / notice for the Pyspark or SparkR artifacts. 
> It would be good practice to include this in a convenience binary. I'm not 
> sure if it's strictly mandatory, but something to adjust in any event. I 
> think that's all there is to do for SparkR
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18910) Can't use UDF that jar file in hdfs

2016-12-16 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-18910:
--
Summary: Can't use UDF that jar file in hdfs  (was: Can't use UDF that 
source file in hdfs)

> Can't use UDF that jar file in hdfs
> ---
>
> Key: SPARK-18910
> URL: https://issues.apache.org/jira/browse/SPARK-18910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Hong Shen
>
> When I create a UDF whose jar file is in HDFS:
> 
> spark-sql> create function trans_array as 'com.test.udf.TransArray'  using 
> jar 
> 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';
> spark-sql> describe function trans_array;
> Function: test_db.trans_array
> Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
> Usage: N/A.
> Time taken: 0.127 seconds, Fetched 3 row(s)
> spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) 
> from test_spark limit 10;
> Error in query: Undefined function: 'trans_array'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'test_db'.; line 1 pos 7
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18910) Can't use UDF that source file in hdfs

2016-12-16 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-18910:
--
Description: 
When I create a UDF whose jar file is in HDFS:

spark-sql> create function trans_array as 'com.test.udf.TransArray'  using jar 
'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';

spark-sql> describe function trans_array;
Function: test_db.trans_array
Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
Usage: N/A.
Time taken: 0.127 seconds, Fetched 3 row(s)

spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from 
test_spark limit 10;
Error in query: Undefined function: 'trans_array'. This function is neither a 
registered temporary function nor a permanent function registered in the 
database 'test_db'.; line 1 pos 7
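
For reference, a minimal Scala sketch of the same reproduction through the SQL API. This is an assumption-laden sketch, not verified code: it assumes a Hive-enabled SparkSession named {{spark}}, reuses the jar path and class name quoted above as placeholders, and the delimiter escaping may need adjustment in a Scala string literal.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("udf-from-hdfs-jar")
  .enableHiveSupport()   // persistent functions need the Hive catalog
  .getOrCreate()

// Register a permanent function whose jar lives in HDFS.
spark.sql("CREATE FUNCTION trans_array AS 'com.test.udf.TransArray' USING JAR " +
  "'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'")

// DESCRIBE resolves the function, but calling it fails as reported:
// "Undefined function: 'trans_array'".
spark.sql("DESCRIBE FUNCTION trans_array").show(false)
spark.sql("SELECT trans_array(1, '\\|', id, position) FROM test_spark LIMIT 10").show()
{code}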



> Can't use UDF that source file in hdfs
> --
>
> Key: SPARK-18910
> URL: https://issues.apache.org/jira/browse/SPARK-18910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Hong Shen
>
> When I create a UDF whose jar file is in HDFS:
> 
> spark-sql> create function trans_array as 'com.test.udf.TransArray'  using 
> jar 
> 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';
> spark-sql> describe function trans_array;
> Function: test_db.trans_array
> Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
> Usage: N/A.
> Time taken: 0.127 seconds, Fetched 3 row(s)
> spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) 
> from test_spark limit 10;
> Error in query: Undefined function: 'trans_array'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'test_db'.; line 1 pos 7
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18910) Can't use UDF that source file in hdfs

2016-12-16 Thread Hong Shen (JIRA)
Hong Shen created SPARK-18910:
-

 Summary: Can't use UDF that source file in hdfs
 Key: SPARK-18910
 URL: https://issues.apache.org/jira/browse/SPARK-18910
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Hong Shen






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18909) The error message in `ExpressionEncoder.toRow` and `fromRow` is too verbose

2016-12-16 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-18909:
---

 Summary: The error message in `ExpressionEncoder.toRow` and 
`fromRow` is too verbose
 Key: SPARK-18909
 URL: https://issues.apache.org/jira/browse/SPARK-18909
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor


In `ExpressionEncoder.toRow` and `fromRow`, we catch the exception and put 
the treeString of the serializer/deserializer expressions in the error message. 
However, an encoder can be very complex, and the serializer/deserializer 
expressions can be very large trees that blow up the log files (e.g. generating 
over 500 MB of logs for this single error message).

We should simplify it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.

2016-12-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756184#comment-15756184
 ] 

Hyukjin Kwon commented on SPARK-18906:
--

Yes, we should be able to set multiple values in {{nullValue}} like R does. Just 
FYI, there is a JIRA for this: SPARK-17878.
I have a patch for SPARK-17878 and SPARK-17967 on my laptop, but I faced a 
tricky problem (related to null robustness). I will open a PR soon to discuss 
further.
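
Just for context, a minimal sketch of how {{nullValue}} is set on the CSV reader today (a single string only); it assumes a SparkSession named {{spark}}, and the file path is a placeholder. The multi-value variant discussed above does not exist yet.

{code}
// Only one nullValue string can be supplied today (e.g. "NA"); empty strings in
// numeric columns are not covered by it, which is what this issue describes.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("nullValue", "NA")
  .csv("/path/to/data.csv")   // placeholder path

df.printSchema()
df.show()
{code}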

> CSV parser should return null for empty (or with "") numeric columns.
> -
>
> Key: SPARK-18906
> URL: https://issues.apache.org/jira/browse/SPARK-18906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Kuba Tyszko
>Priority: Minor
>
> Spark allows the user to set a nullValue that indicates which value should be 
> translated to null; for example, the string "NA" could be the one.
> Data sources that use such a nullValue but also have other columns that may 
> contain empty values may not be parsed correctly.
> The change resolves that by assuming that when a column is inferred as numeric, 
> its field will be set to null when parsing fails, for example upon seeing an 
> empty value or an empty string.
> Example:
> ||char||int1||int2||
> |a|1|2|
> |a| |0|
> |NA|""|""|
> 
> This example illustrates that column "char" may contain an empty value 
> indicated as "NA", column int1 has a "true null" value, but then both int1 and 
> int2 columns have an empty string set as their values.
> In such a situation parsing will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2016-12-16 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756179#comment-15756179
 ] 

Mridul Muralidharan commented on SPARK-18886:
-


- Delay 'using up' all resources - another task/taskset which has a better 
locality preference might be available for that executor (also, see the impact on 
speculative execution).
- A delay could allow a better locality preference to become available for the 
task. A suboptimal schedule has a cascading effect on the rest of the executors, the 
application and the cluster.
- Note that not all executors are available at the same time in resourceOffer: 
you have periodic bulk reschedules, sporadic reschedules when tasks finish, 
and periodic bulk speculative schedule updates.


> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> Spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using a 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that they are shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}
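
A minimal sketch of workaround (1) above (plus, optionally, (2a)) expressed as SparkConf settings; the "1s" value is just the example figure from the description and should be tuned to your task durations:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Workaround (1): shrink the process-level wait and zero out the node/rack waits
  // so delay scheduling gives up before a typical task finishes.
  .set("spark.locality.wait.process", "1s")
  .set("spark.locality.wait.node", "0")
  .set("spark.locality.wait.rack", "0")
  // Workaround (2a): drop locality preferences for shuffle reads if they are skewed.
  .set("spark.shuffle.reduceLocality.enabled", "false")
{code}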



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18699:


Assignee: Apache Spark

> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Jakub Nowacki
>Assignee: Apache Spark
>
> If a CSV file is read and the schema contains any type other than String, an 
> exception is thrown when a string value in the CSV is malformed; e.g. if the 
> timestamp does not match the defined format, an exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException
>   at java.sql.Date.valueOf(Date.java:143)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at scala.util.Try.getOrElse(Try.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding modes PERMISSIVE and DROPMALFORMED should just null the 
> value or drop the line, but instead they kill the job.
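
For reference, a minimal sketch of the kind of read that hits this, assuming a SparkSession named {{spark}}; the schema, column names and path are placeholders. The expectation stated above is that PERMISSIVE nulls the malformed field and DROPMALFORMED drops the row, instead of failing the job:

{code}
import org.apache.spark.sql.types._

// A non-String column in the schema plus a malformed value in the file
// (e.g. a bad date string) reproduces the reported failure.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("created", DateType)))

val permissive = spark.read
  .schema(schema)
  .option("header", "true")
  .option("mode", "PERMISSIVE")      // expected: malformed field becomes null
  .csv("/path/to/data.csv")          // placeholder path

val dropMalformed = spark.read
  .schema(schema)
  .option("header", "true")
  .option("mode", "DROPMALFORMED")   // expected: malformed row is dropped
  .csv("/path/to/data.csv")
{code}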



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756175#comment-15756175
 ] 

Apache Spark commented on SPARK-18699:
--

User 'kubatyszko' has created a pull request for this issue:
https://github.com/apache/spark/pull/16319

> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Jakub Nowacki
>
> If a CSV file is read and the schema contains any type other than String, an 
> exception is thrown when a string value in the CSV is malformed; e.g. if the 
> timestamp does not match the defined format, an exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException
>   at java.sql.Date.valueOf(Date.java:143)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at scala.util.Try.getOrElse(Try.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding modes PERMISSIVE and DROPMALFORMED should just null the 
> value or drop the line, but instead they kill the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18699:


Assignee: (was: Apache Spark)

> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Jakub Nowacki
>
> If a CSV file is read and the schema contains any type other than String, an 
> exception is thrown when a string value in the CSV is malformed; e.g. if the 
> timestamp does not match the defined format, an exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException
>   at java.sql.Date.valueOf(Date.java:143)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at scala.util.Try.getOrElse(Try.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding modes PERMISSIVE and DROPMALFORMED should just null the 
> value or drop the line, but instead they kill the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-16 Thread Kuba Tyszko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756170#comment-15756170
 ] 

Kuba Tyszko commented on SPARK-18699:
-

Fixed in PR https://github.com/apache/spark/pull/16319

Enjoy

> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Jakub Nowacki
>
> If a CSV file is read and the schema contains any type other than String, an 
> exception is thrown when a string value in the CSV is malformed; e.g. if the 
> timestamp does not match the defined format, an exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException
>   at java.sql.Date.valueOf(Date.java:143)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at scala.util.Try.getOrElse(Try.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding modes PERMISSIVE and DROPMALFORMED should just null the 
> value or drop the line, but instead they kill the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-16 Thread Kuba Tyszko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756170#comment-15756170
 ] 

Kuba Tyszko edited comment on SPARK-18699 at 12/17/16 3:05 AM:
---

Fixed in PR https://github.com/apache/spark/pull/16319

(not merged into the tree as of now).

Enjoy


was (Author: kubatyszko):
Fixed in PR https://github.com/apache/spark/pull/16319

Enjoy

> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Jakub Nowacki
>
> If a CSV file is read and the schema contains any type other than String, an 
> exception is thrown when a string value in the CSV is malformed; e.g. if the 
> timestamp does not match the defined format, an exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException
>   at java.sql.Date.valueOf(Date.java:143)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at scala.util.Try.getOrElse(Try.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding modes PERMISSIVE and DROPMALFORMED should just null the 
> value or drop the line, but instead they kill the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.

2016-12-16 Thread Kuba Tyszko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuba Tyszko closed SPARK-18906.
---
Resolution: Duplicate

> CSV parser should return null for empty (or with "") numeric columns.
> -
>
> Key: SPARK-18906
> URL: https://issues.apache.org/jira/browse/SPARK-18906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Kuba Tyszko
>Priority: Minor
>
> Spark allows the user to set a nullValue that indicates which value should be 
> translated to null; for example, the string "NA" could be the one.
> Data sources that use such a nullValue but also have other columns that may 
> contain empty values may not be parsed correctly.
> The change resolves that by assuming that when a column is inferred as numeric, 
> its field will be set to null when parsing fails, for example upon seeing an 
> empty value or an empty string.
> Example:
> ||char||int1||int2||
> |a|1|2|
> |a| |0|
> |NA|""|""|
> 
> This example illustrates that column "char" may contain an empty value 
> indicated as "NA", column int1 has a "true null" value, but then both int1 and 
> int2 columns have an empty string set as their values.
> In such a situation parsing will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18827) Cann't read broadcast if broadcast blocks are stored on-disk

2016-12-16 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-18827:

Attachment: NoSuchElementException4722.gif

> Cann't read broadcast if broadcast blocks are stored on-disk
> 
>
> Key: SPARK-18827
> URL: https://issues.apache.org/jira/browse/SPARK-18827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.0.2, 2.1.0
>Reporter: Yuming Wang
> Attachments: NoSuchElementException4722.gif
>
>
> How to reproduce it:
> {code:java}
>   test("Cache broadcast to disk") {
> val conf = new SparkConf()
>   .setAppName("Cache broadcast to disk")
>   .setMaster("local")
>   .set("spark.memory.useLegacyMode", "true")
>   .set("spark.storage.memoryFraction", "0.0")
> sc = new SparkContext(conf)
> val list = List[Int](1, 2, 3, 4)
> val broadcast = sc.broadcast(list)
> assert(broadcast.value.sum === 10)
>   }
> {code}
> A {{NoSuchElementException}} has been thrown since SPARK-17503 if a broadcast cannot 
> be cached in memory. The reason is that that change does not cover the 
> {{!unrolled.hasNext}} case in the {{next()}} function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18908:


Assignee: Apache Spark  (was: Shixiong Zhu)

> It's hard for the user to see the failure if StreamExecution fails to create 
> the logical plan
> -
>
> Key: SPARK-18908
> URL: https://issues.apache.org/jira/browse/SPARK-18908
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Blocker
>
> If the logical plan fails to be created, e.g., because some Source options are 
> invalid, the user cannot detect the failure in code. The only place that receives 
> this error is the thread's UncaughtExceptionHandler.
> This bug occurs because logicalPlan is lazy, and when we try to create a 
> StreamingQueryException to wrap the exception thrown while creating logicalPlan, 
> it calls logicalPlan again.
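
To illustrate the failure mode, here is a generic, hypothetical Scala sketch (not the actual StreamExecution code): a lazy field whose initializer throws, and error-wrapping code that touches the same lazy field again, so the raw exception escapes instead of the wrapped one:

{code}
class StreamLike(badOption: Boolean) {
  lazy val logicalPlan: String = {
    if (badOption) throw new IllegalArgumentException("invalid source option")
    "plan"
  }

  def run(): Unit = {
    try {
      println(logicalPlan) // first evaluation throws
    } catch {
      case e: Throwable =>
        // Building the wrapper references logicalPlan again; a failed lazy val is
        // not cached, so the initializer re-runs and re-throws from inside the catch.
        throw new RuntimeException(s"query failed, plan = $logicalPlan", e)
    }
  }
}

new StreamLike(badOption = true).run() // the IllegalArgumentException escapes unwrapped
{code}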



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan

2016-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755992#comment-15755992
 ] 

Apache Spark commented on SPARK-18908:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16322

> It's hard for the user to see the failure if StreamExecution fails to create 
> the logical plan
> -
>
> Key: SPARK-18908
> URL: https://issues.apache.org/jira/browse/SPARK-18908
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> If the logical plan fails to be created, e.g., because some Source options are 
> invalid, the user cannot detect the failure in code. The only place that receives 
> this error is the thread's UncaughtExceptionHandler.
> This bug occurs because logicalPlan is lazy, and when we try to create a 
> StreamingQueryException to wrap the exception thrown while creating logicalPlan, 
> it calls logicalPlan again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18908:


Assignee: Shixiong Zhu  (was: Apache Spark)

> It's hard for the user to see the failure if StreamExecution fails to create 
> the logical plan
> -
>
> Key: SPARK-18908
> URL: https://issues.apache.org/jira/browse/SPARK-18908
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> If the logical plan fails to be created, e.g., because some Source options are 
> invalid, the user cannot detect the failure in code. The only place that receives 
> this error is the thread's UncaughtExceptionHandler.
> This bug occurs because logicalPlan is lazy, and when we try to create a 
> StreamingQueryException to wrap the exception thrown while creating logicalPlan, 
> it calls logicalPlan again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan

2016-12-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-18908:


Assignee: Shixiong Zhu

> It's hard for the user to see the failure if StreamExecution fails to create 
> the logical plan
> -
>
> Key: SPARK-18908
> URL: https://issues.apache.org/jira/browse/SPARK-18908
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> If the logical plan fails to be created, e.g., because some Source options are 
> invalid, the user cannot detect the failure in code. The only place that receives 
> this error is the thread's UncaughtExceptionHandler.
> This bug occurs because logicalPlan is lazy, and when we try to create a 
> StreamingQueryException to wrap the exception thrown while creating logicalPlan, 
> it calls logicalPlan again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan

2016-12-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18908:
-
Description: 
If the logical plan fails to be created, e.g., because some Source options are invalid, 
the user cannot detect the failure in code. The only place that receives this 
error is the thread's UncaughtExceptionHandler.

This bug occurs because logicalPlan is lazy, and when we try to create a 
StreamingQueryException to wrap the exception thrown while creating logicalPlan, 
it calls logicalPlan again.



  was:
If the logical plan fails to be created, e.g., because some options are invalid, the user 
cannot detect the failure in code. The only place that receives this error 
is the thread's UncaughtExceptionHandler.

This bug occurs because logicalPlan is lazy, and when we try to create a 
StreamingQueryException to wrap the exception thrown while creating logicalPlan, 
it calls logicalPlan again.




> It's hard for the user to see the failure if StreamExecution fails to create 
> the logical plan
> -
>
> Key: SPARK-18908
> URL: https://issues.apache.org/jira/browse/SPARK-18908
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>
> If the logical plan fails to be created, e.g., because some Source options are 
> invalid, the user cannot detect the failure in code. The only place that receives 
> this error is the thread's UncaughtExceptionHandler.
> This bug occurs because logicalPlan is lazy, and when we try to create a 
> StreamingQueryException to wrap the exception thrown while creating logicalPlan, 
> it calls logicalPlan again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan

2016-12-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18908:
-
Description: 
If the logical plan fails to be created, e.g., because some options are invalid, the user 
cannot detect the failure in code. The only place that receives this error 
is the thread's UncaughtExceptionHandler.

This bug occurs because logicalPlan is lazy, and when we try to create a 
StreamingQueryException to wrap the exception thrown while creating logicalPlan, 
it calls logicalPlan again.



> It's hard for the user to see the failure if StreamExecution fails to create 
> the logical plan
> -
>
> Key: SPARK-18908
> URL: https://issues.apache.org/jira/browse/SPARK-18908
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>
> If the logical plan fails to be created, e.g., because some options are invalid, the 
> user cannot detect the failure in code. The only place that receives this 
> error is the thread's UncaughtExceptionHandler.
> This bug occurs because logicalPlan is lazy, and when we try to create a 
> StreamingQueryException to wrap the exception thrown while creating logicalPlan, 
> it calls logicalPlan again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan

2016-12-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18908:
-
Priority: Blocker  (was: Major)

> It's hard for the user to see the failure if StreamExecution fails to create 
> the logical plan
> -
>
> Key: SPARK-18908
> URL: https://issues.apache.org/jira/browse/SPARK-18908
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan

2016-12-16 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18908:


 Summary: It's hard for the user to see the failure if 
StreamExecution fails to create the logical plan
 Key: SPARK-18908
 URL: https://issues.apache.org/jira/browse/SPARK-18908
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Reporter: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan

2016-12-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18908:
-
Affects Version/s: 2.1.0

> It's hard for the user to see the failure if StreamExecution fails to create 
> the logical plan
> -
>
> Key: SPARK-18908
> URL: https://issues.apache.org/jira/browse/SPARK-18908
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18907) Fix flaky test: o.a.s.sql.streaming.FileStreamSourceSuite max files per trigger - incorrect values

2016-12-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18907:
-
Priority: Minor  (was: Major)

> Fix flaky test: o.a.s.sql.streaming.FileStreamSourceSuite max files per 
> trigger - incorrect values
> --
>
> Key: SPARK-18907
> URL: https://issues.apache.org/jira/browse/SPARK-18907
> Project: Spark
>  Issue Type: Test
>Reporter: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18907) Fix flaky test: o.a.s.sql.streaming.FileStreamSourceSuite max files per trigger - incorrect values

2016-12-16 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18907:


 Summary: Fix flaky test: o.a.s.sql.streaming.FileStreamSourceSuite 
max files per trigger - incorrect values
 Key: SPARK-18907
 URL: https://issues.apache.org/jira/browse/SPARK-18907
 Project: Spark
  Issue Type: Test
Reporter: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18031) Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality

2016-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755839#comment-15755839
 ] 

Apache Spark commented on SPARK-18031:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16321

> Flaky test: 
> org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic 
> functionality
> ---
>
> Key: SPARK-18031
> URL: https://issues.apache.org/jira/browse/SPARK-18031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Shixiong Zhu
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite_name=basic+functionality



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18031) Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18031:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Flaky test: 
> org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic 
> functionality
> ---
>
> Key: SPARK-18031
> URL: https://issues.apache.org/jira/browse/SPARK-18031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite_name=basic+functionality



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18031) Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18031:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Flaky test: 
> org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic 
> functionality
> ---
>
> Key: SPARK-18031
> URL: https://issues.apache.org/jira/browse/SPARK-18031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Shixiong Zhu
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite_name=basic+functionality



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18031) Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality

2016-12-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-18031:


Assignee: Shixiong Zhu

> Flaky test: 
> org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic 
> functionality
> ---
>
> Key: SPARK-18031
> URL: https://issues.apache.org/jira/browse/SPARK-18031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Shixiong Zhu
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite_name=basic+functionality



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18884) Support Array[_] in ScalaUDF

2016-12-16 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755778#comment-15755778
 ] 

Dongjoon Hyun commented on SPARK-18884:
---

+1 for the idea!

> Support Array[_] in ScalaUDF
> 
>
> Key: SPARK-18884
> URL: https://issues.apache.org/jira/browse/SPARK-18884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> An exception is thrown if we use `Array[_]` in `ScalaUDF`:
> {code}
> scala> import org.apache.spark.sql.execution.debug._
> scala> Seq((0, 1)).toDF("a", "b").select(array($"a", 
> $"b").as("ar")).write.mode("overwrite").parquet("/Users/maropu/Desktop/data/")
> scala> val df = spark.read.load("/Users/maropu/Desktop/data/")
> scala> val df = Seq((0, 1)).toDF("a", "b").select(array($"a", $"b").as("ar"))
> scala> val testArrayUdf = udf { (ar: Array[Int]) => ar.sum }
> scala> df.select(testArrayUdf($"ar")).show
> Caused by: java.lang.ClassCastException: 
> scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
>   at $anonfun$1.apply(:23)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:89)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1069)
>   ... 99 more
> {code}
> On the other hand, the query below passes:
> {code}
> scala> val testSeqUdf = udf { (ar: Seq[Int]) => ar.sum }
> scala> df.select(testSeqUdf($"ar")).show
> +---+
> |UDF(ar)|
> +---+
> |  1|
> +---+
> {code}
> I'm not sure this behaviour is expected. The current implementation 
> queries argument types (`DataType`) by reflection 
> (`ScalaReflection.schemaFor`) in `sql.functions.udf`, and then creates type 
> converters (`CatalystTypeConverters`) from the types. `Seq[_]` and `Array[_]` 
> are both represented as `ArrayType` in `DataType`, and both types are handled by 
> using `ArrayConverter`. However, since we cannot tell the difference between the 
> two types in `DataType`, ISTM it's difficult to support the two array types 
> based on the current design. One idea (of course, it's not the best) I have 
> is to create type converters directly from `TypeTag` in `sql.functions.udf`. 
> This is a prototype 
> (https://github.com/apache/spark/compare/master...maropu:ArrayTypeUdf#diff-89643554d9757dd3e91abff1cc6096c7R740)
>  to support both array types in `ScalaUDF`. I'm not sure this is acceptable 
> and welcome any suggestions.
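
Until this is supported, a minimal sketch of the workaround implied by the passing query above: declare the UDF parameter as {{Seq}} and convert inside the body when existing {{Array}}-based code must be reused. It assumes the {{df}} from the example and {{spark.implicits._}} in scope for {{$}}.

{code}
import org.apache.spark.sql.functions.udf

// Array[Int] parameters hit the ClassCastException above; Seq[Int] works, and
// Array-based logic can still be reused by converting inside the UDF.
def sumArray(ar: Array[Int]): Int = ar.sum

val testArrayUdf = udf { (ar: Seq[Int]) => sumArray(ar.toArray) }

df.select(testArrayUdf($"ar")).show()
{code}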



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18255) SQLContext.getOrCreate always returns a SQLContext even if a user originally created a HiveContext

2016-12-16 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755762#comment-15755762
 ] 

Dongjoon Hyun commented on SPARK-18255:
---

Hi, [~yhuai].
I have a question about HiveContext. HiveContext is deprecated in 2.0.0, and 
Apache Spark is now 2.1.0.
I guess HiveContext will be removed in 2017. Then, the code related to this issue 
will be removed together at that time.
Am I correct?

> SQLContext.getOrCreate always returns a SQLContext even if a user originally 
> created a HiveContext
> --
>
> Key: SPARK-18255
> URL: https://issues.apache.org/jira/browse/SPARK-18255
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> It is possible that a user creates HiveContext at beginning. However, 
> SQLContext.getOrCreate will always returns a SQLContext instead of returning 
> a HiveContext. This behavior change may break user code if 
> {{SQLContext.getOrCreate().asInstanceOf[HiveContext]}} is used. 
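
As a side note, a minimal sketch of the Spark 2.x pattern that avoids the {{SQLContext.getOrCreate().asInstanceOf[HiveContext]}} downcast altogether; this is a suggested migration path under the deprecation noted in the comment above, not a fix for the getOrCreate behaviour itself:

{code}
import org.apache.spark.sql.SparkSession

// A Hive-enabled SparkSession supersedes HiveContext in 2.x; callers share the
// session (or call SparkSession.builder().getOrCreate()) instead of downcasting.
val spark = SparkSession.builder()
  .appName("hive-enabled-session")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW TABLES").show()
{code}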



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18904) Merge two FileStreamSourceSuite files

2016-12-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18904.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> Merge two FileStreamSourceSuite files
> -
>
> Key: SPARK-18904
> URL: https://issues.apache.org/jira/browse/SPARK-18904
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> There are two FileStreamSourceSuite files and it's confusing. We should just 
> merge them into one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2016-12-16 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755742#comment-15755742
 ] 

Nicholas Chammas commented on SPARK-18492:
--

I'm hitting this problem as well when I try to apply a bunch of nested SQL 
functions to the fields in a struct column. That combination quickly blows up 
the size of the generated code, since it repeats the nested functions for each 
struct field. The strange thing is that despite spitting out an error and all 
this generated Java code, Spark continues along and the program completes 
successfully.

Here's a minimal repro that's very similar to what I'm actually doing in my 
application:

{code}
from collections import namedtuple

import pyspark
from pyspark.sql import Column
from pyspark.sql.functions import (
struct,
regexp_replace,
lower,
trim,
col,
coalesce,
lit,
)


Person = namedtuple(
'Person', [
'first_name',
'last_name',
'address_1',
'address_2',
'city',
'state',
])


def normalize_udf(column: Column) -> Column:
normalized_column = column
normalized_column = coalesce(normalized_column, lit(''))
normalized_column = (
regexp_replace(
normalized_column,
pattern=r'[^\p{IsLatin}\d\s]+',
replacement=' ',
)
)
normalized_column = (
regexp_replace(
normalized_column,
pattern=r'[\s]+',
replacement=' ',
)
)
normalized_column = lower(trim(normalized_column))
return normalized_column


if __name__ == '__main__':
spark = pyspark.sql.SparkSession.builder.getOrCreate()
raw_df = spark.createDataFrame(
[
(1, Person('  Nick', ' Chammas  ', '22 Two Drive ', '', '', '')),
(2, Person('BOB', None, '', '', None, None)),
(3, Person('Guido ', 'van  Rossum', '   ', ' ', None, None)),
],
['id', 'person'],
)
normalized_df = (
raw_df
.select(
'id',
struct([
normalize_udf('person.' + field).alias(field)
for field in Person._fields
]).alias('person'))
# Uncomment this persist() and the codegen error goes away.
# However, one of our tests that exercises this code starts
# failing in a strange way.
# .persist()
# The normalize_udf() calls below are repeated to trigger the
# error. In a more realistic scenario, of course, you would have
# other chained function calls.
.select(
'id',
struct([
normalize_udf('person.' + field).alias(field)
for field in Person._fields
]).alias('person'))
.select(
'id',
struct([
normalize_udf('person.' + field).alias(field)
for field in Person._fields
]).alias('person'))
)
normalized_df.show(truncate=False)
{code}

I suppose the workarounds for this type of problem are:
* Play code golf and try to compress what you're doing into fewer function 
calls to stay under the 64KB limit.
** Disadvantage: It's hard to do, and it leaves the code ugly and difficult to 
maintain.
* Use {{persist()}} in strategic locations to force Spark to break up the 
generated code into smaller chunks.
** Disadvantage: It's difficult to track and unpersist these intermediate RDDs 
that get persisted.
** Disadvantage: One of my tests mysteriously fails when I implement this 
approach. (The failing test was a join that started to fail because the join 
key types didn't match. Really weird.)
* Rewrite parts of the code to use a non-Spark UDF implementation (i.e. in my 
case, pure Python code).
** Disadvantage: You lose the advantages of codegen and gain the overhead of 
running stuff in pure Python.

I'm seeing this on 2.0.2 and on master at 
{{1ac6567bdb03d7cc5c5f3473827a102280cb1030}} which is from 2 days ago.

[~marmbrus] / [~davies]: What are your thoughts on this?
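To make the {{persist()}} workaround concrete, here is a minimal sketch of the 
pattern (toy data and column names; a real chain would need to be far longer to 
actually approach the 64 KB limit):

{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, trim

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("  Nick ",), ("BOB  ",)], ["raw"])   # toy input

# First half of the transformation chain.
stage1 = df.select(trim(df["raw"]).alias("raw"))

# Caching the intermediate DataFrame means the downstream select is planned
# against the cached relation rather than one giant expression tree, which is
# what appears to keep the generated method small in the repro above.
stage1 = stage1.persist()

# Second half of the chain.
stage2 = stage1.select(lower(stage1["raw"]).alias("raw"))
stage2.show()

stage1.unpersist()   # release the intermediate data when done
{code}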

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private 

[jira] [Assigned] (SPARK-18877) Unable to read given csv data. Excepion: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18877:


Assignee: Apache Spark

> Unable to read given csv data. Excepion: java.lang.IllegalArgumentException: 
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>Assignee: Apache Spark
>
> When reading below mentioned csv data, even though the maximum decimal 
> precision is 38, following exception is thrown 
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
> exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18877) Unable to read given csv data. Excepion: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18877:


Assignee: (was: Apache Spark)

> Unable to read given csv data. Excepion: java.lang.IllegalArgumentException: 
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When reading below mentioned csv data, even though the maximum decimal 
> precision is 38, following exception is thrown 
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
> exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18877) Unable to read given csv data. Excepion: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755691#comment-15755691
 ] 

Apache Spark commented on SPARK-18877:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16320

> Unable to read given csv data. Excepion: java.lang.IllegalArgumentException: 
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When reading below mentioned csv data, even though the maximum decimal 
> precision is 38, following exception is thrown 
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
> exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18877) Unable to read given csv data. Excepion: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-16 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18877:
--
Component/s: SQL

> Unable to read given csv data. Excepion: java.lang.IllegalArgumentException: 
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When reading below mentioned csv data, even though the maximum decimal 
> precision is 38, following exception is thrown 
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
> exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.

2016-12-16 Thread Kuba Tyszko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755597#comment-15755597
 ] 

Kuba Tyszko edited comment on SPARK-18906 at 12/16/16 9:53 PM:
---

Well, in CSV a null can be either an empty field or, as in this case, a dedicated 
value (NA), but some data providers also use an empty string to indicate a missing 
value.

I've looked through JIRA and there were a few requests to allow multiple nullValue 
settings, but that seems to be a challenging task.

The patch I'm proposing here enables handling of such "empty integers" in a 
predictable way.

I understand this may look unclean, but unfortunately some reputable data 
providers do this, and there is nothing we can do to stop them.
For example, Excel can be set to always quote columns when exporting to CSV, and 
that can be limited to text columns only, but I don't think we can assume that 
users won't put numbers in a text column.

We're dealing with a completely untyped data source, so it's better to be robust.



was (Author: kubatyszko):
Well, in csv null can either be an empty field or in this case a dedicated 
value (NA), but some data providers use empty string to indicate an empty value 
as well.

I've looked at JIRA and there were a few requests to allow multiple nullValue 
settings - but that seems to be a challenging task.

The patch I'm proposing here enables handing of such "empty integers" in a 
predictable way.

I understand this may look unclean, but unfortunately some reputable data 
providers do that... - there is nothing we can do to stop them...

> CSV parser should return null for empty (or with "") numeric columns.
> -
>
> Key: SPARK-18906
> URL: https://issues.apache.org/jira/browse/SPARK-18906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Kuba Tyszko
>Priority: Minor
>
> Spark allows user to set a nullValue that will indicate certain value's 
> translation to a null type , for example string "NA" could be the one.
> Data sources that use such nullValue but also have other columns that may 
> contain empty values may not be parsed correctly.
> The change resolves that by assuming that:
> when column is infered as numeric
> its field will be set to null when parsing fails, for example upon seeing 
> empty value or an empty string.
> Example:
> ---
> |char|int1|int2|
> ---
> |a|1|2|
> ---
> |a|  |0|
> ---
> |NA|""|""|
> 
> This example illustrates that column "char" may contain an empty value 
> indicated as "NA", column int1 has a "true null" value but then both int1 and 
> int2 columns have an empty string set as their values.
> In such situation parsing will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.

2016-12-16 Thread Kuba Tyszko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuba Tyszko updated SPARK-18906:

Description: 
Spark allows the user to set a nullValue that indicates which value should be 
translated to a null; for example, the string "NA" could be the one.
Data sources that use such a nullValue but also have other columns that may 
contain empty values may not be parsed correctly.
The change resolves that by assuming that, when a column is inferred as numeric, 
its field will be set to null when parsing fails, for example upon seeing an 
empty value or an empty string.

Example:

|char|int1|int2|
|a   |1   |2   |
|a   |    |0   |
|NA  |""  |""  |

This example illustrates that column "char" may contain an empty value 
indicated as "NA", that column int1 has a "true null" value, and that both int1 
and int2 then have an empty string set as their values.
In such a situation parsing will fail.




  was:
Spark allows user to set a nullValue that will indicate certain value's 
translation to a null type , for example string "NA" could be the one.
Data sources that use such nullValue but also have other columns that may 
contain empty values may not be parsed correctly.
The change resolves that by assuming that:
when column is infered as numeric
its field will be set to null when parsing fails, for example upon seeing empty 
value or an empty string.

Example:

---
|char|int1|int2|
---
|a|1|2|
---
|a||0|
---
|NA|""|""|


This example illustrates that column "char" may contain an empty value 
indicated as "NA", column int1 has a "true null" value but then both int1 and 
int2 columns have an empty string set as their values.
In such situation parsing will fail.





> CSV parser should return null for empty (or with "") numeric columns.
> -
>
> Key: SPARK-18906
> URL: https://issues.apache.org/jira/browse/SPARK-18906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Kuba Tyszko
>Priority: Minor
>
> Spark allows user to set a nullValue that will indicate certain value's 
> translation to a null type , for example string "NA" could be the one.
> Data sources that use such nullValue but also have other columns that may 
> contain empty values may not be parsed correctly.
> The change resolves that by assuming that:
> when column is infered as numeric
> its field will be set to null when parsing fails, for example upon seeing 
> empty value or an empty string.
> Example:
> ---
> |char|int1|int2|
> ---
> |a|1|2|
> ---
> |a|  |0|
> ---
> |NA|""|""|
> 
> This example illustrates that column "char" may contain an empty value 
> indicated as "NA", column int1 has a "true null" value but then both int1 and 
> int2 columns have an empty string set as their values.
> In such situation parsing will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18901) Require in LR LogisticAggregator is redundant

2016-12-16 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-18901:
---
Component/s: ML

> Require in LR LogisticAggregator is redundant
> -
>
> Key: SPARK-18901
> URL: https://issues.apache.org/jira/browse/SPARK-18901
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> require(numFeatures == features.size, )
> require(weight >= 0.0, ...)
> in LogisticAggregator (add and merge) will never be triggered as the 
> dimension and weight has been checked in MultivariateOnlineSummarizer. 
> Given the frequent usage of function add, the redundant require should be 
> removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18906:


Assignee: Apache Spark

> CSV parser should return null for empty (or with "") numeric columns.
> -
>
> Key: SPARK-18906
> URL: https://issues.apache.org/jira/browse/SPARK-18906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Kuba Tyszko
>Assignee: Apache Spark
>Priority: Minor
>
> Spark allows user to set a nullValue that will indicate certain value's 
> translation to a null type , for example string "NA" could be the one.
> Data sources that use such nullValue but also have other columns that may 
> contain empty values may not be parsed correctly.
> The change resolves that by assuming that:
> when column is infered as numeric
> its field will be set to null when parsing fails, for example upon seeing 
> empty value or an empty string.
> Example:
> ---
> |char|int1|int2|
> ---
> |a|1|2|
> ---
> |a||0|
> ---
> |NA|""|""|
> 
> This example illustrates that column "char" may contain an empty value 
> indicated as "NA", column int1 has a "true null" value but then both int1 and 
> int2 columns have an empty string set as their values.
> In such situation parsing will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.

2016-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755604#comment-15755604
 ] 

Apache Spark commented on SPARK-18906:
--

User 'kubatyszko' has created a pull request for this issue:
https://github.com/apache/spark/pull/16319

> CSV parser should return null for empty (or with "") numeric columns.
> -
>
> Key: SPARK-18906
> URL: https://issues.apache.org/jira/browse/SPARK-18906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Kuba Tyszko
>Priority: Minor
>
> Spark allows user to set a nullValue that will indicate certain value's 
> translation to a null type , for example string "NA" could be the one.
> Data sources that use such nullValue but also have other columns that may 
> contain empty values may not be parsed correctly.
> The change resolves that by assuming that:
> when column is infered as numeric
> its field will be set to null when parsing fails, for example upon seeing 
> empty value or an empty string.
> Example:
> ---
> |char|int1|int2|
> ---
> |a|1|2|
> ---
> |a||0|
> ---
> |NA|""|""|
> 
> This example illustrates that column "char" may contain an empty value 
> indicated as "NA", column int1 has a "true null" value but then both int1 and 
> int2 columns have an empty string set as their values.
> In such situation parsing will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18906:


Assignee: (was: Apache Spark)

> CSV parser should return null for empty (or with "") numeric columns.
> -
>
> Key: SPARK-18906
> URL: https://issues.apache.org/jira/browse/SPARK-18906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Kuba Tyszko
>Priority: Minor
>
> Spark allows user to set a nullValue that will indicate certain value's 
> translation to a null type , for example string "NA" could be the one.
> Data sources that use such nullValue but also have other columns that may 
> contain empty values may not be parsed correctly.
> The change resolves that by assuming that:
> when column is infered as numeric
> its field will be set to null when parsing fails, for example upon seeing 
> empty value or an empty string.
> Example:
> ---
> |char|int1|int2|
> ---
> |a|1|2|
> ---
> |a||0|
> ---
> |NA|""|""|
> 
> This example illustrates that column "char" may contain an empty value 
> indicated as "NA", column int1 has a "true null" value but then both int1 and 
> int2 columns have an empty string set as their values.
> In such situation parsing will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.

2016-12-16 Thread Kuba Tyszko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755597#comment-15755597
 ] 

Kuba Tyszko commented on SPARK-18906:
-

Well, in CSV a null can be either an empty field or, as in this case, a dedicated 
value (NA), but some data providers also use an empty string to indicate a missing 
value.

I've looked through JIRA and there were a few requests to allow multiple nullValue 
settings, but that seems to be a challenging task.

The patch I'm proposing here enables handling of such "empty integers" in a 
predictable way.

I understand this may look unclean, but unfortunately some reputable data 
providers do this, and there is nothing we can do to stop them.
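To illustrate the case the patch targets, here is a minimal, hypothetical sketch 
(the file contents and path are made up) of a source that declares "NA" as its 
nullValue while numeric columns also show up empty or as "":

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input file /tmp/sample.csv:
#   char,int1,int2
#   a,1,2
#   a,,0
#   NA,"",""
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("nullValue", "NA")   # only the single "NA" token is treated as null
      .csv("/tmp/sample.csv"))

# With the proposed behavior, an empty field or "" in a column inferred as
# numeric would come back as null instead of making the parse fail.
df.show()
{code}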

> CSV parser should return null for empty (or with "") numeric columns.
> -
>
> Key: SPARK-18906
> URL: https://issues.apache.org/jira/browse/SPARK-18906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Kuba Tyszko
>Priority: Minor
>
> Spark allows user to set a nullValue that will indicate certain value's 
> translation to a null type , for example string "NA" could be the one.
> Data sources that use such nullValue but also have other columns that may 
> contain empty values may not be parsed correctly.
> The change resolves that by assuming that:
> when column is infered as numeric
> its field will be set to null when parsing fails, for example upon seeing 
> empty value or an empty string.
> Example:
> ---
> |char|int1|int2|
> ---
> |a|1|2|
> ---
> |a||0|
> ---
> |NA|""|""|
> 
> This example illustrates that column "char" may contain an empty value 
> indicated as "NA", column int1 has a "true null" value but then both int1 and 
> int2 columns have an empty string set as their values.
> In such situation parsing will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.

2016-12-16 Thread Kuba Tyszko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuba Tyszko updated SPARK-18906:

Description: 
Spark allows user to set a nullValue that will indicate certain value's 
translation to a null type , for example string "NA" could be the one.
Data sources that use such nullValue but also have other columns that may 
contain empty values may not be parsed correctly.
The change resolves that by assuming that:
when column is infered as numeric
its field will be set to null when parsing fails, for example upon seeing empty 
value or an empty string.

Example:

---
|char|int1|int2|
---
|a|1|2|
---
|a||0|
---
|NA|""|""|


This example illustrates that column "char" may contain an empty value 
indicated as "NA", column int1 has a "true null" value but then both int1 and 
int2 columns have an empty string set as their values.
In such situation parsing will fail.




  was:
Spark allows user to set a nullValue that will indicate certain value's 
translation to a null type , for example string "NA" could be the one.
Data sources that use such nullValue but also have other columns that may 
contain empty values may not be parsed correctly.
The change resolves that by assuming that:
when column is infered as numeric
its field will be set to null when parsing fails, for example upon seeing empty 
value or an empty string.

Example:

---
|char|int1|int2
---
|a|1|2|
---
|a||0
---
|NA|""|""


This example illustrates that column "char" may contain an empty value 
indicated as "NA", column int1 has a "true null" value but then both int1 and 
int2 columns have an empty string set as their values.
In such situation parsing will fail.





> CSV parser should return null for empty (or with "") numeric columns.
> -
>
> Key: SPARK-18906
> URL: https://issues.apache.org/jira/browse/SPARK-18906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Kuba Tyszko
>Priority: Minor
>
> Spark allows user to set a nullValue that will indicate certain value's 
> translation to a null type , for example string "NA" could be the one.
> Data sources that use such nullValue but also have other columns that may 
> contain empty values may not be parsed correctly.
> The change resolves that by assuming that:
> when column is infered as numeric
> its field will be set to null when parsing fails, for example upon seeing 
> empty value or an empty string.
> Example:
> ---
> |char|int1|int2|
> ---
> |a|1|2|
> ---
> |a||0|
> ---
> |NA|""|""|
> 
> This example illustrates that column "char" may contain an empty value 
> indicated as "NA", column int1 has a "true null" value but then both int1 and 
> int2 columns have an empty string set as their values.
> In such situation parsing will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.

2016-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755581#comment-15755581
 ] 

Sean Owen commented on SPARK-18906:
---

It sounds like you're saying your data has many types of "null" -- is that even 
valid?

> CSV parser should return null for empty (or with "") numeric columns.
> -
>
> Key: SPARK-18906
> URL: https://issues.apache.org/jira/browse/SPARK-18906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Kuba Tyszko
>Priority: Minor
>
> Spark allows user to set a nullValue that will indicate certain value's 
> translation to a null type , for example string "NA" could be the one.
> Data sources that use such nullValue but also have other columns that may 
> contain empty values may not be parsed correctly.
> The change resolves that by assuming that:
> when column is infered as numeric
> its field will be set to null when parsing fails, for example upon seeing 
> empty value or an empty string.
> Example:
> ---
> |char|int1|int2
> ---
> |a|1|2|
> ---
> |a||0
> ---
> |NA|""|""
> 
> This example illustrates that column "char" may contain an empty value 
> indicated as "NA", column int1 has a "true null" value but then both int1 and 
> int2 columns have an empty string set as their values.
> In such situation parsing will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18877) Unable to read given csv data. Excepion: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-16 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755579#comment-15755579
 ] 

Dongjoon Hyun commented on SPARK-18877:
---

Yes. I reproduced the bug and found the root cause in the CSV schema inference.
I'll make a PR for this.
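For reference, a minimal sketch of this kind of repro (the path is made up; 
header/inferSchema are the standard DataFrameReader CSV options):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical CSV file /tmp/decimal-sample.csv holding the values from this
# issue:
#   Decimal
#   2323366225312000
#   2433573971400
#   23233662253000
#   23233662253
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")   # inference is where the DecimalType comes from
      .csv("/tmp/decimal-sample.csv"))

df.printSchema()
df.show()   # on affected versions this read fails with the precision error from this ticket
{code}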

> Unable to read given csv data. Excepion: java.lang.IllegalArgumentException: 
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When reading below mentioned csv data, even though the maximum decimal 
> precision is 38, following exception is thrown 
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
> exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.

2016-12-16 Thread Kuba Tyszko (JIRA)
Kuba Tyszko created SPARK-18906:
---

 Summary: CSV parser should return null for empty (or with "") 
numeric columns.
 Key: SPARK-18906
 URL: https://issues.apache.org/jira/browse/SPARK-18906
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Kuba Tyszko
Priority: Minor


Spark allows user to set a nullValue that will indicate certain value's 
translation to a null type , for example string "NA" could be the one.
Data sources that use such nullValue but also have other columns that may 
contain empty values may not be parsed correctly.
The change resolves that by assuming that:
when column is infered as numeric
its field will be set to null when parsing fails, for example upon seeing empty 
value or an empty string.

Example:

---
|char|int1|int2
---
|a|1|2|
---
|a||0
---
|NA|""|""


This example illustrates that column "char" may contain an empty value 
indicated as "NA", column int1 has a "true null" value but then both int1 and 
int2 columns have an empty string set as their values.
In such situation parsing will fail.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2016-12-16 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755560#comment-15755560
 ] 

Ruslan Dautkhanov commented on SPARK-12837:
---

Yep, we continue to see this issue in Spark 2.x.
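The limit being hit in this report is spark.driver.maxResultSize; a minimal 
PySpark sketch of inspecting and raising it (the value is illustrative, and 
widening the limit only treats the symptom rather than the oversized serialized 
results themselves):

{code}
from pyspark.sql import SparkSession

# spark.driver.maxResultSize must be set before the SparkContext starts;
# the default is 1g, and 0 disables the check entirely.
spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "2g")
         .getOrCreate())

print(spark.sparkContext.getConf().get("spark.driver.maxResultSize"))
{code}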

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory even when there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18898) Exception not failing Scala applications (in yarn)

2016-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755475#comment-15755475
 ] 

Sean Owen commented on SPARK-18898:
---

Probably the same as SPARK-15955

> Exception not failing Scala applications (in yarn)
> --
>
> Key: SPARK-18898
> URL: https://issues.apache.org/jira/browse/SPARK-18898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.2
>Reporter: Raphael
>
> I am submitting my Scala applications with SparkLauncher, which goes through 
> spark-submit.
> When I throw an error in my Spark job, the final status of the job in YARN 
> is FINISHED. Looking at the source code of SparkSubmit:
> {code:title=SparkSubmit.scala|borderStyle=solid}
> ...
> try {
>   mainMethod.invoke(null, childArgs.toArray)
> } catch {
>   case t: Throwable =>
> findCause(t) match {
>   case SparkUserAppException(exitCode) =>
> System.exit(exitCode)
>   case t: Throwable =>
> throw t
> }
> }
> ...
> {code}
>  
> It seems we would have to throw SparkUserAppException from our code, but this 
> exception is a private case class defined alongside the SparkException class.
> Please make this case class available or give us a way to raise errors 
> from inside our applications.
> More details at: 
> http://stackoverflow.com/questions/41184158/how-to-throw-an-exception-in-spark
> In the past, the same issue with pyspark was opened here:
> https://issues.apache.org/jira/browse/SPARK-7736
> And resolved here:
> https://github.com/apache/spark/pull/8258
> Best Regards 
> Raphael.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17769) Some FetchFailure refactoring in the DAGScheduler

2016-12-16 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17769.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Some FetchFailure refactoring in the DAGScheduler
> -
>
> Key: SPARK-17769
> URL: https://issues.apache.org/jira/browse/SPARK-17769
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Minor
> Fix For: 2.2.0
>
>
> SPARK-17644 opened up a discussion about further refactoring of the 
> DAGScheduler's handling of FetchFailure events.  These include:
> * rewriting code and comments to improve readability
> * doing fetchFailedAttemptIds.add(stageAttemptId) even when 
> disallowStageRetryForTest is true
> * issuing a ResubmitFailedStages event based on whether one is already 
> enqueued for the current failed stage, not any prior failed stage
> * logging the resubmission of all failed stages 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2016-12-16 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-18905:

Description: 
The current implementation of Spark Streaming considers a batch completed 
regardless of the results of its jobs 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)

Let's consider the following case:

A micro batch contains 2 jobs, and they read from two different Kafka topics 
respectively. One of these jobs fails due to some problem in the user-defined 
logic, after the other one has finished successfully.

1. The main thread in the Spark Streaming application will execute the line 
mentioned above,

2. and another thread (the checkpoint writer) will write a checkpoint file 
immediately after this line is executed.

3. Then, due to the current error handling mechanism in Spark Streaming, the 
StreamingContext will be closed 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)

When the user recovers from the checkpoint file, the JobSet containing the 
failed job has already been removed (taken as completed) before the checkpoint 
was constructed, so the data being processed by the failed job would never be 
reprocessed, right?

I might have missed something in the checkpoint thread or in 
handleJobCompletion(), or this is a potential bug.

  was:
the current implementation of Spark streaming considers a batch is completed no 
matter the results of the jobs 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)

Let's consider the following case:

A micro batch contains 2 jobs and they read from two different kafka topics 
respectively. One of these jobs is failed due to some problem in the user 
defined logic. 

1. The main thread in the Spark streaming application will execute the line 
mentioned above, 

2. and another thread (checkpoint writer) will make a checkpoint file 
immediately after this line is executed. 

3. Then due to the current error handling mechanism in Spark Streaming, 
StreamingContext will be closed 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)

the user recovers from the checkpoint file, and because the JobSet containing 
the failed job has been removed (taken as completed) before the checkpoint is 
constructed, the data being processed by the failed job would never be 
reprocessed?


I might have missed something in the checkpoint thread or this 
handleJobCompletion()or it is a potential bug 


> Potential Issue of Semantics of BatchCompleted
> --
>
> Key: SPARK-18905
> URL: https://issues.apache.org/jira/browse/SPARK-18905
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Nan Zhu
>
> the current implementation of Spark streaming considers a batch is completed 
> no matter the results of the jobs 
> (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)
> Let's consider the following case:
> A micro batch contains 2 jobs and they read from two different kafka topics 
> respectively. One of these jobs is failed due to some problem in the user 
> defined logic, after the other one is finished successfully. 
> 1. The main thread in the Spark streaming application will execute the line 
> mentioned above, 
> 2. and another thread (checkpoint writer) will make a checkpoint file 
> immediately after this line is executed. 
> 3. Then due to the current error handling mechanism in Spark Streaming, 
> StreamingContext will be closed 
> (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)
> the user recovers from the checkpoint file, and because the JobSet containing 
> the failed job has been removed (taken as completed) before the checkpoint is 
> constructed, the data being processed by the failed job would never be 
> reprocessed?
> I might have missed something in the checkpoint thread or this 
> handleJobCompletion()or it is a potential bug 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2016-12-16 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-18905:

Description: 
the current implementation of Spark streaming considers a batch is completed no 
matter the results of the jobs 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)

Let's consider the following case:

A micro batch contains 2 jobs and they read from two different kafka topics 
respectively. One of these jobs is failed due to some problem in the user 
defined logic. 

1. The main thread in the Spark streaming application will execute the line 
mentioned above, 

2. and another thread (checkpoint writer) will make a checkpoint file 
immediately after this line is executed. 

3. Then due to the current error handling mechanism in Spark Streaming, 
StreamingContext will be closed 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)

the user recovers from the checkpoint file, and because the JobSet containing 
the failed job has been removed (taken as completed) before the checkpoint is 
constructed, the data being processed by the failed job would never be 
reprocessed?


I might have missed something in the checkpoint thread or this 
handleJobCompletion()or it is a potential bug 

  was:
the current implementation of Spark streaming considers a batch is completed no 
matter the results of the jobs 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)

Let's consider the following case:

A micro batch contains 2 jobs and they read from two different kafka topics 
respectively. One of this job is failed due to some problem in the user defined 
logic. 

1. The main thread in the Spark streaming application will execute the line 
mentioned above, 

2. and another thread (checkpoint writer) will make a checkpoint file 
immediately after this line is executed. 

3. Then due to the current error handling mechanism in Spark Streaming, 
StreamingContext will be closed 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)

the user recovers from the checkpoint file, and because the JobSet containing 
the failed job has been removed (taken as completed) before the checkpoint is 
constructed, the data being processed by the failed job would never be 
reprocessed?


I might have missed something in the checkpoint thread or this 
handleJobCompletion()or it is a potential bug 


> Potential Issue of Semantics of BatchCompleted
> --
>
> Key: SPARK-18905
> URL: https://issues.apache.org/jira/browse/SPARK-18905
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Nan Zhu
>
> the current implementation of Spark streaming considers a batch is completed 
> no matter the results of the jobs 
> (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)
> Let's consider the following case:
> A micro batch contains 2 jobs and they read from two different kafka topics 
> respectively. One of these jobs is failed due to some problem in the user 
> defined logic. 
> 1. The main thread in the Spark streaming application will execute the line 
> mentioned above, 
> 2. and another thread (checkpoint writer) will make a checkpoint file 
> immediately after this line is executed. 
> 3. Then due to the current error handling mechanism in Spark Streaming, 
> StreamingContext will be closed 
> (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)
> the user recovers from the checkpoint file, and because the JobSet containing 
> the failed job has been removed (taken as completed) before the checkpoint is 
> constructed, the data being processed by the failed job would never be 
> reprocessed?
> I might have missed something in the checkpoint thread or this 
> handleJobCompletion()or it is a potential bug 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2016-12-16 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-18905:

Description: 
the current implementation of Spark streaming considers a batch is completed no 
matter the results of the jobs 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)

Let's consider the following case:

A micro batch contains 2 jobs and they read from two different kafka topics 
respectively. One of this job is failed due to some problem in the user defined 
logic. 

1. The main thread in the Spark streaming application will execute the line 
mentioned above, 

2. and another thread (checkpoint writer) will make a checkpoint file 
immediately after this line is executed. 

3. Then due to the current error handling mechanism in Spark Streaming, 
StreamingContext will be closed 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)

the user recovers from the checkpoint file, and because the JobSet containing 
the failed job has been removed (taken as completed) before the checkpoint is 
constructed, the data being processed by the failed job would never be 
reprocessed?


I might have missed something in the checkpoint thread or this 
handleJobCompletion()or it is a potential bug 

  was:
the current implementation of Spark streaming considers a batch is completed no 
matter the result of the jobs 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)

Let's consider the following case:

A micro batch contains 2 jobs and they read from two different kafka topics 
respectively. One of this job is failed due to some problem in the user defined 
logic. 

1. The main thread in the Spark streaming application will execute the line 
mentioned above, 

2. and another thread (checkpoint writer) will make a checkpoint file 
immediately after this line is executed. 

3. Then due to the current error handling mechanism in Spark Streaming, 
StreamingContext will be closed 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)

the user recovers from the checkpoint file, and because the JobSet containing 
the failed job has been removed (taken as completed) before the checkpoint is 
constructed, the data being processed by the failed job would never be 
reprocessed?


I might have missed something in the checkpoint thread or this 
handleJobCompletion()or it is a potential bug 


> Potential Issue of Semantics of BatchCompleted
> --
>
> Key: SPARK-18905
> URL: https://issues.apache.org/jira/browse/SPARK-18905
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Nan Zhu
>
> the current implementation of Spark streaming considers a batch is completed 
> no matter the results of the jobs 
> (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)
> Let's consider the following case:
> A micro batch contains 2 jobs and they read from two different kafka topics 
> respectively. One of this job is failed due to some problem in the user 
> defined logic. 
> 1. The main thread in the Spark streaming application will execute the line 
> mentioned above, 
> 2. and another thread (checkpoint writer) will make a checkpoint file 
> immediately after this line is executed. 
> 3. Then due to the current error handling mechanism in Spark Streaming, 
> StreamingContext will be closed 
> (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)
> the user recovers from the checkpoint file, and because the JobSet containing 
> the failed job has been removed (taken as completed) before the checkpoint is 
> constructed, the data being processed by the failed job would never be 
> reprocessed?
> I might have missed something in the checkpoint thread or this 
> handleJobCompletion()or it is a potential bug 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2016-12-16 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-18905:
---

 Summary: Potential Issue of Semantics of BatchCompleted
 Key: SPARK-18905
 URL: https://issues.apache.org/jira/browse/SPARK-18905
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.0.2, 2.0.1, 2.0.0
Reporter: Nan Zhu


the current implementation of Spark streaming considers a batch is completed no 
matter the result of the jobs 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)

Let's consider the following case:

A micro batch contains 2 jobs and they read from two different kafka topics 
respectively. One of this job is failed due to some problem in the user defined 
logic. 

1. The main thread in the Spark streaming application will execute the line 
mentioned above, 

2. and another thread (checkpoint writer) will make a checkpoint file 
immediately after this line is executed. 

3. Then due to the current error handling mechanism in Spark Streaming, 
StreamingContext will be closed 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)

the user recovers from the checkpoint file, and because the JobSet containing 
the failed job has been removed (taken as completed) before the checkpoint is 
constructed, the data being processed by the failed job would never be 
reprocessed?


I might have missed something in the checkpoint thread or this 
handleJobCompletion()or it is a potential bug 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18903) uiWebUrl is not accessible to SparkR

2016-12-16 Thread Diogo Munaro Vieira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755440#comment-15755440
 ] 

Diogo Munaro Vieira commented on SPARK-18903:
-

Hey Sean, I don't know of a way to find out which port my jobs are running on 
from SparkR. Do you?

Right now I'm using JupyterHub with a lot of users, and each user should be able 
to check their job progress.

Even on my own PC I don't know which port Spark is using from an R notebook.

> uiWebUrl is not accessible to SparkR
> 
>
> Key: SPARK-18903
> URL: https://issues.apache.org/jira/browse/SPARK-18903
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, SparkR, Web UI
>Affects Versions: 2.0.2
>Reporter: Diogo Munaro Vieira
>Priority: Minor
>
> Like https://issues.apache.org/jira/browse/SPARK-17437 uiWebUrl is not 
> accessible to SparkR context



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18904) Merge two FileStreamSourceSuite files

2016-12-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18904:
-
Priority: Minor  (was: Trivial)

> Merge two FileStreamSourceSuite files
> -
>
> Key: SPARK-18904
> URL: https://issues.apache.org/jira/browse/SPARK-18904
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> There are two FileStreamSourceSuite files and it's confusing. We should just 
> merge them into one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18904) Merge two FileStreamSourceSuite files

2016-12-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18904:
-
Priority: Trivial  (was: Major)

> Merge two FileStreamSourceSuite files
> -
>
> Key: SPARK-18904
> URL: https://issues.apache.org/jira/browse/SPARK-18904
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Trivial
>
> There are two FileStreamSourceSuite files and it's confusing. We should just 
> merge them into one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18904) Merge two FileStreamSourceSuite files

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18904:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Merge two FileStreamSourceSuite files
> -
>
> Key: SPARK-18904
> URL: https://issues.apache.org/jira/browse/SPARK-18904
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> There are two FileStreamSourceSuite files and it's confusing. We should just 
> merge them into one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18904) Merge two FileStreamSourceSuite files

2016-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18904:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Merge two FileStreamSourceSuite files
> -
>
> Key: SPARK-18904
> URL: https://issues.apache.org/jira/browse/SPARK-18904
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> There are two FileStreamSourceSuite files and it's confusing. We should just 
> merge them into one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18904) Merge two FileStreamSourceSuite files

2016-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755436#comment-15755436
 ] 

Apache Spark commented on SPARK-18904:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16315

> Merge two FileStreamSourceSuite files
> -
>
> Key: SPARK-18904
> URL: https://issues.apache.org/jira/browse/SPARK-18904
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> There are two FileStreamSourceSuite files and it's confusing. We should just 
> merge them into one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18904) Merge two FileStreamSourceSuite files

2016-12-16 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18904:


 Summary: Merge two FileStreamSourceSuite files
 Key: SPARK-18904
 URL: https://issues.apache.org/jira/browse/SPARK-18904
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming, Tests
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


There are two FileStreamSourceSuite files and it's confusing. We should just 
merge them into one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18902) Include Apache License in R source Package

2016-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755431#comment-15755431
 ] 

Sean Owen commented on SPARK-18902:
---

Oh, I missed DESCRIPTION. If that's the standard place to put it, then I think 
this is a non-issue; sorry for the noise.
PySpark may still need a touch-up.

> Include Apache License in R source Package
> --
>
> Key: SPARK-18902
> URL: https://issues.apache.org/jira/browse/SPARK-18902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Shivaram Venkataraman
>
> Per [~srowen]'s email on the dev mailing list
> {quote}
> I don't see an Apache license / notice for the Pyspark or SparkR artifacts. 
> It would be good practice to include this in a convenience binary. I'm not 
> sure if it's strictly mandatory, but something to adjust in any event. I 
> think that's all there is to do for SparkR
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18903) uiWebUrl is not accessible to SparkR

2016-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755428#comment-15755428
 ] 

Sean Owen commented on SPARK-18903:
---

(Does it really need to be?)

> uiWebUrl is not accessible to SparkR
> 
>
> Key: SPARK-18903
> URL: https://issues.apache.org/jira/browse/SPARK-18903
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, SparkR, Web UI
>Affects Versions: 2.0.2
>Reporter: Diogo Munaro Vieira
>Priority: Minor
>
> As in https://issues.apache.org/jira/browse/SPARK-17437, uiWebUrl is not 
> accessible from the SparkR context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18903) uiWebUrl is not accessible to SparkR

2016-12-16 Thread Diogo Munaro Vieira (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diogo Munaro Vieira updated SPARK-18903:

Description: As in https://issues.apache.org/jira/browse/SPARK-17437, uiWebUrl 
is not accessible from the SparkR context.
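
For reference, the JVM side already exposes this as SparkContext.uiWebUrl (an 
Option[String] in recent versions); a minimal Scala sketch of the call that 
SparkR would mirror:

{code}
// The web UI URL is already available on the Scala/Java SparkContext
// (None when the UI is disabled); SPARK-18903 asks for the SparkR equivalent.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ui-url-demo").getOrCreate()
spark.sparkContext.uiWebUrl match {
  case Some(url) => println(s"Spark UI available at $url")
  case None      => println("Spark UI is disabled (spark.ui.enabled=false)")
}
spark.stop()
{code}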

> uiWebUrl is not accessible to SparkR
> 
>
> Key: SPARK-18903
> URL: https://issues.apache.org/jira/browse/SPARK-18903
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, SparkR, Web UI
>Affects Versions: 2.0.2
>Reporter: Diogo Munaro Vieira
>Priority: Minor
>
> As in https://issues.apache.org/jira/browse/SPARK-17437, uiWebUrl is not 
> accessible from the SparkR context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18903) uiWebUrl is not accessible to SparkR

2016-12-16 Thread Diogo Munaro Vieira (JIRA)
Diogo Munaro Vieira created SPARK-18903:
---

 Summary: uiWebUrl is not accessible to SparkR
 Key: SPARK-18903
 URL: https://issues.apache.org/jira/browse/SPARK-18903
 Project: Spark
  Issue Type: Improvement
  Components: Java API, SparkR, Web UI
Affects Versions: 2.0.2
Reporter: Diogo Munaro Vieira
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18902) Include Apache License in R source Package

2016-12-16 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755386#comment-15755386
 ] 

Shivaram Venkataraman commented on SPARK-18902:
---

A couple of points:
- We already include a DESCRIPTION file which lists the license as Apache.
- To do this, I think we just need to put a copy of LICENSE in R/pkg/ -- 
`R CMD build` should then pick it up automatically, I'd guess.

cc [~felixcheung]

> Include Apache License in R source Package
> --
>
> Key: SPARK-18902
> URL: https://issues.apache.org/jira/browse/SPARK-18902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Shivaram Venkataraman
>
> Per [~srowen]'s email on the dev mailing list
> {quote}
> I don't see an Apache license / notice for the Pyspark or SparkR artifacts. 
> It would be good practice to include this in a convenience binary. I'm not 
> sure if it's strictly mandatory, but something to adjust in any event. I 
> think that's all there is to do for SparkR
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17073) generate basic stats for column

2016-12-16 Thread Ioana Delaney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755367#comment-15755367
 ] 

Ioana Delaney commented on SPARK-17073:
---

[~mikewzh] FYI, I ran the ANALYZE COMPUTE STATISTICS FOR COLUMNS command on 
tables defined in a 1 TB TPCDS database. On November builds, the command worked 
well for a reasonable number of columns, but it failed as the number of columns 
increased. The workaround I used was to run the command on subsets of columns, 
which worked fine. I recently retried the command on a December build, and this 
time it completes successfully over a large range of columns and data (the 
store_sales table, with 2.88e+9 rows and 23 columns). Great work!
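
For anyone who has not used the command, a minimal Scala sketch of the per-batch 
workaround described above, assuming an active SparkSession named spark with the 
TPCDS tables already registered (the batch size of 8 is illustrative):

{code}
// Column statistics are gathered per named column, so a very wide table can be
// analyzed in smaller batches when a single command over all columns fails.
val allColumns = spark.table("store_sales").columns
allColumns.grouped(8).foreach { batch =>
  spark.sql(s"ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS ${batch.mkString(", ")}")
}
{code}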

> generate basic stats for column
> ---
>
> Key: SPARK-17073
> URL: https://issues.apache.org/jira/browse/SPARK-17073
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Zhenhua Wang
> Fix For: 2.1.0
>
>
> For a specified column, we need to generate basic stats including max, min, 
> number of nulls, number of distinct values, max column length, average column 
> length.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18902) Include Apache License in R source Package

2016-12-16 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-18902:
-

 Summary: Include Apache License in R source Package
 Key: SPARK-18902
 URL: https://issues.apache.org/jira/browse/SPARK-18902
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Shivaram Venkataraman


Per [~srowen]'s email on the dev mailing list

{quote}
I don't see an Apache license / notice for the Pyspark or SparkR artifacts. It 
would be good practice to include this in a convenience binary. I'm not sure if 
it's strictly mandatory, but something to adjust in any event. I think that's 
all there is to do for SparkR
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18648) spark-shell --jars option does not add jars to classpath on windows

2016-12-16 Thread Michel Lemay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755317#comment-15755317
 ] 

Michel Lemay commented on SPARK-18648:
--

It's related because the path given to textFile is on a different drive! The 
exception clearly comes from the jar path.


> spark-shell --jars option does not add jars to classpath on windows
> ---
>
> Key: SPARK-18648
> URL: https://issues.apache.org/jira/browse/SPARK-18648
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Affects Versions: 2.0.2
> Environment: Windows 7 x64
>Reporter: Michel Lemay
>  Labels: windows
>
> I can't import symbols from command line jars when in the shell:
> Adding jars via --jars:
> {code}
> spark-shell --master local[*] --jars path\to\deeplearning4j-core-0.7.0.jar
> {code}
> Same result if I add it through maven coordinates:
> {code}spark-shell --master local[*] --packages 
> org.deeplearning4j:deeplearning4j-core:0.7.0
> {code}
> I end up with:
> {code}
> scala> import org.deeplearning4j
> :23: error: object deeplearning4j is not a member of package org
>import org.deeplearning4j
> {code}
> NOTE: It is working as expected when running on linux.
> Sample output with --verbose:
> {code}
> Using properties file: null
> Parsed arguments:
>   master  local[*]
>   deployMode  null
>   executorMemory  null
>   executorCores   null
>   totalExecutorCores  null
>   propertiesFile  null
>   driverMemorynull
>   driverCores null
>   driverExtraClassPathnull
>   driverExtraLibraryPath  null
>   driverExtraJavaOptions  null
>   supervise   false
>   queue   null
>   numExecutorsnull
>   files   null
>   pyFiles null
>   archivesnull
>   mainClass   org.apache.spark.repl.Main
>   primaryResource spark-shell
>   nameSpark shell
>   childArgs   []
>   jars
> file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar
>   packagesnull
>   packagesExclusions  null
>   repositoriesnull
>   verbose true
> Spark properties used, including those specified through
>  --conf and those from the properties file null:
> Main class:
> org.apache.spark.repl.Main
> Arguments:
> System properties:
> SPARK_SUBMIT -> true
> spark.app.name -> Spark shell
> spark.jars -> 
> file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar
> spark.submit.deployMode -> client
> spark.master -> local[*]
> Classpath elements:
> file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar
> 16/11/30 08:30:49 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/11/30 08:30:51 WARN SparkContext: Use an existing SparkContext, some 
> configuration may not take effect.
> Spark context Web UI available at http://192.168.70.164:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1480512651325).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.2
>   /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> import org.deeplearning4j
> :23: error: object deeplearning4j is not a member of package org
>import org.deeplearning4j
>   ^
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18897) Fix SparkR SQL Test to drop test table

2016-12-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-18897:
--
Fix Version/s: (was: 2.1.0)
   2.2.0
   2.1.1

> Fix SparkR SQL Test to drop test table
> --
>
> Key: SPARK-18897
> URL: https://issues.apache.org/jira/browse/SPARK-18897
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Currently, the SparkR tests (`R/run-tests.sh`) succeed only once, because 
> `test_sparkSQL.R` does not clean up the test table `people`.
> As a result, test data accumulates at every run and the test cases fail.
> The following is the failure result for the second run.
> {code}
> Failed 
> -
> 1. Failure: create DataFrame from RDD (@test_sparkSQL.R#204) 
> ---
> collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to 
> c(16).
> Lengths differ: 2 vs 1
> 2. Failure: create DataFrame from RDD (@test_sparkSQL.R#206) 
> ---
> collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal 
> to c(176.5).
> Lengths differ: 2 vs 1
> {code}
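
The fix itself lives in the SparkR test code, but the pattern is general; a 
minimal Scala sketch (illustrative table contents, not the actual patch) of the 
idempotent setup/teardown that prevents this accumulation:

{code}
// Drop the test table when the test finishes so a second run starts from a
// clean catalog instead of appending to 'people'.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cleanup-demo").getOrCreate()
import spark.implicits._

val people = Seq(("Bob", 16, 176.5)).toDF("name", "age", "height")
try {
  people.write.mode("overwrite").saveAsTable("people")
  assert(spark.sql("SELECT age FROM people WHERE name = 'Bob'").count() == 1)
} finally {
  spark.sql("DROP TABLE IF EXISTS people")  // leave no state behind for the next run
  spark.stop()
}
{code}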



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18897) Fix SparkR SQL Test to drop test table

2016-12-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-18897:
--
Assignee: Dongjoon Hyun

> Fix SparkR SQL Test to drop test table
> --
>
> Key: SPARK-18897
> URL: https://issues.apache.org/jira/browse/SPARK-18897
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Currently, the SparkR tests (`R/run-tests.sh`) succeed only once, because 
> `test_sparkSQL.R` does not clean up the test table `people`.
> As a result, test data accumulates at every run and the test cases fail.
> The following is the failure result for the second run.
> {code}
> Failed 
> -
> 1. Failure: create DataFrame from RDD (@test_sparkSQL.R#204) 
> ---
> collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to 
> c(16).
> Lengths differ: 2 vs 1
> 2. Failure: create DataFrame from RDD (@test_sparkSQL.R#206) 
> ---
> collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal 
> to c(176.5).
> Lengths differ: 2 vs 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18897) Fix SparkR SQL Test to drop test table

2016-12-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-18897.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.3

Issue resolved by pull request 16310
[https://github.com/apache/spark/pull/16310]

> Fix SparkR SQL Test to drop test table
> --
>
> Key: SPARK-18897
> URL: https://issues.apache.org/jira/browse/SPARK-18897
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
> Fix For: 2.0.3, 2.1.0
>
>
> Currently, the SparkR tests (`R/run-tests.sh`) succeed only once, because 
> `test_sparkSQL.R` does not clean up the test table `people`.
> As a result, test data accumulates at every run and the test cases fail.
> The following is the failure result for the second run.
> {code}
> Failed 
> -
> 1. Failure: create DataFrame from RDD (@test_sparkSQL.R#204) 
> ---
> collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to 
> c(16).
> Lengths differ: 2 vs 1
> 2. Failure: create DataFrame from RDD (@test_sparkSQL.R#206) 
> ---
> collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal 
> to c(176.5).
> Lengths differ: 2 vs 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18901) Require in LR LogisticAggregator is redundant

2016-12-16 Thread yuhao yang (JIRA)
yuhao yang created SPARK-18901:
--

 Summary: Require in LR LogisticAggregator is redundant
 Key: SPARK-18901
 URL: https://issues.apache.org/jira/browse/SPARK-18901
 Project: Spark
  Issue Type: Improvement
Reporter: yuhao yang
Priority: Minor


require(numFeatures == features.size, ...)
require(weight >= 0.0, ...)

in LogisticAggregator (add and merge) will never be triggered, because the dimension 
and the weight have already been checked in MultivariateOnlineSummarizer.

Given how frequently add is called, the redundant require calls should be 
removed.
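
A simplified sketch of the pattern in question (not the actual MLlib source): 
add runs once per training instance, so a require that can never fire is pure 
per-row overhead.

{code}
// Simplified sketch, not the real LogisticAggregator: the per-instance checks in
// add() repeat validation that MultivariateOnlineSummarizer has already performed.
class LogisticAggregatorSketch(numFeatures: Int) extends Serializable {
  private var weightSum = 0.0
  private val gradientSum = new Array[Double](numFeatures)

  def add(features: Array[Double], weight: Double): this.type = {
    // Redundant when dimension and non-negative weight were validated upstream:
    require(features.length == numFeatures,
      s"Expected $numFeatures features but got ${features.length}.")
    require(weight >= 0.0, s"Instance weight $weight must be non-negative.")
    if (weight > 0.0) {
      weightSum += weight
      // ... accumulate gradient contributions for this instance ...
    }
    this
  }
}
{code}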



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18648) spark-shell --jars option does not add jars to classpath on windows

2016-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755223#comment-15755223
 ] 

Sean Owen commented on SPARK-18648:
---

That is unrelated though. It's an error from the path you pass to sc.textFile.

> spark-shell --jars option does not add jars to classpath on windows
> ---
>
> Key: SPARK-18648
> URL: https://issues.apache.org/jira/browse/SPARK-18648
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Affects Versions: 2.0.2
> Environment: Windows 7 x64
>Reporter: Michel Lemay
>  Labels: windows
>
> I can't import symbols from command line jars when in the shell:
> Adding jars via --jars:
> {code}
> spark-shell --master local[*] --jars path\to\deeplearning4j-core-0.7.0.jar
> {code}
> Same result if I add it through maven coordinates:
> {code}spark-shell --master local[*] --packages 
> org.deeplearning4j:deeplearning4j-core:0.7.0
> {code}
> I end up with:
> {code}
> scala> import org.deeplearning4j
> :23: error: object deeplearning4j is not a member of package org
>import org.deeplearning4j
> {code}
> NOTE: It is working as expected when running on linux.
> Sample output with --verbose:
> {code}
> Using properties file: null
> Parsed arguments:
>   master  local[*]
>   deployMode  null
>   executorMemory  null
>   executorCores   null
>   totalExecutorCores  null
>   propertiesFile  null
>   driverMemorynull
>   driverCores null
>   driverExtraClassPathnull
>   driverExtraLibraryPath  null
>   driverExtraJavaOptions  null
>   supervise   false
>   queue   null
>   numExecutorsnull
>   files   null
>   pyFiles null
>   archivesnull
>   mainClass   org.apache.spark.repl.Main
>   primaryResource spark-shell
>   nameSpark shell
>   childArgs   []
>   jars
> file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar
>   packagesnull
>   packagesExclusions  null
>   repositoriesnull
>   verbose true
> Spark properties used, including those specified through
>  --conf and those from the properties file null:
> Main class:
> org.apache.spark.repl.Main
> Arguments:
> System properties:
> SPARK_SUBMIT -> true
> spark.app.name -> Spark shell
> spark.jars -> 
> file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar
> spark.submit.deployMode -> client
> spark.master -> local[*]
> Classpath elements:
> file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar
> 16/11/30 08:30:49 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/11/30 08:30:51 WARN SparkContext: Use an existing SparkContext, some 
> configuration may not take effect.
> Spark context Web UI available at http://192.168.70.164:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1480512651325).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.2
>   /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> import org.deeplearning4j
> :23: error: object deeplearning4j is not a member of package org
>import org.deeplearning4j
>   ^
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables

2016-12-16 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755221#comment-15755221
 ] 

Ryan Blue commented on SPARK-16032:
---

+1

> Audit semantics of various insertion operations related to partitioned tables
> -
>
> Key: SPARK-16032
> URL: https://issues.apache.org/jira/browse/SPARK-16032
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>Priority: Critical
> Attachments: [SPARK-16032] Spark SQL table insertion auditing - 
> Google Docs.pdf
>
>
> We found that the semantics of various insertion operations related to 
> partitioned tables can be inconsistent. This is an umbrella ticket for all 
> related tickets.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables

2016-12-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16032.
---
Resolution: Done

> Audit semantics of various insertion operations related to partitioned tables
> -
>
> Key: SPARK-16032
> URL: https://issues.apache.org/jira/browse/SPARK-16032
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>Priority: Critical
> Attachments: [SPARK-16032] Spark SQL table insertion auditing - 
> Google Docs.pdf
>
>
> We found that the semantics of various insertion operations related to 
> partitioned tables can be inconsistent. This is an umbrella ticket for all 
> related tickets.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


