[jira] [Commented] (SPARK-22980) Wrong answer when using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315805#comment-16315805 ] Hyukjin Kwon commented on SPARK-22980: -- [~smilegator], May I ask why this was reopened and what I misunderstood? > Wrong answer when using pandas_udf > -- > > Key: SPARK-22980 > URL: https://issues.apache.org/jira/browse/SPARK-22980 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Blocker > > {noformat} > from pyspark.sql.functions import pandas_udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = pandas_udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > {noformat} > from pyspark.sql.functions import udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > The results of pandas_udf are different from udf. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
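For readers following this thread: the mismatch above comes from the calling convention, not the arithmetic. A plain udf is invoked once per row with scalar arguments, so len(x) is len('text') == 4; a pandas_udf is invoked once per batch with whole columns (pandas Series), so len(x) is the batch size, 3. A minimal pure-Python simulation of the two conventions (no PySpark required; FakeSeries is an illustrative stand-in for pandas.Series):

```python
class FakeSeries(list):
    """Minimal stand-in for pandas.Series: len() gives the batch size,
    and scalar + series broadcasts elementwise."""
    def __radd__(self, scalar):
        return FakeSeries(scalar + v for v in self)

f = lambda x, y: len(x) + y

# Plain udf: one call per row, scalar values in.
rows = [('text', 0), ('text', 1), ('text', 2)]
udf_result = [f(x, y) for x, y in rows]        # len('text') == 4 per row

# pandas_udf: one call per batch, whole columns in.
xs = FakeSeries(['text', 'text', 'text'])
ys = FakeSeries([0, 1, 2])
pandas_udf_result = list(f(xs, ys))            # len(batch) == 3, broadcast over ys
```

This reproduces the reported difference: the per-row variant yields [4, 5, 6] while the batched variant yields [3, 4, 5].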
[jira] [Updated] (SPARK-22989) Spark Streaming UI shows 0 records when a spark-streaming-kafka application restores from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-22989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoshijie updated SPARK-22989: --- Description: When a spark-streaming-kafka application restores from checkpoint, I find the Spark Streaming UI shows 0 records for each batch. !https://raw.githubusercontent.com/smdfj/picture/master/spark/batch.png! was: When a spark-streaming-kafka application restores from checkpoint, I find the Spark Streaming UI shows 0 records for each batch. !https://github.com/smdfj/picture/blob/master/spark/batch.png! > Spark Streaming UI shows 0 records when a spark-streaming-kafka application > restores from checkpoint > > > Key: SPARK-22989 > URL: https://issues.apache.org/jira/browse/SPARK-22989 > Project: Spark > Issue Type: Bug > Components: DStreams > Affects Versions: 2.2.0 > Reporter: zhaoshijie > > When a spark-streaming-kafka application restores from checkpoint, I find > the Spark Streaming UI shows 0 records for each batch. > !https://raw.githubusercontent.com/smdfj/picture/master/spark/batch.png!
[jira] [Created] (SPARK-22989) Spark Streaming UI shows 0 records when a spark-streaming-kafka application restores from checkpoint
zhaoshijie created SPARK-22989: -- Summary: Spark Streaming UI shows 0 records when a spark-streaming-kafka application restores from checkpoint Key: SPARK-22989 URL: https://issues.apache.org/jira/browse/SPARK-22989 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.2.0 Reporter: zhaoshijie When a spark-streaming-kafka application restores from checkpoint, I find the Spark Streaming UI shows 0 records for each batch. !https://github.com/smdfj/picture/blob/master/spark/batch.png!
[jira] [Updated] (SPARK-22988) Why does dataset's unpersist clear all the caches that have the same logical plan?
[ https://issues.apache.org/jira/browse/SPARK-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang Cheng updated SPARK-22988: --- Description: When I do the following:

dataset A = some dataset
A.persist
dataset B = A.doSomething
dataset C = A.doSomething
C.persist
A.unpersist

I found that C's cache is removed too, because of the following code:

def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
  val it = cachedData.iterator()
  while (it.hasNext) {
    val cd = it.next()
    if (cd.plan.find(_.sameResult(plan)).isDefined) {
      cd.cachedRepresentation.cachedColumnBuffers.unpersist(blocking)
      it.remove()
    }
  }
}

It removes all data caches that contain the same logical plan. Should it only remove the cache whose dataset called the unpersist method?

was: (same description)

> Why does dataset's unpersist clear all the caches that have the same logical plan?
> - > > Key: SPARK-22988 > URL: https://issues.apache.org/jira/browse/SPARK-22988 > Project: Spark > Issue Type: Question > Components: Spark Core > Affects Versions: 2.1.1 > Reporter: Wang Cheng > Priority: Minor > > When I do the following: > dataset A = some dataset > A.persist > dataset B = A.doSomething > dataset C = A.doSomething > C.persist > A.unpersist > I found that C's cache is removed too, because of the following code: > def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock { > val it = cachedData.iterator() > while (it.hasNext) { > val cd = it.next() > if (cd.plan.find(_.sameResult(plan)).isDefined) { > cd.cachedRepresentation.cachedColumnBuffers.unpersist(blocking) > it.remove() > } > } > } > It removes all data caches that contain the same logical plan. Should it only > remove the cache whose dataset called the unpersist method?
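The behavior asked about follows from keying the cache on logical-plan equality (cd.plan.find(_.sameResult(plan))) rather than on dataset identity: C's cached plan contains A's plan as a subtree, so unpersisting A matches and drops C's entry as well. A rough Python sketch of that invalidation rule (illustrative names; substring containment stands in for sameResult subtree matching):

```python
# Sketch: a cache keyed by logical plan. Unpersisting drops EVERY entry
# whose plan contains the unpersisted plan, even entries persisted
# through a different Dataset object.

cached = [
    {'name': 'A', 'plan': 'scan'},            # A.persist
    {'name': 'C', 'plan': 'scan+transform'},  # C.persist (built from A)
]

def unpersist_by_plan(plan):
    # Mirrors uncacheQuery: remove every entry containing the plan.
    cached[:] = [cd for cd in cached if plan not in cd['plan']]

unpersist_by_plan('scan')  # A.unpersist
# Both entries' plans contain 'scan', so C's cache is gone too.
```

This is why unpersisting A silently evicts C in the report above.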
[jira] [Reopened] (SPARK-22980) Wrong answer when using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-22980: - > Wrong answer when using pandas_udf > -- > > Key: SPARK-22980 > URL: https://issues.apache.org/jira/browse/SPARK-22980 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Blocker > > {noformat} > from pyspark.sql.functions import pandas_udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = pandas_udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > {noformat} > from pyspark.sql.functions import udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > The results of pandas_udf are different from udf. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22988) Why does dataset's unpersist clear all the caches that have the same logical plan?
[ https://issues.apache.org/jira/browse/SPARK-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang Cheng updated SPARK-22988: --- Description: When I do the following:

dataset A = some dataset
A.persist
dataset B = A.doSomething
dataset C = A.doSomething
C.persist
A.unpersist

I found that C's cache is removed too, because of the following code:

def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
  val it = cachedData.iterator()
  while (it.hasNext) {
    val cd = it.next()
    if (cd.plan.find(_.sameResult(plan)).isDefined) {
      cd.cachedRepresentation.cachedColumnBuffers.unpersist(blocking)
      it.remove()
    }
  }
}

It removes all data caches that contain the same logical plan. Should it only remove the cache whose dataset called the unpersist method?

was: (same description)

> Why does dataset's unpersist clear all the caches that have the same logical plan?
> - > > Key: SPARK-22988 > URL: https://issues.apache.org/jira/browse/SPARK-22988 > Project: Spark > Issue Type: Question > Components: Spark Core > Affects Versions: 2.1.1 > Reporter: Wang Cheng > Priority: Minor > > When I do the following: > dataset A = some dataset > A.persist > dataset B = A.doSomething > dataset C = A.doSomething > C.persist > A.unpersist > I found that C's cache is removed too, because of the following code: > def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock { > val it = cachedData.iterator() > while (it.hasNext) { > val cd = it.next() > if (cd.plan.find(_.sameResult(plan)).isDefined) { > cd.cachedRepresentation.cachedColumnBuffers.unpersist(blocking) > it.remove() > } > } > } > It removes all data caches that contain the same logical plan. Should it only > remove the cache whose dataset called the unpersist method?
[jira] [Created] (SPARK-22988) Why does dataset's unpersist clear all the caches that have the same logical plan?
Wang Cheng created SPARK-22988: -- Summary: Why does dataset's unpersist clear all the caches that have the same logical plan? Key: SPARK-22988 URL: https://issues.apache.org/jira/browse/SPARK-22988 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.1.1 Reporter: Wang Cheng Priority: Minor When I do the following:

dataset A = some dataset
A.persist
dataset B = A.doSomething
dataset C = A.doSomething
C.persist
A.unpersist

I found that C's cache is removed too, because of the following code:

def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
  val it = cachedData.iterator()
  while (it.hasNext) {
    val cd = it.next()
    if (cd.plan.find(_.sameResult(plan)).isDefined) {
      cd.cachedRepresentation.cachedColumnBuffers.unpersist(blocking)
      it.remove()
    }
  }
}

It removes all data caches that contain the same logical plan. Should it only remove the cache whose dataset called the unpersist method?
[jira] [Resolved] (SPARK-22979) Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
[ https://issues.apache.org/jira/browse/SPARK-22979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22979. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.3.0 > Avoid per-record type dispatch in Python data conversion > (EvaluatePython.fromJava) > -- > > Key: SPARK-22979 > URL: https://issues.apache.org/jira/browse/SPARK-22979 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL > Affects Versions: 2.3.0 > Reporter: Hyukjin Kwon > Assignee: Hyukjin Kwon > Fix For: 2.3.0 > > > It seems we do per-record type dispatch when converting Java objects (from > Pyrolite) to Spark's internal data format. > See > https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L89-L162 > It looks like we can create one converter per type and then reuse it.
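The resolved improvement replaces per-record type dispatch with converters built once per field type and reused for every record. A small Python sketch of the idea (make_converter and the type mapping are illustrative, not the EvaluatePython code):

```python
# Sketch: resolve the type dispatch once per schema field, then apply the
# resulting converters to every record, instead of inspecting types per record.

def make_converter(dtype):
    # One-time dispatch: return a closure specialized for this type.
    if dtype is int:
        return lambda v: v                    # pretend ints pass through
    if dtype is str:
        return lambda v: v.encode('utf-8')    # pretend strings become bytes
    raise TypeError("unsupported type: %r" % dtype)

schema = [int, str]
converters = [make_converter(t) for t in schema]  # built once, up front

rows = [(1, 'a'), (2, 'b')]
converted = [tuple(c(v) for c, v in zip(converters, row)) for row in rows]
```

The per-record loop now does no type inspection at all; the dispatch cost is paid once per column.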
[jira] [Resolved] (SPARK-22566) Better error message for `_merge_type` in Pandas to Spark DF conversion
[ https://issues.apache.org/jira/browse/SPARK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-22566. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19792 [https://github.com/apache/spark/pull/19792] > Better error message for `_merge_type` in Pandas to Spark DF conversion > --- > > Key: SPARK-22566 > URL: https://issues.apache.org/jira/browse/SPARK-22566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Guilherme Berger >Assignee: Guilherme Berger >Priority: Minor > Fix For: 2.3.0 > > > When creating a Spark DF from a Pandas DF without specifying a schema, schema > inference is used. This inference can fail when a column contains values of > two different types; this is ok. The problem is the error message does not > tell us in which column this happened. > When this happens, it is painful to debug since the error message is too > vague. > I plan on submitting a PR which fixes this, providing a better error message > for such cases, containing the column name (and possibly the problematic > values too). 
> >>> spark_session.createDataFrame(pandas_df) > File "redacted/pyspark/sql/session.py", line 541, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal > struct = self._inferSchemaFromList(data) > File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList > schema = reduce(_merge_type, map(_infer_schema, data)) > File "redacted/pyspark/sql/types.py", line 1124, in _merge_type > for f in a.fields] > File "redacted/pyspark/sql/types.py", line 1118, in _merge_type > raise TypeError("Can not merge type %s and %s" % (type(a), type(b))) > TypeError: Can not merge type <class '...'> and <class 'pyspark.sql.types.StringType'>
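The proposed fix can be sketched by threading the field name through the recursive merge so the TypeError names the offending column. A hedged pure-Python approximation (merge_type below is illustrative, not PySpark's actual _merge_type; plain values stand in for inferred types):

```python
def merge_type(a, b, name='value'):
    """Merge two inferred 'types', naming the offending field on conflict."""
    if type(a) is not type(b):
        # The improvement: include the field name in the error message.
        raise TypeError(
            "Can not merge type %s and %s in field '%s'"
            % (type(a).__name__, type(b).__name__, name))
    if isinstance(a, dict):  # struct-like value: recurse per field
        return {k: merge_type(a[k], b[k], name=k) for k in a}
    return a

# Two 'rows' whose 'label' field has conflicting types (int vs str).
try:
    merge_type({'price': 1.0, 'label': 2}, {'price': 2.0, 'label': 'x'})
    msg = ''
except TypeError as e:
    msg = str(e)
print(msg)  # the error now names the 'label' field
```

Compared with the traceback above, the failing column is named directly instead of leaving the user to guess.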
[jira] [Assigned] (SPARK-22566) Better error message for `_merge_type` in Pandas to Spark DF conversion
[ https://issues.apache.org/jira/browse/SPARK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-22566: - Assignee: Guilherme Berger > Better error message for `_merge_type` in Pandas to Spark DF conversion > --- > > Key: SPARK-22566 > URL: https://issues.apache.org/jira/browse/SPARK-22566 > Project: Spark > Issue Type: Improvement > Components: PySpark > Affects Versions: 2.2.0 > Reporter: Guilherme Berger > Assignee: Guilherme Berger > Priority: Minor > > When creating a Spark DF from a Pandas DF without specifying a schema, schema > inference is used. This inference can fail when a column contains values of > two different types; this is ok. The problem is that the error message does not > tell us in which column this happened. > When this happens, it is painful to debug, since the error message is too > vague. > I plan on submitting a PR which fixes this, providing a better error message > for such cases, containing the column name (and possibly the problematic > values too). > >>> spark_session.createDataFrame(pandas_df) > File "redacted/pyspark/sql/session.py", line 541, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal > struct = self._inferSchemaFromList(data) > File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList > schema = reduce(_merge_type, map(_infer_schema, data)) > File "redacted/pyspark/sql/types.py", line 1124, in _merge_type > for f in a.fields] > File "redacted/pyspark/sql/types.py", line 1118, in _merge_type > raise TypeError("Can not merge type %s and %s" % (type(a), type(b))) > TypeError: Can not merge type <class '...'> and <class 'pyspark.sql.types.StringType'>
[jira] [Commented] (SPARK-22987) UnsafeExternalSorter causes OOM when invoking the `getIterator` function.
[ https://issues.apache.org/jira/browse/SPARK-22987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315657#comment-16315657 ] Apache Spark commented on SPARK-22987: -- User 'liutang123' has created a pull request for this issue: https://github.com/apache/spark/pull/20184 > UnsafeExternalSorter causes OOM when invoking the `getIterator` function. > > > Key: SPARK-22987 > URL: https://issues.apache.org/jira/browse/SPARK-22987 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.2.0, 2.2.1 > Reporter: Lijia Liu > > UnsafeExternalSorter.ChainedIterator maintains a Queue of UnsafeSorterIterator. > When the `getIterator` function of UnsafeExternalSorter is called, it passes an > ArrayList of UnsafeSorterSpillReader to the ChainedIterator constructor. But each > UnsafeSorterSpillReader maintains a byte array as a buffer, whose capacity is more > than 1 MB. When spilling frequently, this may cause OOM. > I try to change the Queue in ChainedIterator to an Iterator.
[jira] [Assigned] (SPARK-22987) UnsafeExternalSorter causes OOM when invoking the `getIterator` function.
[ https://issues.apache.org/jira/browse/SPARK-22987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22987: Assignee: (was: Apache Spark) > UnsafeExternalSorter causes OOM when invoking the `getIterator` function. > > > Key: SPARK-22987 > URL: https://issues.apache.org/jira/browse/SPARK-22987 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.2.0, 2.2.1 > Reporter: Lijia Liu > > UnsafeExternalSorter.ChainedIterator maintains a Queue of UnsafeSorterIterator. > When the `getIterator` function of UnsafeExternalSorter is called, it passes an > ArrayList of UnsafeSorterSpillReader to the ChainedIterator constructor. But each > UnsafeSorterSpillReader maintains a byte array as a buffer, whose capacity is more > than 1 MB. When spilling frequently, this may cause OOM. > I try to change the Queue in ChainedIterator to an Iterator.
[jira] [Assigned] (SPARK-22987) UnsafeExternalSorter causes OOM when invoking the `getIterator` function.
[ https://issues.apache.org/jira/browse/SPARK-22987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22987: Assignee: Apache Spark > UnsafeExternalSorter causes OOM when invoking the `getIterator` function. > > > Key: SPARK-22987 > URL: https://issues.apache.org/jira/browse/SPARK-22987 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.2.0, 2.2.1 > Reporter: Lijia Liu > Assignee: Apache Spark > > UnsafeExternalSorter.ChainedIterator maintains a Queue of UnsafeSorterIterator. > When the `getIterator` function of UnsafeExternalSorter is called, it passes an > ArrayList of UnsafeSorterSpillReader to the ChainedIterator constructor. But each > UnsafeSorterSpillReader maintains a byte array as a buffer, whose capacity is more > than 1 MB. When spilling frequently, this may cause OOM. > I try to change the Queue in ChainedIterator to an Iterator.
[jira] [Created] (SPARK-22987) UnsafeExternalSorter causes OOM when invoking the `getIterator` function.
Lijia Liu created SPARK-22987: - Summary: UnsafeExternalSorter causes OOM when invoking the `getIterator` function. Key: SPARK-22987 URL: https://issues.apache.org/jira/browse/SPARK-22987 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.1, 2.2.0 Reporter: Lijia Liu UnsafeExternalSorter.ChainedIterator maintains a Queue of UnsafeSorterIterator. When the `getIterator` function of UnsafeExternalSorter is called, it passes an ArrayList of UnsafeSorterSpillReader to the ChainedIterator constructor. But each UnsafeSorterSpillReader maintains a byte array as a buffer, whose capacity is more than 1 MB. When spilling frequently, this may cause OOM. I try to change the Queue in ChainedIterator to an Iterator.
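The memory issue described above is easiest to see as arithmetic: if getIterator constructs every UnsafeSorterSpillReader up front, then with N spill files roughly N x 1 MB of read buffers are live at once; constructing readers lazily from an Iterator keeps roughly one buffer live at a time. A back-of-the-envelope sketch (the spill count is illustrative):

```python
BUFFER_BYTES = 1 << 20  # each UnsafeSorterSpillReader holds a >1 MB buffer

def peak_buffer_memory(n_spills, lazy):
    """Estimate peak read-buffer memory when chaining n spill readers.

    Eager (a Queue of already-constructed readers): all n buffers coexist.
    Lazy (an Iterator constructing readers on demand): about one at a time.
    """
    concurrent = 1 if lazy else n_spills
    return concurrent * BUFFER_BYTES

# With 500 spills: ~500 MB of buffers up front vs ~1 MB when lazy.
eager = peak_buffer_memory(500, lazy=False)
lazy = peak_buffer_memory(500, lazy=True)
```

This is the motivation for the proposed change from a Queue to an Iterator in ChainedIterator.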
[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315648#comment-16315648 ] Prateek commented on SPARK-22711: - [~hyukjin.kwon] Code updated. Make sure you have downloaded nltk, networkx, punkt, averaged_perceptron_tagger, wordnet > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit > Affects Versions: 2.2.0, 2.2.1 > Environment: Ubuntu pseudo-distributed installation of Spark 2.2.0 > Reporter: Prateek > Attachments: Jira_Spark_minimized_code.py > > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a PySpark program with the spark-submit command, this error is > thrown. > It happens for code like the following: > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > Traceback (most recent call last): > File "/home/prateek/Project/textrank.py", line 299, in > summaryRDD = sentenceTokensReduceRDD.map(lambda m: > get_summary(m)).reduceByKey(lambda c,v :c+v) > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, > in reduceByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, > in combineByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, > in partitionBy > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, > in _jrdd > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, > in _wrap_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, > in _prepare_for_python_RDD > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 460, in
dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 704, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 148, in dump > File "/usr/lib/python3.5/pickle.py", line 408, in dump > self.save(obj) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File 
"/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 4
[jira] [Updated] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prateek updated SPARK-22711: Attachment: Jira_Spark_minimized_code.py updated missing code > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit > Affects Versions: 2.2.0, 2.2.1 > Environment: Ubuntu pseudo-distributed installation of Spark 2.2.0 > Reporter: Prateek > Attachments: Jira_Spark_minimized_code.py > > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a PySpark program with the spark-submit command, this error is > thrown. > It happens for code like the following: > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > (quoted traceback same as in the comment above)
[jira] [Updated] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prateek updated SPARK-22711: Attachment: (was: Jira_Spark_minimized_code.py) > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit > Affects Versions: 2.2.0, 2.2.1 > Environment: Ubuntu pseudo-distributed installation of Spark 2.2.0 > Reporter: Prateek > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a PySpark program with the spark-submit command, this error is > thrown. > It happens for code like the following: > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > (quoted traceback same as in the comment above)
[jira] [Resolved] (SPARK-22985) Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen
[ https://issues.apache.org/jira/browse/SPARK-22985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22985. - Resolution: Fixed Fix Version/s: 2.3.0 > Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen > -- > > Key: SPARK-22985 > URL: https://issues.apache.org/jira/browse/SPARK-22985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.3.0 > > > The from_utc_timestamp and to_utc_timestamp expressions do not properly > escape their timezone argument in codegen, leading to compilation errors > instead of analysis errors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
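The bug class described here, generated Java source built by interpolating a user-supplied string without escaping it, can be sketched in a few lines. This is not Spark's actual codegen; `fromUTC` is a made-up placeholder function name.

```python
# Sketch of the escaping bug class: splicing the timezone argument straight
# into generated source breaks compilation when the string contains quote
# or backslash characters; escaping first keeps the generated literal intact.
def gen_unescaped(tz: str) -> str:
    return f'fromUTC(input, "{tz}")'

def gen_escaped(tz: str) -> str:
    safe = tz.replace("\\", "\\\\").replace('"', '\\"')
    return f'fromUTC(input, "{safe}")'

tz = 'GMT"garbage'
print(gen_unescaped(tz))  # the embedded quote ends the string literal early
print(gen_escaped(tz))    # the quote survives as \" inside the literal
```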
[jira] [Commented] (SPARK-22986) Avoid instantiating multiple instances of broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-22986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315615#comment-16315615 ] Apache Spark commented on SPARK-22986: -- User 'ho3rexqj' has created a pull request for this issue: https://github.com/apache/spark/pull/20183 > Avoid instantiating multiple instances of broadcast variables > -- > > Key: SPARK-22986 > URL: https://issues.apache.org/jira/browse/SPARK-22986 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: ho3rexqj > > When resources happen to be constrained on an executor the first time a > broadcast variable is instantiated it is persisted to disk by the > BlockManager. Consequently, every subsequent call to > TorrentBroadcast::readBroadcastBlock from other instances of that broadcast > variable spawns another instance of the underlying value. That is, broadcast > variables are spawned once per executor *unless* memory is constrained, in > which case every instance of a broadcast variable is provided with a unique > copy of the underlying value. > The fix I propose is to explicitly cache the underlying values using weak > references (in a ReferenceMap) - note, however, that I couldn't find a clean > approach to creating the cache container here. I added that to > BroadcastManager as a package-private field for want of a better solution, > however if something more appropriate already exists in the project for that > purpose please let me know. > The above issue was terminating our team's applications erratically - > effectively, we were distributing roughly 1 GiB of data through a broadcast > variable and under certain conditions memory was constrained the first time > the broadcast variable was loaded on an executor. 
As such, the executor > attempted to spawn several additional copies of the broadcast variable (we > were using 8 worker threads on the executor) which quickly led to the task > failing as the result of an OOM exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22986) Avoid instantiating multiple instances of broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-22986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22986: Assignee: Apache Spark > Avoid instantiating multiple instances of broadcast variables > -- > > Key: SPARK-22986 > URL: https://issues.apache.org/jira/browse/SPARK-22986 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: ho3rexqj >Assignee: Apache Spark > > When resources happen to be constrained on an executor the first time a > broadcast variable is instantiated it is persisted to disk by the > BlockManager. Consequently, every subsequent call to > TorrentBroadcast::readBroadcastBlock from other instances of that broadcast > variable spawns another instance of the underlying value. That is, broadcast > variables are spawned once per executor *unless* memory is constrained, in > which case every instance of a broadcast variable is provided with a unique > copy of the underlying value. > The fix I propose is to explicitly cache the underlying values using weak > references (in a ReferenceMap) - note, however, that I couldn't find a clean > approach to creating the cache container here. I added that to > BroadcastManager as a package-private field for want of a better solution, > however if something more appropriate already exists in the project for that > purpose please let me know. > The above issue was terminating our team's applications erratically - > effectively, we were distributing roughly 1 GiB of data through a broadcast > variable and under certain conditions memory was constrained the first time > the broadcast variable was loaded on an executor. As such, the executor > attempted to spawn several additional copies of the broadcast variable (we > were using 8 worker threads on the executor) which quickly led to the task > failing as the result of an OOM exception. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22986) Avoid instantiating multiple instances of broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-22986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22986: Assignee: (was: Apache Spark) > Avoid instantiating multiple instances of broadcast variables > -- > > Key: SPARK-22986 > URL: https://issues.apache.org/jira/browse/SPARK-22986 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: ho3rexqj > > When resources happen to be constrained on an executor the first time a > broadcast variable is instantiated it is persisted to disk by the > BlockManager. Consequently, every subsequent call to > TorrentBroadcast::readBroadcastBlock from other instances of that broadcast > variable spawns another instance of the underlying value. That is, broadcast > variables are spawned once per executor *unless* memory is constrained, in > which case every instance of a broadcast variable is provided with a unique > copy of the underlying value. > The fix I propose is to explicitly cache the underlying values using weak > references (in a ReferenceMap) - note, however, that I couldn't find a clean > approach to creating the cache container here. I added that to > BroadcastManager as a package-private field for want of a better solution, > however if something more appropriate already exists in the project for that > purpose please let me know. > The above issue was terminating our team's applications erratically - > effectively, we were distributing roughly 1 GiB of data through a broadcast > variable and under certain conditions memory was constrained the first time > the broadcast variable was loaded on an executor. As such, the executor > attempted to spawn several additional copies of the broadcast variable (we > were using 8 worker threads on the executor) which quickly led to the task > failing as the result of an OOM exception. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22986) Avoid instantiating multiple instances of broadcast variables
ho3rexqj created SPARK-22986: Summary: Avoid instantiating multiple instances of broadcast variables Key: SPARK-22986 URL: https://issues.apache.org/jira/browse/SPARK-22986 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Reporter: ho3rexqj When resources happen to be constrained on an executor the first time a broadcast variable is instantiated it is persisted to disk by the BlockManager. Consequently, every subsequent call to TorrentBroadcast::readBroadcastBlock from other instances of that broadcast variable spawns another instance of the underlying value. That is, broadcast variables are spawned once per executor *unless* memory is constrained, in which case every instance of a broadcast variable is provided with a unique copy of the underlying value. The fix I propose is to explicitly cache the underlying values using weak references (in a ReferenceMap) - note, however, that I couldn't find a clean approach to creating the cache container here. I added that to BroadcastManager as a package-private field for want of a better solution, however if something more appropriate already exists in the project for that purpose please let me know. The above issue was terminating our team's applications erratically - effectively, we were distributing roughly 1 GiB of data through a broadcast variable and under certain conditions memory was constrained the first time the broadcast variable was loaded on an executor. As such, the executor attempted to spawn several additional copies of the broadcast variable (we were using 8 worker threads on the executor) which quickly led to the task failing as the result of an OOM exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
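The proposed fix, caching broadcast values behind weak references so repeated reads return one shared copy, can be sketched in Python. The real change lives in Scala's BroadcastManager using a ReferenceMap; the names below are illustrative only.

```python
import weakref

class Payload:
    """Stand-in for a large broadcast value (plain lists cannot be weakly
    referenced in CPython, so wrap the data in an object)."""
    def __init__(self, data):
        self.data = data

class BroadcastCache:
    """Illustrative value cache: while any task still holds the value,
    later reads of the same broadcast id reuse it; once all strong
    references drop, the weak map lets it be garbage collected."""
    def __init__(self):
        self._values = weakref.WeakValueDictionary()

    def get_or_load(self, broadcast_id, load):
        value = self._values.get(broadcast_id)
        if value is None:            # cache miss: materialize exactly once
            value = load()
            self._values[broadcast_id] = value
        return value

cache = BroadcastCache()
loads = []

def load():
    loads.append(1)
    return Payload(["~1 GiB of data"])

first = cache.get_or_load(0, load)
second = cache.get_or_load(0, load)
print(first is second, len(loads))  # one shared copy, loaded once
```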
[jira] [Updated] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jepson updated SPARK-22968: --- Component/s: (was: Structured Streaming) Spark Core > java.lang.IllegalStateException: No current assignment for partition kssh-2 > --- > > Key: SPARK-22968 > URL: https://issues.apache.org/jira/browse/SPARK-22968 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 > Environment: Kafka: 0.10.0 (CDH5.12.0) > Apache Spark 2.1.1 > Spark streaming+Kafka >Reporter: Jepson > Original Estimate: 96h > Remaining Estimate: 96h > > *Kafka Broker:* > {code:java} >message.max.bytes : 2621440 > {code} > *Spark Streaming+Kafka Code:* > {code:java} > , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: > 1048576 > , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6 > , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3 > , "heartbeat.interval.ms" -> (5000: java.lang.Integer) > , "receive.buffer.bytes" -> (10485760: java.lang.Integer) > {code} > *Error message:* > {code:java} > 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously > assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, > kssh-1] for group use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined > group use_a_separate_group_id_for_each_stream with generation 4 > 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned > partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for > time 1515116907000 ms > java.lang.IllegalStateException: No current assignment for partition kssh-2 > at > 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.streaming.
[jira] [Commented] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315552#comment-16315552 ] Jepson commented on SPARK-22968: [~srowen] Thanks for the quick response. I turned the parameter "max.partition.fetch.bytes" up from 5242880 to 10485760. I'll look at it again in the coming days. > java.lang.IllegalStateException: No current assignment for partition kssh-2 > --- > > Key: SPARK-22968 > URL: https://issues.apache.org/jira/browse/SPARK-22968 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Kafka: 0.10.0 (CDH5.12.0) > Apache Spark 2.1.1 > Spark streaming+Kafka >Reporter: Jepson > Original Estimate: 96h > Remaining Estimate: 96h > > *Kafka Broker:* > {code:java} >message.max.bytes : 2621440 > {code} > *Spark Streaming+Kafka Code:* > {code:java} > , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: > 1048576 > , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6 > , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3 > , "heartbeat.interval.ms" -> (5000: java.lang.Integer) > , "receive.buffer.bytes" -> (10485760: java.lang.Integer) > {code} > *Error message:* > {code:java} > 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously > assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, > kssh-1] for group use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined > group use_a_separate_group_id_for_each_stream with generation 4 > 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned > partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for > time 
1515116907000 ms > java.lang.IllegalStateException: No current assignment for partition kssh-2 > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) > at > org.apache.spark.streaming.sched
[jira] [Commented] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()
[ https://issues.apache.org/jira/browse/SPARK-22967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315547#comment-16315547 ] Hyukjin Kwon commented on SPARK-22967: -- Ah, I meant to fix the tests to use URI forms instead of Windows file paths if they are not specific to testing the path forms themselves. > VersionSuite failed on Windows caused by unescapeSQLString() > > > Key: SPARK-22967 > URL: https://issues.apache.org/jira/browse/SPARK-22967 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Windows 7 >Reporter: wuyi >Priority: Minor > Labels: build, test, windows > > On Windows systems, two unit test cases fail while running VersionSuite > ("A simple set of tests that call the methods of a `HiveClient`, loading > different version of hive from maven central.") > Failed A : test(s"$version: read avro file containing decimal") > {code:java} > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.lang.IllegalArgumentException: Can not create a > Path from an empty string); > {code} > Failed B: test(s"$version: SPARK-17920: Insert into/overwrite avro table") > {code:java} > Unable to infer the schema. The schema specification is required to create > the table `default`.`tab2`.; > org.apache.spark.sql.AnalysisException: Unable to infer the schema. The > schema specification is required to create the table `default`.`tab2`.; > {code} > As I dug into this problem, I found it is related to > ParserUtils#unescapeSQLString(). 
> These are two lines at the beginning of Failed A: > {code:java} > val url = > Thread.currentThread().getContextClassLoader.getResource("avroDecimal") > val location = new File(url.getFile) > {code} > And in my environment, `location` (path value) is > {code:java} > D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal > {code} > And then, in SparkSqlParser#visitCreateHiveTable()#L1128: > {code:java} > val location = Option(ctx.locationSpec).map(visitLocationSpec) > {code} > This line wants to get the LocationSpecContext's content first, which is equal to > `location` above. > Then, the content is passed to visitLocationSpec(), and passed to > unescapeSQLString() > finally. > Let's have a look at unescapeSQLString(): > {code:java} > /** Unescape backslash-escaped string enclosed by quotes. */ > def unescapeSQLString(b: String): String = { > var enclosure: Character = null > val sb = new StringBuilder(b.length()) > def appendEscapedChar(n: Char) { > n match { > case '0' => sb.append('\u0000') > case '\'' => sb.append('\'') > case '"' => sb.append('\"') > case 'b' => sb.append('\b') > case 'n' => sb.append('\n') > case 'r' => sb.append('\r') > case 't' => sb.append('\t') > case 'Z' => sb.append('\u001A') > case '\\' => sb.append('\\') > // The following 2 lines are exactly what MySQL does TODO: why do we > do this? > case '%' => sb.append("\\%") > case '_' => sb.append("\\_") > case _ => sb.append(n) > } > } > var i = 0 > val strLength = b.length > while (i < strLength) { > val currentChar = b.charAt(i) > if (enclosure == null) { > if (currentChar == '\'' || currentChar == '\"') { > enclosure = currentChar > } > } else if (enclosure == currentChar) { > enclosure = null > } else if (currentChar == '\\') { > if ((i + 6 < strLength) && b.charAt(i + 1) == 'u') { > // \u0000 style character literals. 
> val base = i + 2 > val code = (0 until 4).foldLeft(0) { (mid, j) => > val digit = Character.digit(b.charAt(j + base), 16) > (mid << 4) + digit > } > sb.append(code.asInstanceOf[Char]) > i += 5 > } else if (i + 4 < strLength) { > // \000 style character literals. > val i1 = b.charAt(i + 1) > val i2 = b.charAt(i + 2) > val i3 = b.charAt(i + 3) > if ((i1 >= '0' && i1 <= '1') && (i2 >= '0' && i2 <= '7') && (i3 >= > '0' && i3 <= '7')) { > val tmp = ((i3 - '0') + ((i2 - '0') << 3) + ((i1 - '0') << > 6)).asInstanceOf[Char] > sb.append(tmp) > i += 3 > } else { > appendEscapedChar(i1) > i += 1 > } > } else if (i + 2 < strLength) { > // escaped character literals. > val n = b.charAt(i + 1) > appendEscapedChar(n) > i += 1 > } > } else { > // non-escaped character literals. > sb.append(currentChar) > } > i += 1 > } > sb.toString(
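To see why routing a raw Windows path through this routine corrupts it, a stripped-down model of the backslash handling is enough: sequences like `\t` inside `D:\workspace\target` become control characters. The function below is an approximation for illustration, not the Scala code quoted above (the `\u` and octal branches are omitted).

```python
# Simplified model of unescapeSQLString's single-character escape handling.
# Feeding it a raw Windows path shows the corruption the test suite hits.
_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "b": "\b",
            "'": "'", '"': '"', "\\": "\\"}

def unescape(s: str) -> str:
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            nxt = s[i + 1]
            # unknown escapes keep only the char, dropping the backslash
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

path = r"D:\workspace\target\test-classes\avroDecimal"
mangled = unescape(path)
print(repr(mangled))  # backslashes are gone; \t became tab characters
```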
[jira] [Assigned] (SPARK-22985) Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen
[ https://issues.apache.org/jira/browse/SPARK-22985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22985: Assignee: Apache Spark (was: Josh Rosen) > Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen > -- > > Key: SPARK-22985 > URL: https://issues.apache.org/jira/browse/SPARK-22985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > The from_utc_timestamp and to_utc_timestamp expressions do not properly > escape their timezone argument in codegen, leading to compilation errors > instead of analysis errors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22985) Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen
[ https://issues.apache.org/jira/browse/SPARK-22985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22985: Assignee: Josh Rosen (was: Apache Spark) > Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen > -- > > Key: SPARK-22985 > URL: https://issues.apache.org/jira/browse/SPARK-22985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The from_utc_timestamp and to_utc_timestamp expressions do not properly > escape their timezone argument in codegen, leading to compilation errors > instead of analysis errors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22985) Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen
[ https://issues.apache.org/jira/browse/SPARK-22985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315530#comment-16315530 ] Apache Spark commented on SPARK-22985: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/20182 > Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen > -- > > Key: SPARK-22985 > URL: https://issues.apache.org/jira/browse/SPARK-22985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The from_utc_timestamp and to_utc_timestamp expressions do not properly > escape their timezone argument in codegen, leading to compilation errors > instead of analysis errors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22985) Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen
Josh Rosen created SPARK-22985: -- Summary: Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen Key: SPARK-22985 URL: https://issues.apache.org/jira/browse/SPARK-22985 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.1.0, 2.0.0, 2.3.0 Reporter: Josh Rosen Assignee: Josh Rosen The from_utc_timestamp and to_utc_timestamp expressions do not properly escape their timezone argument in codegen, leading to compilation errors instead of analysis errors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22984) Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-22984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22984: Assignee: Josh Rosen (was: Apache Spark) > Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner > --- > > Key: SPARK-22984 > URL: https://issues.apache.org/jira/browse/SPARK-22984 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Labels: correctness > > The following query returns an incorrect answer: > {code} > set spark.sql.autoBroadcastJoinThreshold=-1; > create table a as select * from values 1; > create table b as select * from values 2; > SELECT > t3.col1, > t1.col1 > FROM a t1 > CROSS JOIN b t2 > CROSS JOIN b t3 > {code} > This should return the row {{2, 1}} but instead it returns {{null, 1}}. If > you permute the order of the columns in the select statement or the order of > the joins then it returns a valid answer (i.e. one without incorrect NULLs). > This turns out to be due to two longstanding bugs in GenerateUnsafeRowJoiner, > which I'll describe in more detail in my PR. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22984) Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-22984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315520#comment-16315520 ] Apache Spark commented on SPARK-22984: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/20181 > Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner > --- > > Key: SPARK-22984 > URL: https://issues.apache.org/jira/browse/SPARK-22984 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Labels: correctness > > The following query returns an incorrect answer: > {code} > set spark.sql.autoBroadcastJoinThreshold=-1; > create table a as select * from values 1; > create table b as select * from values 2; > SELECT > t3.col1, > t1.col1 > FROM a t1 > CROSS JOIN b t2 > CROSS JOIN b t3 > {code} > This should return the row {{2, 1}} but instead it returns {{null, 1}}. If > you permute the order of the columns in the select statement or the order of > the joins then it returns a valid answer (i.e. one without incorrect NULLs). > This turns out to be due to two longstanding bugs in GenerateUnsafeRowJoiner, > which I'll describe in more detail in my PR. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22984) Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-22984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22984: Assignee: Apache Spark (was: Josh Rosen) > Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner > --- > > Key: SPARK-22984 > URL: https://issues.apache.org/jira/browse/SPARK-22984 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Critical > Labels: correctness > > The following query returns an incorrect answer: > {code} > set spark.sql.autoBroadcastJoinThreshold=-1; > create table a as select * from values 1; > create table b as select * from values 2; > SELECT > t3.col1, > t1.col1 > FROM a t1 > CROSS JOIN b t2 > CROSS JOIN b t3 > {code} > This should return the row {{2, 1}} but instead it returns {{null, 1}}. If > you permute the order of the columns in the select statement or the order of > the joins then it returns a valid answer (i.e. one without incorrect NULLs). > This turns out to be due to two longstanding bugs in GenerateUnsafeRowJoiner, > which I'll describe in more detail in my PR. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22984) Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner
Josh Rosen created SPARK-22984: -- Summary: Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner Key: SPARK-22984 URL: https://issues.apache.org/jira/browse/SPARK-22984 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.1.0, 2.0.0, 1.6.0, 1.5.0, 2.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical The following query returns an incorrect answer: {code} set spark.sql.autoBroadcastJoinThreshold=-1; create table a as select * from values 1; create table b as select * from values 2; SELECT t3.col1, t1.col1 FROM a t1 CROSS JOIN b t2 CROSS JOIN b t3 {code} This should return the row {{2, 1}} but instead it returns {{null, 1}}. If you permute the order of the columns in the select statement or the order of the joins then it returns a valid answer (i.e. one without incorrect NULLs). This turns out to be due to two longstanding bugs in GenerateUnsafeRowJoiner, which I'll describe in more detail in my PR. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
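The bug report above concerns null-bitmap handling when two UnsafeRows are concatenated. As an illustrative sketch (not Spark's actual Java codegen), the core invariant is that the second row's null bitmap must be shifted by the first row's field count before being OR-ed in; skipping the shift marks the wrong fields as null, which is exactly the kind of error that turns a valid value like `2` into an incorrect NULL:

```python
# Illustrative sketch (not Spark's GenerateUnsafeRowJoiner code): combining
# the per-field null bitmaps of two rows being joined. Bit i set means field i
# of the joined row is null.

def join_null_bitmaps(bitmap1: int, num_fields1: int, bitmap2: int) -> int:
    # Row 2's bits must be shifted past row 1's fields before OR-ing.
    return bitmap1 | (bitmap2 << num_fields1)

# row1 = (1,) with no nulls; row2 = (None, 2), so row2's bit 0 is set.
joined = join_null_bitmaps(0b0, 1, 0b01)
# Correct: in the 3-field joined row, only field 1 (row2's first field) is null.
assert joined == 0b010

# A buggy joiner that forgets the shift corrupts the result:
buggy = 0b0 | 0b01  # marks field 0 (row1's valid value!) as null instead
assert buggy == 0b001
```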
[jira] [Assigned] (SPARK-22983) Don't push filters beneath aggregates with empty grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-22983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22983: Assignee: Apache Spark (was: Josh Rosen) > Don't push filters beneath aggregates with empty grouping expressions > - > > Key: SPARK-22983 > URL: https://issues.apache.org/jira/browse/SPARK-22983 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Critical > Labels: correctness > > The following SQL query should return zero rows, but in Spark it actually > returns one row: > {code} > SELECT 1 from ( > SELECT 1 AS z, > MIN(a.x) > FROM (select 1 as x) a > WHERE false > ) b > where b.z != b.z > {code} > The problem stems from the `PushDownPredicate` rule: when this rule > encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, > it removes the original filter and adds a new filter onto Aggregate's child, > e.g. `Agg(Filter(...))`. This is often okay, but the case above is a > counterexample: because there is no explicit `GROUP BY`, we are implicitly > computing a global aggregate over the entire table so the original filter was > not acting like a `HAVING` clause filtering the number of groups: if we push > this filter then it fails to actually reduce the cardinality of the Aggregate > output, leading to the wrong answer. > A simple fix is to never push down filters beneath aggregates when there are > no grouping expressions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22983) Don't push filters beneath aggregates with empty grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-22983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22983: Assignee: Josh Rosen (was: Apache Spark) > Don't push filters beneath aggregates with empty grouping expressions > - > > Key: SPARK-22983 > URL: https://issues.apache.org/jira/browse/SPARK-22983 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Labels: correctness > > The following SQL query should return zero rows, but in Spark it actually > returns one row: > {code} > SELECT 1 from ( > SELECT 1 AS z, > MIN(a.x) > FROM (select 1 as x) a > WHERE false > ) b > where b.z != b.z > {code} > The problem stems from the `PushDownPredicate` rule: when this rule > encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, > it removes the original filter and adds a new filter onto Aggregate's child, > e.g. `Agg(Filter(...))`. This is often okay, but the case above is a > counterexample: because there is no explicit `GROUP BY`, we are implicitly > computing a global aggregate over the entire table so the original filter was > not acting like a `HAVING` clause filtering the number of groups: if we push > this filter then it fails to actually reduce the cardinality of the Aggregate > output, leading to the wrong answer. > A simple fix is to never push down filters beneath aggregates when there are > no grouping expressions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22983) Don't push filters beneath aggregates with empty grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-22983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315517#comment-16315517 ] Apache Spark commented on SPARK-22983: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/20180 > Don't push filters beneath aggregates with empty grouping expressions > - > > Key: SPARK-22983 > URL: https://issues.apache.org/jira/browse/SPARK-22983 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Labels: correctness > > The following SQL query should return zero rows, but in Spark it actually > returns one row: > {code} > SELECT 1 from ( > SELECT 1 AS z, > MIN(a.x) > FROM (select 1 as x) a > WHERE false > ) b > where b.z != b.z > {code} > The problem stems from the `PushDownPredicate` rule: when this rule > encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, > it removes the original filter and adds a new filter onto Aggregate's child, > e.g. `Agg(Filter(...))`. This is often okay, but the case above is a > counterexample: because there is no explicit `GROUP BY`, we are implicitly > computing a global aggregate over the entire table so the original filter was > not acting like a `HAVING` clause filtering the number of groups: if we push > this filter then it fails to actually reduce the cardinality of the Aggregate > output, leading to the wrong answer. > A simple fix is to never push down filters beneath aggregates when there are > no grouping expressions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22983) Don't push filters beneath aggregates with empty grouping expressions
Josh Rosen created SPARK-22983: -- Summary: Don't push filters beneath aggregates with empty grouping expressions Key: SPARK-22983 URL: https://issues.apache.org/jira/browse/SPARK-22983 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.1.0, 2.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical The following SQL query should return zero rows, but in Spark it actually returns one row: {code} SELECT 1 from ( SELECT 1 AS z, MIN(a.x) FROM (select 1 as x) a WHERE false ) b where b.z != b.z {code} The problem stems from the `PushDownPredicate` rule: when this rule encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes the original filter and adds a new filter onto Aggregate's child, e.g. `Agg(Filter(...))`. This is often okay, but the case above is a counterexample: because there is no explicit `GROUP BY`, we are implicitly computing a global aggregate over the entire table so the original filter was not acting like a `HAVING` clause filtering the number of groups: if we push this filter then it fails to actually reduce the cardinality of the Aggregate output, leading to the wrong answer. A simple fix is to never push down filters beneath aggregates when there are no grouping expressions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
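The cardinality argument in the description can be sketched in plain Python (this is an illustrative model, not Spark's optimizer code): a global aggregate with no GROUP BY always emits exactly one row, even over zero input rows, so a filter placed above it behaves differently from the same filter pushed below it:

```python
# Illustrative sketch (not Spark's PushDownPredicate rule): a global aggregate
# emits one output row regardless of input cardinality, so pushing a filter
# beneath it cannot reduce the one-row output and changes the query's answer.

def global_min(rows):
    # Global aggregate (no grouping): exactly one output row, always.
    values = [r["x"] for r in rows]
    return [{"z": 1, "min_x": min(values) if values else None}]

rows = [{"x": 1}]
pred = lambda out: out["z"] != out["z"]  # always false, like b.z != b.z

# Filter on top of the aggregate: correctly returns zero rows.
correct = [out for out in global_min(rows) if pred(out)]
assert correct == []

# Filter pushed beneath the aggregate: the input is emptied, but the
# aggregate still emits its single global row, so one row survives.
pushed = global_min([r for r in rows if False])
assert len(pushed) == 1
```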
[jira] [Assigned] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22982: Assignee: Apache Spark (was: Josh Rosen) > Remove unsafe asynchronous close() call from FileDownloadChannel > > > Key: SPARK-22982 > URL: https://issues.apache.org/jira/browse/SPARK-22982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Blocker > Labels: correctness > > Spark's Netty-based file transfer code contains an asynchronous IO bug which > may lead to incorrect query results. > At a high-level, the problem is that an unsafe asynchronous `close()` of a > pipe's source channel creates a race condition where file transfer code > closes a file descriptor then attempts to read from it. If the closed file > descriptor's number has been reused by an `open()` call then this invalid > read may cause unrelated file operations to return incorrect results due to > reading different data than intended. > I have a small, surgical fix for this bug and will submit a PR with more > description on the specific race condition / underlying bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22982: Assignee: Josh Rosen (was: Apache Spark) > Remove unsafe asynchronous close() call from FileDownloadChannel > > > Key: SPARK-22982 > URL: https://issues.apache.org/jira/browse/SPARK-22982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Labels: correctness > > Spark's Netty-based file transfer code contains an asynchronous IO bug which > may lead to incorrect query results. > At a high-level, the problem is that an unsafe asynchronous `close()` of a > pipe's source channel creates a race condition where file transfer code > closes a file descriptor then attempts to read from it. If the closed file > descriptor's number has been reused by an `open()` call then this invalid > read may cause unrelated file operations to return incorrect results due to > reading different data than intended. > I have a small, surgical fix for this bug and will submit a PR with more > description on the specific race condition / underlying bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315514#comment-16315514 ] Apache Spark commented on SPARK-22982: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/20179 > Remove unsafe asynchronous close() call from FileDownloadChannel > > > Key: SPARK-22982 > URL: https://issues.apache.org/jira/browse/SPARK-22982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Labels: correctness > > Spark's Netty-based file transfer code contains an asynchronous IO bug which > may lead to incorrect query results. > At a high-level, the problem is that an unsafe asynchronous `close()` of a > pipe's source channel creates a race condition where file transfer code > closes a file descriptor then attempts to read from it. If the closed file > descriptor's number has been reused by an `open()` call then this invalid > read may cause unrelated file operations to return incorrect results due to > reading different data than intended. > I have a small, surgical fix for this bug and will submit a PR with more > description on the specific race condition / underlying bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
Josh Rosen created SPARK-22982: -- Summary: Remove unsafe asynchronous close() call from FileDownloadChannel Key: SPARK-22982 URL: https://issues.apache.org/jira/browse/SPARK-22982 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0, 2.1.0, 2.0.0, 1.6.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Spark's Netty-based file transfer code contains an asynchronous IO bug which may lead to incorrect query results. At a high level, the problem is that an unsafe asynchronous `close()` of a pipe's source channel creates a race condition where file transfer code closes a file descriptor and then attempts to read from it. If the closed file descriptor's number has been reused by an `open()` call, then this invalid read may cause unrelated file operations to return incorrect results due to reading different data than intended. I have a small, surgical fix for this bug and will submit a PR with more description on the specific race condition / underlying bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
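The descriptor-reuse hazard described above can be demonstrated outside Spark. This is a sketch of the underlying POSIX behavior, not of Spark's Netty code: `open()` returns the lowest-numbered free descriptor, so a number freed by one close is typically handed out again immediately, and a stale read on the old number silently reads the wrong file:

```python
# Illustrative sketch of the POSIX file-descriptor-reuse hazard (not Spark's
# FileDownloadChannel code). On POSIX systems open() returns the lowest free
# descriptor number, so a closed descriptor's number is usually reused by the
# very next open(); a reader still holding the old number then reads the
# wrong file's bytes.
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path_a = os.path.join(d, "a")
    path_b = os.path.join(d, "b")
    with open(path_a, "wb") as f:
        f.write(b"AAAA")
    with open(path_b, "wb") as f:
        f.write(b"BBBB")

    fd = os.open(path_a, os.O_RDONLY)
    os.close(fd)                       # the "asynchronous" close
    fd2 = os.open(path_b, os.O_RDONLY) # unrelated open() reuses the number
    assert fd2 == fd
    # A stale read on the old number now returns B's data, not A's.
    assert os.read(fd, 4) == b"BBBB"
    os.close(fd2)
```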
[jira] [Resolved] (SPARK-22980) Wrong answer when using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22980. -- Resolution: Not A Problem Please reopen this if I misunderstood. I am acting quickly because it's set to a blocker. > Wrong answer when using pandas_udf > -- > > Key: SPARK-22980 > URL: https://issues.apache.org/jira/browse/SPARK-22980 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Blocker > > {noformat} > from pyspark.sql.functions import pandas_udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = pandas_udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > {noformat} > from pyspark.sql.functions import udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > The results of pandas_udf are different from udf. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
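A minimal sketch of the likely reason the two UDFs differ (consistent with the "Not A Problem" resolution): a pandas_udf is vectorized, so for `lit('text')` it receives a whole column batch (a pandas Series containing `'text'` once per row) and `len(x)` is the batch's row count, whereas a plain udf receives one Python string per row and `len(x)` is the string length, 4. A plain list stands in for the pandas Series here to keep the sketch dependency-free:

```python
# Vectorized pandas_udf: x is a batch of values, so len(x) counts rows.
batch = ["text"] * 3   # stand-in for the pandas Series the pandas_udf sees
assert len(batch) == 3

# Row-at-a-time udf: x is a single Python string, so len(x) is 4.
value = "text"
assert len(value) == 4
```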
[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",
[ https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315438#comment-16315438 ] Felix Cheung commented on SPARK-22918: -- [~sameerag] we might want to check this for 2.3.0 release > sbt test (spark - local) fail after upgrading to 2.2.1 with: > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > > > Key: SPARK-22918 > URL: https://issues.apache.org/jira/browse/SPARK-22918 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Damian Momot > > After upgrading 2.2.0 -> 2.2.1 sbt test command in one of my projects started > to fail with following exception: > {noformat} > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > at > java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) > at > java.security.AccessController.checkPermission(AccessController.java:884) > at > org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown > Source) > at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown > Source) > at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at 
java.lang.Class.newInstance(Class.java:442) > at > org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47) > at > org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > 
org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) > at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java
[jira] [Commented] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone
[ https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315436#comment-16315436 ] Felix Cheung commented on SPARK-22632: -- yes, first I'd agree we should generalize this to R & Python second, I think in general the different treatment of timezone between language and Spark has been a source of confusion (has been reported at least a few times) lastly, this isn't a regression AFAIK, so not necessarily a blocker for 2.3, although might be very good to have. > Fix the behavior of timestamp values for R's DataFrame to respect session > timezone > -- > > Key: SPARK-22632 > URL: https://issues.apache.org/jira/browse/SPARK-22632 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > Note: wording is borrowed from SPARK-22395. Symptom is similar and I think > that JIRA is well descriptive. > When converting R's DataFrame from/to Spark DataFrame using > {{createDataFrame}} or {{collect}}, timestamp values behave to respect R > system timezone instead of session timezone. > For example, let's say we use "America/Los_Angeles" as session timezone and > have a timestamp value "1970-01-01 00:00:01" in the timezone. Btw, I'm in > South Korea so R timezone would be "KST". > The timestamp value from current collect() will be the following: > {code} > > sparkR.session(master = "local[*]", sparkConfig = > > list(spark.sql.session.timeZone = "America/Los_Angeles")) > > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts")) >ts > 1 1970-01-01 00:00:01 > > collect(sql("SELECT cast(28801 as timestamp) as ts")) >ts > 1 1970-01-01 17:00:01 > {code} > As you can see, the value becomes "1970-01-01 17:00:01" because it respects R > system timezone. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
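The arithmetic behind the reported discrepancy can be checked with the standard library (a sketch assuming standard tz-database rules, with `Asia/Seoul` standing in for the reporter's KST system zone): 28801 seconds after the epoch is 1970-01-01 08:00:01 UTC; rendered in the session timezone America/Los_Angeles (UTC-8 in January) that is 00:00:01, but rendered in KST (UTC+9) it is 17:00:01, which is exactly what `collect()` shows:

```python
# Sketch of the timezone arithmetic in the report. Assumes a tz database is
# available (zoneinfo, Python 3.9+); Asia/Seoul stands in for the reporter's
# KST system timezone.
from datetime import datetime
from zoneinfo import ZoneInfo

la = datetime.fromtimestamp(28801, tz=ZoneInfo("America/Los_Angeles"))
assert la.strftime("%Y-%m-%d %H:%M:%S") == "1970-01-01 00:00:01"

kst = datetime.fromtimestamp(28801, tz=ZoneInfo("Asia/Seoul"))
assert kst.strftime("%Y-%m-%d %H:%M:%S") == "1970-01-01 17:00:01"
```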
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315434#comment-16315434 ] Felix Cheung commented on SPARK-21727: -- I think we should use is.atomic(object) ? > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil Alexander McQuarrie >Assignee: Neil Alexander McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... 
> > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22952) Deprecate stageAttemptId in favour of stageAttemptNumber
[ https://issues.apache.org/jira/browse/SPARK-22952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22952: Assignee: Apache Spark > Deprecate stageAttemptId in favour of stageAttemptNumber > > > Key: SPARK-22952 > URL: https://issues.apache.org/jira/browse/SPARK-22952 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.2, 2.2.1 >Reporter: Xianjin YE >Assignee: Apache Spark >Priority: Minor > > As discussed in [PR-20082|https://github.com/apache/spark/pull/20082] for > SPARK-22897, we prefer stageAttemptNumber over stageAttemptId. > This is the followup to deprecate stageAttemptId which will make public APIs > more consistent. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22952) Deprecate stageAttemptId in favour of stageAttemptNumber
[ https://issues.apache.org/jira/browse/SPARK-22952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22952: Assignee: (was: Apache Spark) > Deprecate stageAttemptId in favour of stageAttemptNumber > > > Key: SPARK-22952 > URL: https://issues.apache.org/jira/browse/SPARK-22952 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.2, 2.2.1 >Reporter: Xianjin YE >Priority: Minor > > As discussed in [PR-20082|https://github.com/apache/spark/pull/20082] for > SPARK-22897, we prefer stageAttemptNumber over stageAttemptId. > This is the followup to deprecate stageAttemptId which will make public APIs > more consistent. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22952) Deprecate stageAttemptId in favour of stageAttemptNumber
[ https://issues.apache.org/jira/browse/SPARK-22952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315356#comment-16315356 ] Apache Spark commented on SPARK-22952: -- User 'advancedxy' has created a pull request for this issue: https://github.com/apache/spark/pull/20178 > Deprecate stageAttemptId in favour of stageAttemptNumber > > > Key: SPARK-22952 > URL: https://issues.apache.org/jira/browse/SPARK-22952 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.2, 2.2.1 >Reporter: Xianjin YE >Priority: Minor > > As discussed in [PR-20082|https://github.com/apache/spark/pull/20082] for > SPARK-22897, we prefer stageAttemptNumber over stageAttemptId. > This is the followup to deprecate stageAttemptId which will make public APIs > more consistent. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22954) ANALYZE TABLE fails with NoSuchTableException for temporary tables (but should have reported "not supported on views")
[ https://issues.apache.org/jira/browse/SPARK-22954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315336#comment-16315336 ] Suchith J N commented on SPARK-22954: - I have opened a pull request. > ANALYZE TABLE fails with NoSuchTableException for temporary tables (but > should have reported "not supported on views") > -- > > Key: SPARK-22954 > URL: https://issues.apache.org/jira/browse/SPARK-22954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: {code} > $ ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_152 > Branch master > Compiled by user jacek on 2018-01-04T05:44:05Z > Revision 7d045c5f00e2c7c67011830e2169a4e130c3ace8 > {code} >Reporter: Jacek Laskowski >Priority: Minor > > {{ANALYZE TABLE}} fails with {{NoSuchTableException: Table or view 'names' > not found in database 'default';}} for temporary tables (views) while the > reason is that it can only work with permanent tables (which [it can > report|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala#L38] > if it had a chance). 
> {code} > scala> names.createOrReplaceTempView("names") > scala> sql("ANALYZE TABLE names COMPUTE STATISTICS") > org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view > 'names' not found in database 'default'; > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:181) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:398) > at > org.apache.spark.sql.execution.command.AnalyzeTableCommand.run(AnalyzeTableCommand.scala:36) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:187) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:187) > at org.apache.spark.sql.Dataset$$anonfun$51.apply(Dataset.scala:3244) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3243) > at org.apache.spark.sql.Dataset.(Dataset.scala:187) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:72) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638) > ... 50 elided > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22954) ANALYZE TABLE fails with NoSuchTableException for temporary tables (but should have reported "not supported on views")
[ https://issues.apache.org/jira/browse/SPARK-22954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315334#comment-16315334 ] Apache Spark commented on SPARK-22954: -- User 'suchithjn225' has created a pull request for this issue: https://github.com/apache/spark/pull/20177
[jira] [Assigned] (SPARK-22954) ANALYZE TABLE fails with NoSuchTableException for temporary tables (but should have reported "not supported on views")
[ https://issues.apache.org/jira/browse/SPARK-22954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22954: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-22954) ANALYZE TABLE fails with NoSuchTableException for temporary tables (but should have reported "not supported on views")
[ https://issues.apache.org/jira/browse/SPARK-22954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22954: Assignee: Apache Spark
[jira] [Commented] (SPARK-22980) Wrong answer when using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315217#comment-16315217 ] Hyukjin Kwon commented on SPARK-22980: -- I think that's because a pandas UDF receives each input as a pandas Series rather than a scalar. > Wrong answer when using pandas_udf > -- > > Key: SPARK-22980 > URL: https://issues.apache.org/jira/browse/SPARK-22980 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Blocker > > {noformat} > from pyspark.sql.functions import pandas_udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = pandas_udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > {noformat} > from pyspark.sql.functions import udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > The results of pandas_udf are different from udf.
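The discrepancy in the two snippets above can be illustrated outside Spark (this is a sketch of the calling convention, not Spark's internals): a plain Python UDF is invoked once per row, so `x` is the scalar string `'text'` and `len(x)` is 4; a pandas UDF is invoked once per batch, so `x` is a `pandas.Series` holding `'text'` repeated for every row, and `len(x)` is the batch size.

```python
import pandas as pd

batch_size = 3                                # number of rows in the batch

x_scalar = "text"                             # what udf sees, per row
x_series = pd.Series(["text"] * batch_size)   # what pandas_udf sees, per batch
y = pd.Series(range(batch_size))              # the 'id' column

# Plain UDF: len() measures the string, applied row by row.
plain_udf_result = [len(x_scalar) + i for i in range(batch_size)]

# pandas UDF: len() measures the Series, i.e. the batch size.
pandas_udf_result = (len(x_series) + y).tolist()

print(plain_udf_result)   # [4, 5, 6]
print(pandas_udf_result)  # [3, 4, 5] -- len() saw the Series, not the string
```

So the pandas_udf variant is not computing a wrong string length; `len(x)` is simply being applied to a different type of object.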