[jira] [Assigned] (SPARK-27384) File source V2: Prune unnecessary partition columns

2019-04-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27384:
---

Assignee: Gengliang Wang

> File source V2: Prune unnecessary partition columns
> ---
>
> Key: SPARK-27384
> URL: https://issues.apache.org/jira/browse/SPARK-27384
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> When scanning file sources, we can prune unnecessary partition columns while 
> constructing input partitions, so that:
> 1. Less data is transferred from the Driver to the Executors.
> 2. Columnar batch readers become easier to implement, since the partition 
> columns are already pruned.
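
As a hedged illustration of the intended benefit (the path and column names below are made up, and a SparkSession named spark is assumed): a query that never references the partition column should not need the partition values shipped with every input partition once they are pruned.

{code:scala}
// Hypothetical table partitioned by `date`; the query only touches `value`,
// so the `date` partition column can be pruned when input partitions are built.
val df = spark.read.parquet("/tmp/example/partitioned_table")
df.select("value").where("value > 0").count()
{code}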






[jira] [Resolved] (SPARK-27384) File source V2: Prune unnecessary partition columns

2019-04-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27384.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24296
[https://github.com/apache/spark/pull/24296]

> File source V2: Prune unnecessary partition columns
> ---
>
> Key: SPARK-27384
> URL: https://issues.apache.org/jira/browse/SPARK-27384
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> When scanning file sources, we can prune unnecessary partition columns while 
> constructing input partitions, so that:
> 1. Less data is transferred from the Driver to the Executors.
> 2. Columnar batch readers become easier to implement, since the partition 
> columns are already pruned.






[jira] [Commented] (SPARK-27289) spark-submit explicit configuration does not take effect but Spark UI shows it's effective

2019-04-08 Thread Udbhav Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812242#comment-16812242
 ] 

Udbhav Agrawal commented on SPARK-27289:


Yes, the intermediate data is written to the spark.local.dir that is configured 
through the --conf parameter when running spark-submit; it overrides the value 
you have set in spark-defaults.conf.
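
A quick way to see which value actually won at runtime (a minimal sketch; assumes the application has a SparkSession named spark):

{code:scala}
// Prints the value Spark resolved for spark.local.dir after applying the
// documented precedence (SparkConf > spark-submit flags/--conf > spark-defaults.conf).
println(spark.sparkContext.getConf.get("spark.local.dir"))
{code}

Note also that, per the Spark configuration docs, spark.local.dir is overridden by the SPARK_LOCAL_DIRS (standalone) or LOCAL_DIRS (YARN) environment variables set by the cluster manager, which is one way the directory actually used can differ from the configured value.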

> spark-submit explicit configuration does not take effect but Spark UI shows 
> it's effective
> --
>
> Key: SPARK-27289
> URL: https://issues.apache.org/jira/browse/SPARK-27289
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Documentation, Spark Submit, Web UI
>Affects Versions: 2.3.3
>Reporter: KaiXu
>Priority: Minor
> Attachments: Capture.PNG
>
>
> The [doc|https://spark.apache.org/docs/latest/submitting-applications.html] says that 
> "In general, configuration values explicitly set on a {{SparkConf}} take the 
> highest precedence, then flags passed to {{spark-submit}}, then values in the 
> defaults file". However, when setting spark.local.dir through --conf with 
> spark-submit, Spark still uses the value from 
> ${SPARK_HOME}/conf/spark-defaults.conf. What's more, the Environment page of the 
> Spark UI shows the value from --conf, which is really misleading.
> For example, I submitted my application with the command:
> /opt/spark233/bin/spark-submit --properties-file /opt/spark.conf --conf 
> spark.local.dir=/tmp/spark_local -v --class 
> org.apache.spark.examples.mllib.SparseNaiveBayes --master 
> spark://bdw-slave20:7077 
> /opt/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar 
> hdfs://bdw-slave20:8020/Bayes/Input
>  
> The spark.local.dir in ${SPARK_HOME}/conf/spark-defaults.conf is:
> spark.local.dir=/mnt/nvme1/spark_local
> While the application is running, I found the intermediate shuffle data was 
> written to /mnt/nvme1/spark_local, which is the value set in 
> ${SPARK_HOME}/conf/spark-defaults.conf, but the Web UI shows the 
> environment value spark.local.dir=/tmp/spark_local.
> The spark-submit verbose output also shows spark.local.dir=/tmp/spark_local, which 
> is misleading.
>  
> !image-2019-03-27-10-59-38-377.png!
> spark-submit verbose:
> 
> Spark properties used, including those specified through
>  --conf and those from the properties file /opt/spark.conf:
>  (spark.local.dir,/tmp/spark_local)
>  (spark.default.parallelism,132)
>  (spark.driver.memory,10g)
>  (spark.executor.memory,352g)
> X






[jira] [Commented] (SPARK-27406) UnsafeArrayData serialization breaks when two machines have different Oops size

2019-04-08 Thread Sandeep Katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812302#comment-16812302
 ] 

Sandeep Katta commented on SPARK-27406:
---

[~pengbo] thanks for raising this issue, soon I will raise PR for this

> UnsafeArrayData serialization breaks when two machines have different Oops 
> size
> ---
>
> Key: SPARK-27406
> URL: https://issues.apache.org/jira/browse/SPARK-27406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: peng bo
>Priority: Major
>
> ApproxCountDistinctForIntervals holds the UnsafeArrayData data to initialize 
> endpoints. When the UnsafeArrayData is serialized with Java serialization, 
> the BYTE_ARRAY_OFFSET in memory can change if two machines have different 
> pointer width (Oops in JVM).
> It's similar to SPARK-10914.
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints$lzycompute(ApproxCountDistinctForIntervals.scala:69)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints(ApproxCountDistinctForIntervals.scala:66)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray$lzycompute(ApproxCountDistinctForIntervals.scala:94)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray(ApproxCountDistinctForIntervals.scala:93)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp$lzycompute(ApproxCountDistinctForIntervals.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp(ApproxCountDistinctForIntervals.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords$lzycompute(ApproxCountDistinctForIntervals.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords(ApproxCountDistinctForIntervals.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.initialize(interfaces.scala:528)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.initAggregationBuffer(ObjectAggregationIterator.scala:120)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.org$apache$spark$sql$execution$aggregate$ObjectAggregationIterator$$createNewAggregationBuffer(ObjectAggregationIterator.scala:112)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.getAggregationBufferByKey(ObjectA

[jira] [Commented] (SPARK-27406) UnsafeArrayData serialization breaks when two machines have different Oops size

2019-04-08 Thread peng bo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812376#comment-16812376
 ] 

peng bo commented on SPARK-27406:
-

[~sandeep.katta2007] 

Actually, I have already submitted a PR for this. Could you please review it?

https://github.com/apache/spark/pull/24317/files

> UnsafeArrayData serialization breaks when two machines have different Oops 
> size
> ---
>
> Key: SPARK-27406
> URL: https://issues.apache.org/jira/browse/SPARK-27406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: peng bo
>Priority: Major
>
> ApproxCountDistinctForIntervals holds the UnsafeArrayData data to initialize 
> endpoints. When the UnsafeArrayData is serialized with Java serialization, 
> the BYTE_ARRAY_OFFSET in memory can change if two machines have different 
> pointer width (Oops in JVM).
> It's similar to SPARK-10914.
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints$lzycompute(ApproxCountDistinctForIntervals.scala:69)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints(ApproxCountDistinctForIntervals.scala:66)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray$lzycompute(ApproxCountDistinctForIntervals.scala:94)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray(ApproxCountDistinctForIntervals.scala:93)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp$lzycompute(ApproxCountDistinctForIntervals.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp(ApproxCountDistinctForIntervals.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords$lzycompute(ApproxCountDistinctForIntervals.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords(ApproxCountDistinctForIntervals.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.initialize(interfaces.scala:528)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.initAggregationBuffer(ObjectAggregationIterator.scala:120)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.org$apache$spark$sql$execution$aggregate$ObjectAggregationIterator$$createNewAggregationBuffer(ObjectAggregationIterator.scala:112)
>   at 
> org.apache.spark.sql.execution.aggre

[jira] [Commented] (SPARK-27348) HeartbeatReceiver doesn't remove lost executors from CoarseGrainedSchedulerBackend

2019-04-08 Thread Sandeep Katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812396#comment-16812396
 ] 

Sandeep Katta commented on SPARK-27348:
---

[~zsxwing] do you have any test code or a scenario that supports your 
statement?

> HeartbeatReceiver doesn't remove lost executors from 
> CoarseGrainedSchedulerBackend
> --
>
> Key: SPARK-27348
> URL: https://issues.apache.org/jira/browse/SPARK-27348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> When a heartbeat timeout happens in HeartbeatReceiver, it doesn't remove lost 
> executors from CoarseGrainedSchedulerBackend. When a connection of an 
> executor is not gracefully shut down, CoarseGrainedSchedulerBackend may not 
> receive a disconnect event. In this case, CoarseGrainedSchedulerBackend still 
> thinks a lost executor is still alive. CoarseGrainedSchedulerBackend may ask 
> TaskScheduler to run tasks on this lost executor. This task will never finish 
> and the job will hang forever.






[jira] [Created] (SPARK-27407) File source V2: Invalidate cache data on overwrite/append

2019-04-08 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-27407:
--

 Summary: File source V2: Invalidate cache data on overwrite/append
 Key: SPARK-27407
 URL: https://issues.apache.org/jira/browse/SPARK-27407
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


File source V2 currently, and incorrectly, continues to use cached data even if the 
underlying data is overwritten. 
We should follow https://github.com/apache/spark/pull/13566 and fix it by 
invalidating and refreshing all the cached data (and the associated metadata) 
for any DataFrame that contains the given data source path.
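
For context, a hedged sketch of the reported symptom (assumes a SparkSession named spark; the path is made up, and this illustrates the bug being described rather than guaranteed behaviour):

{code:scala}
val path = "/tmp/example/orc_table"
spark.range(10).write.mode("overwrite").orc(path)

val df = spark.read.orc(path)
df.cache()
df.count()                                   // materializes the cache: 10 rows

spark.range(100).write.mode("overwrite").orc(path)
spark.read.orc(path).count()                 // may still answer from the stale cache
{code}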








[jira] [Resolved] (SPARK-25407) Spark throws a `ParquetDecodingException` when attempting to read a field from a complex type in certain cases of schema merging

2019-04-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25407.
--
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

Fixed in https://github.com/apache/spark/pull/24307

> Spark throws a `ParquetDecodingException` when attempting to read a field 
> from a complex type in certain cases of schema merging
> 
>
> Key: SPARK-25407
> URL: https://issues.apache.org/jira/browse/SPARK-25407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Michael Allman
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Spark supports merging schemata across table partitions in which one 
> partition is missing a subfield that's present in another. However, 
> attempting to select that missing field with a query that includes a 
> partition pruning predicate that filters out the partitions that include that 
> field results in a `ParquetDecodingException` when attempting to get the 
> query results.
> This bug is specifically exercised by the failing (but ignored) test case 
> [https://github.com/apache/spark/blob/f2d35427eedeacceb6edb8a51974a7e8bbb94bc2/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L125-L131].
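
For reference, a hedged sketch of the scenario described above, loosely following the ignored test case linked in the description (assumes a SparkSession named spark; paths and field names are illustrative):

{code:scala}
import spark.implicits._

val base = "/tmp/example/contacts"
// Partition p=0 lacks the nested subfield name.middle that partition p=1 has.
Seq((0, "Jane")).toDF("id", "first")
  .selectExpr("id", "named_struct('first', first) as name")
  .write.parquet(s"$base/p=0")
Seq((1, "John", "Q")).toDF("id", "first", "middle")
  .selectExpr("id", "named_struct('first', first, 'middle', middle) as name")
  .write.parquet(s"$base/p=1")

val merged = spark.read.option("mergeSchema", "true").parquet(base)
// Selecting the merged-in subfield while pruning away the only partition that
// has it is the case reported to raise ParquetDecodingException.
merged.where("p = 0").select("name.middle").show()
{code}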






[jira] [Commented] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data

2019-04-08 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812469#comment-16812469
 ] 

Wenchen Fan commented on SPARK-16548:
-

Do you have a small dataset to reproduce it?

> java.io.CharConversionException: Invalid UTF-32 character  prevents me from 
> querying my data
> 
>
> Key: SPARK-16548
> URL: https://issues.apache.org/jira/browse/SPARK-16548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Egor Pahomov
>Priority: Minor
> Fix For: 2.2.0, 2.3.0
>
>
> Basically, when I query my json data I get 
> {code}
> java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above 
> 10)  at char #192, byte #771)
>   at 
> com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
>   at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142)
> {code}
> I do not like it. If one JSON record among 100500 cannot be processed, please return 
> null; do not fail everything. I have a dirty one-line fix, and I understand how 
> to make it more reasonable. What is our position - what behaviour do we want 
> here?
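
A hedged illustration of the behaviour being asked for (not a reproduction of the UTF-32 failure itself; assumes a SparkSession named spark):

{code:scala}
import org.apache.spark.sql.functions.get_json_object
import spark.implicits._

// One good record and one malformed record: the desired behaviour is 1 and null,
// rather than the whole query failing on the bad record.
val df = Seq("""{"a": 1}""", "not valid json").toDF("raw")
df.select(get_json_object($"raw", "$.a")).show()
{code}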






[jira] [Commented] (SPARK-27364) User-facing APIs for GPU-aware scheduling

2019-04-08 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812488#comment-16812488
 ] 

Thomas Graves commented on SPARK-27364:
---

There are 3 main user-facing impacts: the TaskContext interface to fetch the 
resources, the user API to specify the GPU count, and how the executor discovers 
or is told about the GPUs. Below is more detail:

 

1) How the user gets the resources from the TaskContext and BarrierTaskContext

  For the TaskContext interface I propose we add an API like:

def getResources(): Map[String, ResourceInformation]

Where the Map key is the resource type, so examples would be "gpu", "fpga", 
etc.  "gpu" would be the only one we officially support to start with.

ResourceInformation would be a class with a name, units, count, and addresses.  
The name would be "gpu"; the units for gpu would be empty "", but for other 
resource types like memory it could be GiB or similar; the count is the number 
of them, so for gpus it would be the number allocated; and finally the addresses 
Array of strings could be whatever we want - in the gpu case it would just be 
the indexes of the gpus allocated to the task, i.e. ["0", "2", "3"]. I made this 
a string so it is very flexible as to what the address is for different 
resource types.  Now the user has to know how to interpret this, but depending on 
what you are doing with them, even the same tools have multiple ways to specify 
devices. For instance, with tensorflow you can specify CUDA_VISIBLE_DEVICES=2,3 or 
you can specify something like:
for d in ['/device:GPU:2', '/device:GPU:3']:

private val name: String,
private val units: String,
private val count: Long,
private val addresses: Array[String] = Array.empty

def getName(): String = name
def getUnits(): String = units
def getCount(): Long = count
def getAddresses(): Array[String] = addresses



2) How the user specifies the gpu resources upon application submission

Here we need multiple configs:

   a) One for the user to specify the GPUs per task. To make that config 
extensible for other resources, I propose *spark.task.resource.\{resource 
type}.count*.  This implementation would only support gpu, but it gives us 
flexibility to add more. It allows for multiple resources as well as multiple 
configs per resource. For instance, the resource type here would be gpu, but 
you could add fpga.  It also allows adding more configs besides count; you 
could, for instance, add a type config to request a certain GPU type.

   b) The user has to specify how many GPUs per executor and driver.  This one is 
a bit more complicated since it has to work with the resource managers to 
actually acquire those, but I think it makes sense to have common configs like 
we do for cores and memory. So we can have *spark.executor.resource.\{resource 
type}.count* and *spark.driver.resource.\{resource type}.count*.   This 
implementation would only support gpu.  The tricky thing here is that some of the 
resource managers already have configs for asking for GPUs.  YARN has 
{{spark.yarn.executor.resource.\{resource-type}}}; although it was added in 3.0 
and hasn't shipped yet, we can't just remove it since you could ask YARN 
for other resource types Spark doesn't know about.  On Kubernetes you have to 
request via the pod template, so I think it would be on the user to make sure 
those match. Mesos has {{spark.mesos.gpus.max}}.  So we just need to make sure 
the new configs map onto those; having the duplicate configs might make it 
a bit confusing for the user.

3) How the executor discovers or is told the GPU resources it has.

Here I think we have 2 options for the user/resource manager.  

  a) I propose we add a config *spark.\{executor, 
driver}.resource.gpu.discoverScript* to allow the user to specify a discovery 
script. This script gets run when the executor starts and the user requested 
GPUs, to discover which GPUs the executor has.   A simple example would be a 
script that just runs "nvidia-smi --query-gpu=index --format=csv,noheader" 
to get the GPU indexes for NVIDIA cards.  You could make this script super 
simple or complicated depending on your setup.

  b) Also add an option to the executor launch, *--gpuDevices*, that allows the 
resource manager to specify the indexes of the GPU devices it has.   This 
allows insecure or non-containerized resource managers like standalone mode to 
allocate GPUs per executor without having containers and isolation all 
implemented.
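
To make the proposal concrete, a hedged, self-contained sketch of the class and the kind of task-side usage described above (none of this is existing Spark API; names and shapes simply follow the proposal in this comment):

{code:scala}
// Shape of the proposed ResourceInformation class.
class ResourceInformation(
    private val name: String,
    private val units: String,
    private val count: Long,
    private val addresses: Array[String] = Array.empty) {
  def getName(): String = name
  def getUnits(): String = units
  def getCount(): Long = count
  def getAddresses(): Array[String] = addresses
}

// What task code might do with the proposed TaskContext.getResources(), shown
// here with a plain Map standing in for the real call.
val resources: Map[String, ResourceInformation] =
  Map("gpu" -> new ResourceInformation("gpu", "", 3, Array("0", "2", "3")))

resources.get("gpu").foreach { gpu =>
  // e.g. tell the ML framework which devices this task was allocated.
  println(s"CUDA_VISIBLE_DEVICES=${gpu.getAddresses().mkString(",")}")
}
{code}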

[jira] [Comment Edited] (SPARK-27364) User-facing APIs for GPU-aware scheduling

2019-04-08 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812488#comment-16812488
 ] 

Thomas Graves edited comment on SPARK-27364 at 4/8/19 3:10 PM:
---

There are 3 main user-facing impacts: the TaskContext interface to fetch the 
resources, the user API to specify the GPU count, and how the executor discovers 
or is told about the GPUs. Below is more detail:

 

1) How the user gets the resources from the TaskContext and BarrierTaskContext

  For the TaskContext interface I propose we add an API like:

def getResources(): Map[String, ResourceInformation]

Where the Map key is the resource type, so examples would be "gpu", "fpga", 
etc.  "gpu" would be the only one we officially support to start with.

ResourceInformation would be a class with a name, units, count, and addresses.  
The name would be "gpu"; the units for gpu would be empty "", but for other 
resource types like memory it could be GiB or similar; the count is the number 
of them, so for gpus it would be the number allocated; and finally the addresses 
Array of strings could be whatever we want - in the gpu case it would just be 
the indexes of the gpus allocated to the task, i.e. ["0", "2", "3"]. I made this 
a string so it is very flexible as to what the address is for different 
resource types.  Now the user has to know how to interpret this, but depending on 
what you are doing with them, even the same tools have multiple ways to specify 
devices. For instance, with tensorflow you can specify CUDA_VISIBLE_DEVICES=2,3 or 
you can specify something like:
for d in ['/device:GPU:2', '/device:GPU:3']:

private val name: String,
private val units: String,
private val count: Long,
private val addresses: Array[String] = Array.empty

def getName(): String = name
def getUnits(): String = units
def getCount(): Long = count
def getAddresses(): Array[String] = addresses

2) How the user specifies the gpu resources upon application submission

Here we need multiple configs:

   a) One for the user to specify the GPUs per task. To make that config 
extensible for other resources, I propose *spark.task.resource.\{resource 
type}.count*.  This implementation would only support gpu, but it gives us 
flexibility to add more. It allows for multiple resources as well as multiple 
configs per resource. For instance, the resource type here would be gpu, but 
you could add fpga.  It also allows adding more configs besides count; you 
could, for instance, add a type config to request a certain GPU type.

   b) The user has to specify how many GPUs per executor and driver.  This one is 
a bit more complicated since it has to work with the resource managers to 
actually acquire those, but I think it makes sense to have common configs like 
we do for cores and memory. So we can have *spark.executor.resource.\{resource 
type}.count* and *spark.driver.resource.\{resource type}.count*.   This 
implementation would only support gpu.  The tricky thing here is that some of the 
resource managers already have configs for asking for GPUs.  YARN has 
{{spark.yarn.executor.resource.\{resource-type}}}; although it was added in 3.0 
and hasn't shipped yet, we can't just remove it since you could ask YARN 
for other resource types Spark doesn't know about.  On Kubernetes you have to 
request via the pod template, so I think it would be on the user to make sure 
those match. Mesos has {{spark.mesos.gpus.max}}.  So we just need to make sure 
the new configs map onto those; having the duplicate configs might make it 
a bit confusing for the user.

3) How the executor discovers or is told the GPU resources it has.

Here I think we have 2 options for the user/resource manager.  

  a) I propose we add a config *spark.\{executor, 
driver}.resource.gpu.discoverScript* to allow the user to specify a discovery 
script. This script gets run when the executor starts and the user requested 
GPUs, to discover which GPUs the executor has.   A simple example would be a 
script that just runs "nvidia-smi --query-gpu=index --format=csv,noheader" 
to get the GPU indexes for NVIDIA cards.  You could make this script super 
simple or complicated depending on your setup.  The API for the script is that 
it is callable with no parameters and returns a string of comma-separated 
values.  Normally I would expect indexes like "0,1,2,3".

  b) Also add an option to the executor launch *--gpuDevices* that

[jira] [Created] (SPARK-27408) functions.coalesce working on csv but not on Mongospark

2019-04-08 Thread yashwanth (JIRA)
yashwanth created SPARK-27408:
-

 Summary: functions.coalesce working on csv but not on Mongospark
 Key: SPARK-27408
 URL: https://issues.apache.org/jira/browse/SPARK-27408
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.2.0
Reporter: yashwanth


e1.csv
 
id,code,type
1,,A
2,,
3,123,I
 

e2.csv

id,code,type
1,456,A
2,789,A1
3,,C


 Dataset goldenCopy = e1.as("a").join(e2.as("b")).where("a.id == b.id");

goldenCopy.select(functions.coalesce(e1.col("code"),e2.col("code"))).show();

I am not able to run the above code on the Dataset obtained from mongo-spark; I had 
imported the same csv files into MongoDB using mongoimport.
 
See the Stack Overflow question: 
https://stackoverflow.com/questions/55570984/spark-functions-coalesce-not-working-on-mongodb-collections-but-works-on-csvs







[jira] [Resolved] (SPARK-27176) Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4

2019-04-08 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27176.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 3.0.0

> Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4
> 
>
> Key: SPARK-27176
> URL: https://issues.apache.org/jira/browse/SPARK-27176
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-13704) TaskSchedulerImpl.createTaskSetManager can be expensive, and result in lost executors due to blocked heartbeats

2019-04-08 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-13704.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24245
[https://github.com/apache/spark/pull/24245]

> TaskSchedulerImpl.createTaskSetManager can be expensive, and result in lost 
> executors due to blocked heartbeats
> ---
>
> Key: SPARK-13704
> URL: https://issues.apache.org/jira/browse/SPARK-13704
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0
>Reporter: Zhong Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> In some cases, TaskSchedulerImpl.createTaskSetManager can be expensive. For 
> example, in a YARN cluster, it may call the topology script for rack 
> awareness. When submitting a very large job in a very large YARN cluster, the 
> topology script may take significant time to run, and this blocks receiving 
> executors' heartbeats, which may result in lost executors.
> A stacktrace we observed that is related to this issue:
> {code}
> "dag-scheduler-event-loop" daemon prio=10 tid=0x7f8392875800 nid=0x26e8 
> runnable [0x7f83576f4000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileInputStream.readBytes(Native Method)
> at java.io.FileInputStream.read(FileInputStream.java:272)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> - locked <0xf551f460> (a 
> java.lang.UNIXProcess$ProcessPipeInputStream)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
> - locked <0xf5529740> (a java.io.InputStreamReader)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:154)
> at java.io.BufferedReader.read1(BufferedReader.java:205)
> at java.io.BufferedReader.read(BufferedReader.java:279)
> - locked <0xf5529740> (a java.io.InputStreamReader)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:728)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:524)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at 
> org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
> at 
> org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
> at 
> org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
> at 
> org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
> at 
> org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
> at 
> org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:38)
> at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:210)
> at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:189)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:189)
> at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:158)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at 
> org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:157)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:187)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:161)
> - locked <0xea3b8a88> (a 
> org.apache.spark.scheduler.cluster.YarnScheduler)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:872)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleJ

[jira] [Assigned] (SPARK-13704) TaskSchedulerImpl.createTaskSetManager can be expensive, and result in lost executors due to blocked heartbeats

2019-04-08 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-13704:


Assignee: Lantao Jin

> TaskSchedulerImpl.createTaskSetManager can be expensive, and result in lost 
> executors due to blocked heartbeats
> ---
>
> Key: SPARK-13704
> URL: https://issues.apache.org/jira/browse/SPARK-13704
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0
>Reporter: Zhong Wang
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 3.0.0
>
>
> In some cases, TaskSchedulerImpl.createTaskSetManager can be expensive. For 
> example, in a YARN cluster, it may call the topology script for rack 
> awareness. When submitting a very large job in a very large YARN cluster, the 
> topology script may take significant time to run, and this blocks receiving 
> executors' heartbeats, which may result in lost executors.
> A stacktrace we observed that is related to this issue:
> {code}
> "dag-scheduler-event-loop" daemon prio=10 tid=0x7f8392875800 nid=0x26e8 
> runnable [0x7f83576f4000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileInputStream.readBytes(Native Method)
> at java.io.FileInputStream.read(FileInputStream.java:272)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> - locked <0xf551f460> (a 
> java.lang.UNIXProcess$ProcessPipeInputStream)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
> - locked <0xf5529740> (a java.io.InputStreamReader)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:154)
> at java.io.BufferedReader.read1(BufferedReader.java:205)
> at java.io.BufferedReader.read(BufferedReader.java:279)
> - locked <0xf5529740> (a java.io.InputStreamReader)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:728)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:524)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at 
> org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
> at 
> org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
> at 
> org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
> at 
> org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
> at 
> org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
> at 
> org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:38)
> at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:210)
> at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:189)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:189)
> at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:158)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at 
> org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:157)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:187)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:161)
> - locked <0xea3b8a88> (a 
> org.apache.spark.scheduler.cluster.YarnScheduler)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:872)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
> at 
> org.apache.spark.

[jira] [Updated] (SPARK-23710) Upgrade the built-in Hive to 2.3.4 for hadoop-3.2

2019-04-08 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23710:

Target Version/s: 3.0.0

> Upgrade the built-in Hive to 2.3.4 for hadoop-3.2
> -
>
> Key: SPARK-23710
> URL: https://issues.apache.org/jira/browse/SPARK-23710
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Critical
>
> Spark fails to run on Hadoop 3.x because Hive's ShimLoader considers Hadoop 
> 3.x to be an unknown Hadoop version; see SPARK-18673 and HIVE-16081 for more 
> details. So we need to upgrade the built-in Hive for Hadoop 3.x. This is an 
> umbrella JIRA to track this upgrade.
>  
> *Upgrade Plan*:
>  # SPARK-27054 Remove the Calcite dependency. This can avoid some jar 
> conflicts.
>  # SPARK-23749 Replace built-in Hive API (isSub/toKryo) and remove 
> OrcProto.Type usage
> # SPARK-27158, SPARK-27130 Update dev/* to support dynamically changing profiles 
> when testing
> # Fix the ORC dependency conflict so that tests pass on Hive 1.2.1 and 
> compilation passes on Hive 2.3.4
> # Add an empty hive-thriftserverV2 module, so that we can run all test cases 
> in the next step
> # Make the Hadoop-3.1 build with Hive 2.3.4 pass all tests
>  # Adapted hive-thriftserverV2 from hive-thriftserver with Hive 2.3.4's 
> [TCLIService.thrift|https://github.com/apache/hive/blob/rel/release-2.3.4/service-rpc/if/TCLIService.thrift]
>  
> I have completed the [initial 
> work|https://github.com/apache/spark/pull/24044] and plan to finish this 
> upgrade step by step.
>   
>  






[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812628#comment-16812628
 ] 

shane knapp commented on SPARK-27389:
-

JDKs haven't changed on the jenkins workers in a while, and neither have the 
python pytz packages...

i'm not really sure what's going on here and why this just started failing.  
i'll poke around more (later) today, after i get caught up from the latter half 
of last week.

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: shane knapp
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  That said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I don't see how that alone would cause 
> this failure only sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}






[jira] [Commented] (SPARK-27348) HeartbeatReceiver doesn't remove lost executors from CoarseGrainedSchedulerBackend

2019-04-08 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812657#comment-16812657
 ] 

Shixiong Zhu commented on SPARK-27348:
--

[~sandeep.katta2007] I cannot reproduce this locally. Ideally, when we decide 
to remove an executor, we should remove it from all places rather than counting 
on a TCP disconnect event, which may not always happen.

> HeartbeatReceiver doesn't remove lost executors from 
> CoarseGrainedSchedulerBackend
> --
>
> Key: SPARK-27348
> URL: https://issues.apache.org/jira/browse/SPARK-27348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> When a heartbeat timeout happens in HeartbeatReceiver, it doesn't remove lost 
> executors from CoarseGrainedSchedulerBackend. When a connection of an 
> executor is not gracefully shut down, CoarseGrainedSchedulerBackend may not 
> receive a disconnect event. In this case, CoarseGrainedSchedulerBackend still 
> thinks a lost executor is still alive. CoarseGrainedSchedulerBackend may ask 
> TaskScheduler to run tasks on this lost executor. This task will never finish 
> and the job will hang forever.






[jira] [Created] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3

2019-04-08 Thread Prabhjot Singh Bharaj (JIRA)
Prabhjot Singh Bharaj created SPARK-27409:
-

 Summary: Micro-batch support for Kafka Source in Spark 2.3
 Key: SPARK-27409
 URL: https://issues.apache.org/jira/browse/SPARK-27409
 Project: Spark
  Issue Type: Question
  Components: Structured Streaming
Affects Versions: 2.3.2
Reporter: Prabhjot Singh Bharaj


It seems that with this change - 
[https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50]
 - in Spark 2.3 for the Kafka source provider, a Kafka source cannot be run in 
micro-batch mode but only in continuous mode. Is that understanding correct?
{code:java}
E Py4JJavaError: An error occurred while calling o217.load.
E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
E at 
org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717)
E at 
org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566)
E at 
org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549)
E at 
org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62)
E at 
org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314)
E at 
org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78)
E at 
org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130)
E at 
org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43)
E at 
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185)
E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
E at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
E at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E at java.lang.reflect.Method.invoke(Method.java:498)
E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E at py4j.Gateway.invoke(Gateway.java:282)
E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E at py4j.commands.CallCommand.execute(CallCommand.java:79)
E at py4j.GatewayConnection.run(GatewayConnection.java:238)
E at java.lang.Thread.run(Thread.java:748)
E Caused by: org.apache.kafka.common.KafkaException: 
org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: 
non-existent (No such file or directory)
E at 
org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44)
E at 
org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93)
E at 
org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51)
E at 
org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84)
E at 
org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657)
E ... 19 more
E Caused by: org.apache.kafka.common.KafkaException: 
java.io.FileNotFoundException: non-existent (No such file or directory)
E at 
org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121)
E at 
org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41)
E ... 23 more
E Caused by: java.io.FileNotFoundException: non-existent (No such file or 
directory)
E at java.io.FileInputStream.open0(Native Method)
E at java.io.FileInputStream.open(FileInputStream.java:195)
E at java.io.FileInputStream.(FileInputStream.java:138)
E at java.io.FileInputStream.(FileInputStream.java:93)
E at 
org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:216)
E at 
org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:201)
E at 
org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:137)
E at 
org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:119)
E ... 24 more{code}
 When running a simple data stream loader for Kafka without an SSL cert, it 
goes through this code path:
 
{code:java}
...
...
org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130)
E at 
org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43)
E at 
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185)
...
...{code}
 
Note that I haven't selected `trigger=continuous...` when creating the 
dataframe, yet the code still goes through the continuous path. My 
understanding was that `continuous` is optional and not the default.
 
Please clarify.
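
For reference, a minimal PySpark sketch of how the execution mode is normally 
chosen (broker, topic, and sink below are placeholders; as I understand it, 
micro-batch is used unless a continuous trigger is requested explicitly on the 
writer):

{code:python}
# Minimal sketch with placeholder broker/topic names and a console sink.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-trigger-example").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
      .option("subscribe", "events")                        # placeholder topic
      .load())

# Micro-batch execution (the default when no continuous trigger is given):
query = (df.writeStream
         .format("console")
         .trigger(processingTime="10 seconds")
         .start())

# Continuous execution only when asked for explicitly:
# query = df.writeStream.format("console").trigger(continuous="1 second").start()
{code}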



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.or

[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812704#comment-16812704
 ] 

shane knapp commented on SPARK-27389:
-

is this even really a valid timezone?

plus, i really don't think this is a jenkins issue per se.  i whipped up some 
java to check for this timezone, which is there:

{code}
$ java DisplayZoneAndOffSet|grep Pacific-New
 US/Pacific-New (UTC-07:00)
{code}

but it's definitely not a valid pytz timezone:
{code}
$ python2.7 -c 'import pytz; print "US/Pacific-New" in pytz.all_timezones'
False
{code}

as a work-around...  i *could* hack 
{code}/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py{code} to 
include US/Pacific-New on all of the workers.  ;)
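
for reference, here's a small sketch (assuming the usual linux zoneinfo 
location) that lists system zone names pytz doesn't know about; output may also 
include stray non-zone files:

{code:python}
# Sketch: compare system zoneinfo entries against pytz's known timezones.
# Assumes the conventional Linux location; skips the posix/ and right/ trees.
import os
import pytz

ZONEINFO_DIR = "/usr/share/zoneinfo"
known = set(pytz.all_timezones)
unknown = []

for root, dirs, files in os.walk(ZONEINFO_DIR):
    dirs[:] = [d for d in dirs if d not in ("posix", "right")]
    for name in files:
        zone = os.path.relpath(os.path.join(root, name), ZONEINFO_DIR)
        if "/" in zone and zone not in known:
            unknown.append(zone)

print(sorted(unknown))  # a bogus entry like 'US/Pacific-New' would show up here
{code}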

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: shane knapp
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812704#comment-16812704
 ] 

shane knapp edited comment on SPARK-27389 at 4/8/19 6:56 PM:
-

is this even really a valid timezone?

plus, i really don't think this is a jenkins issue per se.  i whipped up some 
java to check for this timezone, which is there:

{code}
$ java DisplayZoneAndOffSet|grep Pacific-New
 US/Pacific-New (UTC-07:00)
{code}

but it's definitely not a valid pytz timezone:
{code}
$ python2.7 -c 'import pytz; print "US/Pacific-New" in pytz.all_timezones'
False
{code}

we're also running the latest version of pytz (according to pip at least):
{code}
$ pip2.7 install -U pytz
Requirement already up-to-date: pytz in 
/home/anaconda/lib/python2.7/site-packages (2018.9)
$ pip2.7 show pytz
Name: pytz
Version: 2018.9
Summary: World timezone definitions, modern and historical
Home-page: http://pythonhosted.org/pytz
Author: Stuart Bishop
Author-email: stu...@stuartbishop.net
License: MIT
Location: /home/anaconda/lib/python2.7/site-packages
Requires:
Required-by: pandas
{code}

as a work-around...  i *could* hack 
{code}/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py{code} to 
include US/Pacific-New on all of the workers.  ;)


was (Author: shaneknapp):
is this even really a valid timezone?

plus, i really don't think this is a jenkins issue per se.  i whipped up some 
java to check for this timezone, which is there:

{code}
$ java DisplayZoneAndOffSet|grep Pacific-New
 US/Pacific-New (UTC-07:00)
{code}

but it's definitely not a valid pytz timezone:
{code}
$ python2.7 -c 'import pytz; print "US/Pacific-New" in pytz.all_timezones'
False
{code}

as a work-around...  i *could* hack 
{code}/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py{code} to 
include US/Pacific-New on all of the workers.  ;)

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: shane knapp
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilde

[jira] [Assigned] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp reassigned SPARK-27389:
---

Assignee: (was: shane knapp)

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.6

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812756#comment-16812756
 ] 

shane knapp commented on SPARK-25079:
-

waiting on [~bryanc] to release pyarrow 0.12.1 before merging 
https://github.com/apache/spark/pull/24266

> [PYTHON] upgrade python 3.4 -> 3.6
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812765#comment-16812765
 ] 

Sean Owen commented on SPARK-27389:
---

On the question of what the heck it is, comically:
https://mm.icann.org/pipermail/tz/2009-February/015448.html

So... hmm, does this suggest it is the OS that has this installed 
somewhere? This bug was reported against pytz over a decade ago.

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812767#comment-16812767
 ] 

Sean Owen commented on SPARK-27389:
---

What about updating tzdata?

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812784#comment-16812784
 ] 

shane knapp commented on SPARK-27389:
-

well, this started happening ~6am PST on april 2nd as best as i can tell.

regarding the tzinfo on the centos workers (where this is failing), nothing has 
changed for a year:
{noformat}
$ ls -l /usr/share/zoneinfo/US
total 52
-rw-r--r--. 2 root root 2354 Apr  3  2017 Alaska
-rw-r--r--. 3 root root 2339 Apr  3  2017 Aleutian
-rw-r--r--. 2 root root  327 Apr  3  2017 Arizona
-rw-r--r--. 2 root root 3543 Apr  3  2017 Central
-rw-r--r--. 3 root root 3519 Apr  3  2017 Eastern
-rw-r--r--. 4 root root 1649 Apr  3  2017 East-Indiana
-rw-r--r--. 3 root root  250 Apr  3  2017 Hawaii
-rw-r--r--. 3 root root 2395 Apr  3  2017 Indiana-Starke
-rw-r--r--. 2 root root 2202 Apr  3  2017 Michigan
-rw-r--r--. 4 root root 2427 Apr  3  2017 Mountain
-rw-r--r--. 3 root root 2819 Apr  3  2017 Pacific
-rw-r--r--. 3 root root 2819 Apr  3  2017 Pacific-New
-rw-r--r--. 4 root root  174 Apr  3  2017 Samoa
{noformat}

anyways:  i still believe that this is a pyspark problem, not a jenkins worker 
configuration problem.



> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-

[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812785#comment-16812785
 ] 

shane knapp commented on SPARK-27389:
-

[~srowen] sure, i can update the tzdata package on the centos workers...  let's 
see if that does anything.  this will take ~5 mins.

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812789#comment-16812789
 ] 

shane knapp commented on SPARK-27389:
-

updating tzdata didn't do anything noticeable:

{noformat}
[sknapp@amp-jenkins-worker-04 ~]$ python2.7 -c 'import pytz; print 
"US/Pacific-New" in pytz.all_timezones'
False
[sknapp@amp-jenkins-worker-04 ~]$ which python2.7
/home/anaconda/bin/python2.7
{noformat}

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812789#comment-16812789
 ] 

shane knapp edited comment on SPARK-27389 at 4/8/19 9:03 PM:
-

updating tzdata didn't do anything noticeable:

{noformat}
[sknapp@amp-jenkins-worker-04 ~]$ python2.7 -c 'import pytz; print 
"US/Pacific-New" in pytz.all_timezones'
False
[sknapp@amp-jenkins-worker-04 ~]$ which python2.7
/home/anaconda/bin/python2.7
{noformat}

this is actually expected as pytz stores its OWN tzdata (see my earlier 
comment about hacking anaconda/lib/python2.7/site-packages/pytz/__init__.py).
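
a quick sketch to confirm this (pytz exposes its own version info and bundles 
its data under its package directory):

{code:python}
# Sketch: show that pytz carries its own tz database, separate from system tzdata.
import os
import pytz

print(pytz.VERSION)        # pytz release, e.g. 2018.9
print(pytz.OLSON_VERSION)  # version of the tz database bundled inside pytz
print(os.path.join(os.path.dirname(pytz.__file__), "zoneinfo"))  # bundled data dir
{code}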




was (Author: shaneknapp):
updating tzdata didn't do anything noticeable:

{noformat}
[sknapp@amp-jenkins-worker-04 ~]$ python2.7 -c 'import pytz; print 
"US/Pacific-New" in pytz.all_timezones'
False
[sknapp@amp-jenkins-worker-04 ~]$ which python2.7
/home/anaconda/bin/python2.7
{noformat}

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812789#comment-16812789
 ] 

shane knapp edited comment on SPARK-27389 at 4/8/19 9:09 PM:
-

updating tzdata (via pip and yum) didn't do anything noticeable:

{noformat}
[sknapp@amp-jenkins-worker-04 ~]$ python2.7 -c 'import pytz; print 
"US/Pacific-New" in pytz.all_timezones'
False
[sknapp@amp-jenkins-worker-04 ~]$ which python2.7
/home/anaconda/bin/python2.7
{noformat}

this is actually expected as pytz stores its OWN tzdata (see my earlier 
comment about hacking anaconda/lib/python2.7/site-packages/pytz/__init__.py).




was (Author: shaneknapp):
updating tzdata didn't do anything noticeable:

{noformat}
[sknapp@amp-jenkins-worker-04 ~]$ python2.7 -c 'import pytz; print 
"US/Pacific-New" in pytz.all_timezones'
False
[sknapp@amp-jenkins-worker-04 ~]$ which python2.7
/home/anaconda/bin/python2.7
{noformat}

this is actually expected as pytz stores its OWN tzdata (see my earlier 
comment about hacking anaconda/lib/python2.7/site-packages/pytz/__init__.py).



> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@

[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812802#comment-16812802
 ] 

Sean Owen commented on SPARK-27389:
---

I wonder what created /usr/share/zoneinfo/US/Pacific-New? AFAICT it 
shouldn't be there. It was updated at about the same time -- not just that one 
TZ but the whole thing. It doesn't sound like pytz; that's just the Python 
timezone library. It can't really be PySpark either; this isn't something in the 
Spark code at all.

Here's a complaint about tzdata providing this from a few years ago:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=815200

Removed in 2018d-1?
https://launchpad.net/ubuntu/+source/tzdata/+changelog


> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812838#comment-16812838
 ] 

shane knapp commented on SPARK-27389:
-

well, according to [~bryanc]:

"""
From the stacktrace, it looks like it's getting this from 
"spark.sql.session.timeZone", which defaults to 
java.util.TimeZone.getDefault().getID()
"""

here are the versions of tzdata* installed on the workers having this problem:
{noformat}
tzdata-2019a-1.el6.noarch
tzdata-java-2019a-1.el6.noarch
{noformat}

looks like we're on the latest, but US/Pacific-New is STILL showing up in 
/usr/share/zoneinfo/US.  

when i dig in to the java tzdata package, i am finding the following:

{noformat}
$ strings /usr/share/javazi/ZoneInfoMappings
...bunch of cruft deleted...
US/Pacific
America/Los_Angeles
US/Pacific-New
America/Los_Angeles
{noformat}

so, it appears to me that:
1) the OS still sees US/Pacific-New via tzdata
2) java still sees US/Pacific-New via tzdata-java
3) python has no idea WTF US/Pacific-New is and (occasionally) barfs during 
pyspark unit tests

so, should i go ahead and manually hack 
lib/python2.7/site-packages/pytz/__init__.py and add 'US/Pacific-New' which 
will fix the symptom (w/o fixing the cause)?

other than doing that, i'm actually stumped as to why this literally just 
started failing.
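
fwiw, a config-level alternative to patching pytz would be to pin the session 
time zone to a name pytz does understand (sketch; the app name is a placeholder, 
and the config key is the one quoted from [~bryanc] above):

{code:python}
# Sketch: pin spark.sql.session.timeZone explicitly so the pyspark/pandas path
# never sees the bogus JVM default (e.g. 'US/Pacific-New').
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tz-pin-example")                                # placeholder
         .config("spark.sql.session.timeZone", "America/Los_Angeles")
         .getOrCreate())

print(spark.conf.get("spark.sql.session.timeZone"))
{code}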

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone

[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812841#comment-16812841
 ] 

shane knapp edited comment on SPARK-27389 at 4/8/19 10:06 PM:
--

also, java8 appears to believe i'm in the US/Pacific (not Pacific-New) TZ:

{noformat}
[sknapp@amp-jenkins-worker-04 ~]$ cat tz.java
import java.util.TimeZone;

public class tz {
public static void main(String[] args) {
TimeZone tz = TimeZone.getDefault();
System.out.println(tz.getID());
}

}
[sknapp@amp-jenkins-worker-04 ~]$ javac tz.java
[sknapp@amp-jenkins-worker-04 ~]$ java tz
US/Pacific
[sknapp@amp-jenkins-worker-04 ~]$ java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
{noformat}


was (Author: shaneknapp):
also, java8 appears to believe i'm in the US/Pacific (not Pacific-New) TZ:

{preformat}
[sknapp@amp-jenkins-worker-04 ~]$ cat tz.java
import java.util.TimeZone;

public class tz {
public static void main(String[] args) {
TimeZone tz = TimeZone.getDefault();
System.out.println(tz.getID());
}

}
[sknapp@amp-jenkins-worker-04 ~]$ javac tz.java
[sknapp@amp-jenkins-worker-04 ~]$ java tz
US/Pacific
[sknapp@amp-jenkins-worker-04 ~]$ java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
{preformat}

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno what that alone would cause 
> this failure sometime.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_t

[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812841#comment-16812841
 ] 

shane knapp commented on SPARK-27389:
-

also, java8 appears to believe i'm in the US/Pacific (not Pacific-New) TZ:

{preformat}
[sknapp@amp-jenkins-worker-04 ~]$ cat tz.java
import java.util.TimeZone;

public class tz {
public static void main(String[] args) {
TimeZone tz = TimeZone.getDefault();
System.out.println(tz.getID());
}

}
[sknapp@amp-jenkins-worker-04 ~]$ javac tz.java
[sknapp@amp-jenkins-worker-04 ~]$ java tz
US/Pacific
[sknapp@amp-jenkins-worker-04 ~]$ java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
{preformat}

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno what that alone would cause 
> this failure sometime.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812841#comment-16812841
 ] 

shane knapp edited comment on SPARK-27389 at 4/8/19 10:07 PM:
--

also, java8 appears to believe i'm in the US/Pacific (not Pacific-New) TZ:

{noformat}
[sknapp@amp-jenkins-worker-04 ~]$ cat tz.java
import java.util.TimeZone;

public class tz {
public static void main(String[] args) {
TimeZone tz = TimeZone.getDefault();
System.out.println(tz.getID());
}

}
[sknapp@amp-jenkins-worker-04 ~]$ javac tz.java
[sknapp@amp-jenkins-worker-04 ~]$ java tz
US/Pacific
[sknapp@amp-jenkins-worker-04 ~]$ java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
{noformat}


was (Author: shaneknapp):
also, java8 appears to believe i'm in the US/Pacific (not Pacific-New) TZ:

{noformat}
[sknapp@amp-jenkins-worker-04 ~]$ cat tz.java
import java.util.TimeZone;

public class tz {
public static void main(String[] args) {
TimeZone tz = TimeZone.getDefault();
System.out.println(tz.getID());
}

}
[sknapp@amp-jenkins-worker-04 ~]$ javac tz.java
[sknapp@amp-jenkins-worker-04 ~]$ java tz
US/Pacific
[sknapp@amp-jenkins-worker-04 ~]$ java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
{noformat}

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno what that alone would cause 
> this failure sometime.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 

[jira] [Updated] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data

2019-04-08 Thread Bijith Kumar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bijith Kumar updated SPARK-16548:
-
Attachment: corrupted.json

> java.io.CharConversionException: Invalid UTF-32 character  prevents me from 
> querying my data
> 
>
> Key: SPARK-16548
> URL: https://issues.apache.org/jira/browse/SPARK-16548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Egor Pahomov
>Priority: Minor
> Fix For: 2.2.0, 2.3.0
>
> Attachments: corrupted.json
>
>
> Basically, when I query my json data I get 
> {code}
> java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above 
> 10)  at char #192, byte #771)
>   at 
> com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
>   at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142)
> {code}
> I do not like it. If you cannot process one JSON document among 100500, please 
> return null; do not fail everything. I have a dirty one-line fix, and I understand 
> how I can make it more reasonable. What is our position - what behaviour do we 
> want?
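
As an aside, the behaviour being asked for here (null for the bad record instead of failing the whole query) can be illustrated on the user side with a defensive UDF. This is only an illustrative sketch, not the fix that landed in 2.2.0/2.3.0, and the field name and input path are hypothetical:

{code:python}
# Illustrative user-side sketch of the requested "return null, don't fail"
# behaviour; not the actual fix. Field name and input path are hypothetical.
import json
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def _safe_get_field(raw):
    # Return the field as a string, or None when the document cannot be parsed,
    # so one malformed record yields null instead of failing the whole query.
    try:
        doc = json.loads(raw) if raw is not None else None
    except (ValueError, TypeError, UnicodeDecodeError):
        return None
    if not isinstance(doc, dict):
        return None
    value = doc.get('some_field')            # hypothetical field name
    return None if value is None else str(value)

safe_get_field = udf(_safe_get_field, StringType())

# df = spark.read.text('corrupted.json')
# df.select(safe_get_field(df['value'])).show()
{code}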



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data

2019-04-08 Thread Bijith Kumar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812858#comment-16812858
 ] 

Bijith Kumar commented on SPARK-16548:
--

[~cloud_fan], I couldn't find the specific character in the corrupted data that 
is causing the issue. However, here is the corrupted section from the file to 
reproduce the issue. Please see the attachment - [^corrupted.json]. 

> java.io.CharConversionException: Invalid UTF-32 character  prevents me from 
> querying my data
> 
>
> Key: SPARK-16548
> URL: https://issues.apache.org/jira/browse/SPARK-16548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Egor Pahomov
>Priority: Minor
> Fix For: 2.2.0, 2.3.0
>
> Attachments: corrupted.json
>
>
> Basically, when I query my json data I get 
> {code}
> java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above 
> 10)  at char #192, byte #771)
>   at 
> com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
>   at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142)
> {code}
> I do not like it. If you cannot process one JSON document among 100500, please 
> return null; do not fail everything. I have a dirty one-line fix, and I understand 
> how I can make it more reasonable. What is our position - what behaviour do we 
> want?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27410) Remove deprecated/no-op mllib.Kmeans get/setRuns methods

2019-04-08 Thread Sean Owen (JIRA)
Sean Owen created SPARK-27410:
-

 Summary: Remove deprecated/no-op mllib.Kmeans get/setRuns methods
 Key: SPARK-27410
 URL: https://issues.apache.org/jira/browse/SPARK-27410
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 3.0.0
Reporter: Sean Owen
Assignee: Sean Owen


mllib.KMeans has getRuns, setRuns methods which haven't done anything since 
Spark 2.1. They're deprecated, and no-ops, and should be removed for Spark 3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812877#comment-16812877
 ] 

Bryan Cutler commented on SPARK-27389:
--

[~shaneknapp], I had a couple of successful tests with worker-4. Do you know if 
the problem is consistent on certain workers or just random on all of them?

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno what that alone would cause 
> this failure sometime.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25407) Spark throws a `ParquetDecodingException` when attempting to read a field from a complex type in certain cases of schema merging

2019-04-08 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25407:
-

Assignee: Michael Allman  (was: Dongjoon Hyun)

> Spark throws a `ParquetDecodingException` when attempting to read a field 
> from a complex type in certain cases of schema merging
> 
>
> Key: SPARK-25407
> URL: https://issues.apache.org/jira/browse/SPARK-25407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Michael Allman
>Assignee: Michael Allman
>Priority: Major
> Fix For: 3.0.0
>
>
> Spark supports merging schemata across table partitions in which one 
> partition is missing a subfield that's present in another. However, 
> attempting to select that missing field with a query that includes a 
> partition pruning predicate that filters out the partitions that include that 
> field results in a `ParquetDecodingException` when attempting to get the 
> query results.
> This bug is specifically exercised by the failing (but ignored) test case 
> [https://github.com/apache/spark/blob/f2d35427eedeacceb6edb8a51974a7e8bbb94bc2/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L125-L131].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812883#comment-16812883
 ] 

shane knapp commented on SPARK-27389:
-

no, it appears to be random.

[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/365/]

[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/364/]

these two identical builds ran w/the same python/java/whathaveyou setup on the 
*same physical worker*.  one passes, one fails w/the date thing.

 

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno what that alone would cause 
> this failure sometime.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812883#comment-16812883
 ] 

shane knapp edited comment on SPARK-27389 at 4/9/19 12:05 AM:
--

-no, it appears to be random.-

-[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/365/]-

-[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/364/]-

-these two identical builds ran w/the same python/java/whathaveyou setup on the 
*same physical worker*.  one passes, one fails w/the date thing.-

bad example, pls hold.


was (Author: shaneknapp):
no, it appears to be random.

[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/365/]

[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/364/]

these two identical builds ran w/the same python/java/whathaveyou setup on the 
*same physical worker*.  one passes, one fails w/the date thing.

 

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno what that alone would cause 
> this failure sometime.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812883#comment-16812883
 ] 

shane knapp edited comment on SPARK-27389 at 4/9/19 12:21 AM:
--

-no, it appears to be random.-

-[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/365/]-

-[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/364/]-

-these two identical builds ran w/the same python/java/whathaveyou setup on the 
*same physical worker*.  one passes, one fails w/the date thing.-

bad example, pls hold.

i need to do some more build archaeology this evening and tomorrow.  i'm aware 
that this is important.  :)


was (Author: shaneknapp):
-no, it appears to be random.-

-[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/365/]-

-[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/364/]-

-these two identical builds ran w/the same python/java/whathaveyou setup on the 
*same physical worker*.  one passes, one fails w/the date thing.-

bad example, pls hold.

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno what that alone would cause 
> this failure sometime.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-ma

[jira] [Assigned] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver

2019-04-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26881:
-

Assignee: Rafael RENAUDIN-AVINO

> Scaling issue with Gramian computation for RowMatrix: too many results sent 
> to driver
> -
>
> Key: SPARK-26881
> URL: https://issues.apache.org/jira/browse/SPARK-26881
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Rafael RENAUDIN-AVINO
>Assignee: Rafael RENAUDIN-AVINO
>Priority: Minor
>
> This issue hit me when running PCA on a large dataset (~1 billion rows, ~30k 
> columns).
> Computing the Gramian of a big RowMatrix is enough to reproduce the issue.
>  
> The problem arises in the treeAggregate phase of the Gramian matrix 
> computation: the results sent to the driver are enormous.
> A potential solution could be to replace the hard-coded depth (2) of the tree 
> aggregation with a heuristic computed from the number of partitions, the 
> driver max result size, and the memory size of the dense vectors being 
> aggregated; cf. the constraint below for more detail:
> (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size
> I have a potential fix ready (currently testing it at scale), but I'd like to 
> hear the community's opinion on such a fix to know whether it's worth investing 
> my time in a clean pull request.
>  
> Note that I only faced this issue with Spark 2.2, but I suspect it affects 
> later versions as well. 
>  
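
For a sense of scale, a back-of-the-envelope sketch of the heuristic proposed above; the packed-buffer size (roughly n*(n+1)/2 doubles per partial Gramian) and all names and numbers below are illustrative assumptions, not taken from the pull request:

{code:python}
import math

def suggested_depth(nb_partitions, dense_vector_bytes, driver_max_result_size):
    # Smallest depth such that nb_partitions^(1/depth) * dense_vector_size stays
    # under driver_max_result_size, never dropping below the current default of 2.
    ratio = driver_max_result_size / float(dense_vector_bytes)
    if ratio <= 1.0:
        raise ValueError('one aggregated buffer already exceeds the driver limit')
    return max(2, int(math.ceil(math.log(nb_partitions) / math.log(ratio))))

# 10k columns, 10k partitions, 4 GiB spark.driver.maxResultSize -> depth 4
buffer_bytes = 10000 * 10001 // 2 * 8            # ~381 MiB per partial Gramian
print(suggested_depth(10000, buffer_bytes, 4 << 30))

# For the ~30k-column case above, each packed buffer is already ~3.35 GiB, so
# spark.driver.maxResultSize has to be raised no matter how deep the tree is.
{code}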



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver

2019-04-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26881.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23983
[https://github.com/apache/spark/pull/23983]

> Scaling issue with Gramian computation for RowMatrix: too many results sent 
> to driver
> -
>
> Key: SPARK-26881
> URL: https://issues.apache.org/jira/browse/SPARK-26881
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Rafael RENAUDIN-AVINO
>Assignee: Rafael RENAUDIN-AVINO
>Priority: Minor
> Fix For: 3.0.0
>
>
> This issue hit me when running PCA on a large dataset (~1 billion rows, ~30k 
> columns).
> Computing the Gramian of a big RowMatrix is enough to reproduce the issue.
>  
> The problem arises in the treeAggregate phase of the Gramian matrix 
> computation: the results sent to the driver are enormous.
> A potential solution could be to replace the hard-coded depth (2) of the tree 
> aggregation with a heuristic computed from the number of partitions, the 
> driver max result size, and the memory size of the dense vectors being 
> aggregated; cf. the constraint below for more detail:
> (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size
> I have a potential fix ready (currently testing it at scale), but I'd like to 
> hear the community's opinion on such a fix to know whether it's worth investing 
> my time in a clean pull request.
>  
> Note that I only faced this issue with Spark 2.2, but I suspect it affects 
> later versions as well. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3

2019-04-08 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812991#comment-16812991
 ] 

Shivu Sondur commented on SPARK-27409:
--

i am checking this

> Micro-batch support for Kafka Source in Spark 2.3
> -
>
> Key: SPARK-27409
> URL: https://issues.apache.org/jira/browse/SPARK-27409
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Prabhjot Singh Bharaj
>Priority: Major
>
> It seems that with this change - 
> [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50]
>  - in Spark 2.3's Kafka source provider, a Kafka source cannot be run in 
> micro-batch mode but only in continuous mode. Is that understanding correct?
> {code:java}
> E Py4JJavaError: An error occurred while calling o217.load.
> E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
> E at 
> org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717)
> E at 
> org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566)
> E at 
> org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549)
> E at 
> org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62)
> E at 
> org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314)
> E at 
> org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78)
> E at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130)
> E at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43)
> E at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185)
> E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> E at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> E at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> E at java.lang.reflect.Method.invoke(Method.java:498)
> E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> E at py4j.Gateway.invoke(Gateway.java:282)
> E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> E at py4j.commands.CallCommand.execute(CallCommand.java:79)
> E at py4j.GatewayConnection.run(GatewayConnection.java:238)
> E at java.lang.Thread.run(Thread.java:748)
> E Caused by: org.apache.kafka.common.KafkaException: 
> org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: 
> non-existent (No such file or directory)
> E at 
> org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44)
> E at 
> org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93)
> E at 
> org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51)
> E at 
> org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84)
> E at 
> org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657)
> E ... 19 more
> E Caused by: org.apache.kafka.common.KafkaException: 
> java.io.FileNotFoundException: non-existent (No such file or directory)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121)
> E at 
> org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41)
> E ... 23 more
> E Caused by: java.io.FileNotFoundException: non-existent (No such file or 
> directory)
> E at java.io.FileInputStream.open0(Native Method)
> E at java.io.FileInputStream.open(FileInputStream.java:195)
> E at java.io.FileInputStream.(FileInputStream.java:138)
> E at java.io.FileInputStream.(FileInputStream.java:93)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:216)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:201)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:137)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:119)
> E ... 24 more{code}
>  When running a simple data stream loader for Kafka without an SSL cert, it 
> goes through this code block - 
>  
> {code:java}
> ...
> ...
> org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130)
> E at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43)
> E at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185)
> ...
> ...{code}
>  
> Note that I haven't selected `trigger=continu
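
For context, a minimal pyspark sketch of the micro-batch Kafka path the question is about; the broker, topic and checkpoint location are placeholders, and an active SparkSession with the Kafka connector on the classpath is assumed. Continuous mode would instead use .trigger(continuous="1 second").

{code:python}
# Micro-batch read from Kafka and write to the console sink; all connection
# details below are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")   # placeholder broker
      .option("subscribe", "some-topic")                 # placeholder topic
      .load())

query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("console")
         .trigger(processingTime="10 seconds")           # micro-batch trigger
         .option("checkpointLocation", "/tmp/checkpoint")
         .start())
{code}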

[jira] [Assigned] (SPARK-27328) Create 'deprecate' property in ExpressionDescription for SQL functions documentation

2019-04-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27328:
---

Assignee: Hyukjin Kwon

> Create 'deprecate' property in ExpressionDescription for SQL functions 
> documentation
> 
>
> Key: SPARK-27328
> URL: https://issues.apache.org/jira/browse/SPARK-27328
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, there looks no way to show SQL functions are deprecated. See 
> https://spark.apache.org/docs/2.3.0/api/sql/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27328) Create 'deprecate' property in ExpressionDescription for SQL functions documentation

2019-04-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27328.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24259
[https://github.com/apache/spark/pull/24259]

> Create 'deprecate' property in ExpressionDescription for SQL functions 
> documentation
> 
>
> Key: SPARK-27328
> URL: https://issues.apache.org/jira/browse/SPARK-27328
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, there looks no way to show SQL functions are deprecated. See 
> https://spark.apache.org/docs/2.3.0/api/sql/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27411) DataSourceV2Strategy should not eliminate subquery

2019-04-08 Thread Mingcong Han (JIRA)
Mingcong Han created SPARK-27411:


 Summary: DataSourceV2Strategy should not eliminate subquery
 Key: SPARK-27411
 URL: https://issues.apache.org/jira/browse/SPARK-27411
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Mingcong Han
 Fix For: 3.0.0


In DataSourceV2Strategy, it seems we eliminate subqueries by mistake after 
normalizing the filters. Here is an example.
We have a SQL query with a scalar subquery:
{code:scala}
val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)")
plan.explain(true)
{code}
And we get the following log output from DataSourceV2Strategy:
{noformat}
Pushing operators to csv:examples/src/main/resources/t2.txt
Pushed Filters: 
Post-Scan Filters: isnotnull(t2a#30)
Output: t2a#30, t2b#31
{noformat}
The `Post-Scan Filters` should contain the scalar subquery, but we eliminate it 
by mistake.
{noformat}
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('t2a > scalar-subquery#56 [])
   :  +- 'Project [unresolvedalias('max('t1a), None)]
   : +- 'UnresolvedRelation `t1`
   +- 'UnresolvedRelation `t2`

== Analyzed Logical Plan ==
t2a: string, t2b: string
Project [t2a#30, t2b#31]
+- Filter (t2a#30 > scalar-subquery#56 [])
   :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
   : +- SubqueryAlias `t1`
   :+- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
   +- SubqueryAlias `t2`
  +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt

== Optimized Logical Plan ==
Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 []))
:  +- Aggregate [max(t1a#13) AS max(t1a)#63]
: +- Project [t1a#13]
:+- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
+- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt

== Physical Plan ==
*(1) Project [t2a#30, t2b#31]
+- *(1) Filter isnotnull(t2a#30)
   +- *(1) BatchScan[t2a#30, t2b#31] class 
org.apache.spark.sql.execution.datasources.v2.csv.CSVScan
{noformat}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage

2019-04-08 Thread Chendi.Xue (JIRA)
Chendi.Xue created SPARK-27412:
--

 Summary: Add a new shuffle manager to use Persistent Memory as 
shuffle and spilling storage
 Key: SPARK-27412
 URL: https://issues.apache.org/jira/browse/SPARK-27412
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle, Spark Core
Affects Versions: 3.0.0
Reporter: Chendi.Xue


Add a new shuffle manager called "PmemShuffleManager", which lets Spark use 
persistent memory devices as the storage for shuffle data and external sorter 
spilling.

In this implementation, we leverage the Persistent Memory Development Kit (PMDK) to 
support high-performance transactional writes.
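
Presumably such a manager would be wired in the way other ShuffleManager implementations are, through spark.shuffle.manager; a minimal sketch, where the fully-qualified class name is hypothetical:

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pmem-shuffle-demo")
         # Hypothetical class name; spark.shuffle.manager accepts a fully-qualified
         # ShuffleManager implementation, which is how an external manager plugs in.
         .config("spark.shuffle.manager",
                 "org.apache.spark.shuffle.pmem.PmemShuffleManager")
         .getOrCreate())
{code}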

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage

2019-04-08 Thread Chendi.Xue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chendi.Xue updated SPARK-27412:
---
External issue URL: https://github.com/apache/spark/pull/24322

> Add a new shuffle manager to use Persistent Memory as shuffle and spilling 
> storage
> --
>
> Key: SPARK-27412
> URL: https://issues.apache.org/jira/browse/SPARK-27412
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Chendi.Xue
>Priority: Minor
>  Labels: shuffle
>
> Add a new shuffle manager called "PmemShuffleManager", which lets Spark use 
> persistent memory devices as the storage for shuffle data and external sorter 
> spilling.
> In this implementation, we leverage the Persistent Memory Development Kit (PMDK) 
> to support high-performance transactional writes.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage

2019-04-08 Thread Chendi.Xue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chendi.Xue updated SPARK-27412:
---
External issue URL:   (was: https://github.com/apache/spark/pull/24322)

> Add a new shuffle manager to use Persistent Memory as shuffle and spilling 
> storage
> --
>
> Key: SPARK-27412
> URL: https://issues.apache.org/jira/browse/SPARK-27412
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Chendi.Xue
>Priority: Minor
>  Labels: shuffle
>
> Add a new shuffle manager called "PmemShuffleManager", which lets Spark use 
> persistent memory devices as the storage for shuffle data and external sorter 
> spilling.
> In this implementation, we leverage the Persistent Memory Development Kit (PMDK) 
> to support high-performance transactional writes.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org