[jira] [Assigned] (SPARK-27384) File source V2: Prune unnecessary partition columns
[ https://issues.apache.org/jira/browse/SPARK-27384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27384: --- Assignee: Gengliang Wang > File source V2: Prune unnecessary partition columns > --- > > Key: SPARK-27384 > URL: https://issues.apache.org/jira/browse/SPARK-27384 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > When scanning file sources, we can prune unnecessary partition columns when > constructing input partitions, so that: > 1. The data transferred from the driver to executors is reduced. > 2. It is easier to implement columnar batch readers, since the partition > columns are already pruned. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27384) File source V2: Prune unnecessary partition columns
[ https://issues.apache.org/jira/browse/SPARK-27384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27384. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24296 [https://github.com/apache/spark/pull/24296] > File source V2: Prune unnecessary partition columns > --- > > Key: SPARK-27384 > URL: https://issues.apache.org/jira/browse/SPARK-27384 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > When scanning file sources, we can prune unnecessary partition columns when > constructing input partitions, so that: > 1. The data transferred from the driver to executors is reduced. > 2. It is easier to implement columnar batch readers, since the partition > columns are already pruned. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
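For illustration, a minimal sketch of the scenario this improvement targets (path, table layout and column names are made up): a scan that references only one of the table's partition columns, so the values of the other partition column never need to travel with each input partition sent to executors.
{code:scala}
import org.apache.spark.sql.SparkSession

object PartitionPruningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-column-pruning-sketch")
      .master("local[*]")
      .getOrCreate()

    // Write a small table partitioned by `year` and `region` (placeholder path).
    spark.range(100)
      .selectExpr("id", "cast(id % 3 as int) AS year", "cast(id % 5 as int) AS region")
      .write.partitionBy("year", "region")
      .mode("overwrite")
      .parquet("/tmp/pruning_demo")

    // The query only references `year`, so `region` is an unnecessary
    // partition column for this scan and can be pruned when the driver
    // constructs the input partitions.
    spark.read.parquet("/tmp/pruning_demo")
      .where("year = 1")
      .select("id", "year")
      .explain()

    spark.stop()
  }
}
{code}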
[jira] [Commented] (SPARK-27289) spark-submit explicit configuration does not take effect but Spark UI shows it's effective
[ https://issues.apache.org/jira/browse/SPARK-27289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812242#comment-16812242 ] Udbhav Agrawal commented on SPARK-27289: Yes, intermediate data is written to the spark.local.dir that is configured through the --conf parameter when running spark-submit; it should override the value set in spark-defaults.conf. > spark-submit explicit configuration does not take effect but Spark UI shows > it's effective > -- > > Key: SPARK-27289 > URL: https://issues.apache.org/jira/browse/SPARK-27289 > Project: Spark > Issue Type: Bug > Components: Deploy, Documentation, Spark Submit, Web UI >Affects Versions: 2.3.3 >Reporter: KaiXu >Priority: Minor > Attachments: Capture.PNG > > > The [doc|https://spark.apache.org/docs/latest/submitting-applications.html] says that > "In general, configuration values explicitly set on a {{SparkConf}} take the > highest precedence, then flags passed to {{spark-submit}}, then values in the > defaults file", but when setting spark.local.dir through --conf with > spark-submit, it still uses the values from > ${SPARK_HOME}/conf/spark-defaults.conf; what's more, the Spark runtime UI > environment variables show the value from --conf, which is really misleading. > e.g. > I submit my application through the command: > /opt/spark233/bin/spark-submit --properties-file /opt/spark.conf --conf > spark.local.dir=/tmp/spark_local -v --class > org.apache.spark.examples.mllib.SparseNaiveBayes --master > spark://bdw-slave20:7077 > /opt/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar > hdfs://bdw-slave20:8020/Bayes/Input > > the spark.local.dir in ${SPARK_HOME}/conf/spark-defaults.conf is: > spark.local.dir=/mnt/nvme1/spark_local > when the application is running, I found the intermediate shuffle data was > written to /mnt/nvme1/spark_local, which is set through > ${SPARK_HOME}/conf/spark-defaults.conf, but the Web UI shows the > environment value spark.local.dir=/tmp/spark_local. > The spark-submit verbose output also shows spark.local.dir=/tmp/spark_local, which is > misleading. > > !image-2019-03-27-10-59-38-377.png! > spark-submit verbose: > > Spark properties used, including those specified through > --conf and those from the properties file /opt/spark.conf: > (spark.local.dir,/tmp/spark_local) > (spark.default.parallelism,132) > (spark.driver.memory,10g) > (spark.executor.memory,352g) > X -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
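A small sketch of how the resolved value can be checked from inside a submitted application (assuming the spark-submit command above). The documented precedence is explicit SparkConf settings, then spark-submit flags such as --conf, then spark-defaults.conf, so printing the driver-side value helps separate what was resolved from what the shuffle actually used.
{code:scala}
import org.apache.spark.sql.SparkSession

object LocalDirCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()

    // The value the driver resolved after applying the documented precedence:
    // explicit SparkConf > spark-submit --conf > spark-defaults.conf.
    val localDir = spark.sparkContext.getConf.get("spark.local.dir", "<not set>")
    println(s"effective spark.local.dir = $localDir")

    // Compare this against the directories the executors actually write
    // shuffle data to; a mismatch is what SPARK-27289 reports.
    spark.stop()
  }
}
{code}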
[jira] [Commented] (SPARK-27406) UnsafeArrayData serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-27406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812302#comment-16812302 ] Sandeep Katta commented on SPARK-27406: --- [~pengbo] thanks for raising this issue, soon I will raise PR for this > UnsafeArrayData serialization breaks when two machines have different Oops > size > --- > > Key: SPARK-27406 > URL: https://issues.apache.org/jira/browse/SPARK-27406 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1 >Reporter: peng bo >Priority: Major > > ApproxCountDistinctForIntervals holds the UnsafeArrayData data to initialize > endpoints. When the UnsafeArrayData is serialized with Java serialization, > the BYTE_ARRAY_OFFSET in memory can change if two machines have different > pointer width (Oops in JVM). > It's similar to SPARK-10914. > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints$lzycompute(ApproxCountDistinctForIntervals.scala:69) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints(ApproxCountDistinctForIntervals.scala:66) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray$lzycompute(ApproxCountDistinctForIntervals.scala:94) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray(ApproxCountDistinctForIntervals.scala:93) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp$lzycompute(ApproxCountDistinctForIntervals.scala:104) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp(ApproxCountDistinctForIntervals.scala:104) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords$lzycompute(ApproxCountDistinctForIntervals.scala:106) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords(ApproxCountDistinctForIntervals.scala:106) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:110) > at > 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:44) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.initialize(interfaces.scala:528) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.initAggregationBuffer(ObjectAggregationIterator.scala:120) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.org$apache$spark$sql$execution$aggregate$ObjectAggregationIterator$$createNewAggregationBuffer(ObjectAggregationIterator.scala:112) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.getAggregationBufferByKey(ObjectA
[jira] [Commented] (SPARK-27406) UnsafeArrayData serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-27406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812376#comment-16812376 ] peng bo commented on SPARK-27406: - [~sandeep.katta2007] Actually, I have already submitted PR for this, can you please review it? https://github.com/apache/spark/pull/24317/files > UnsafeArrayData serialization breaks when two machines have different Oops > size > --- > > Key: SPARK-27406 > URL: https://issues.apache.org/jira/browse/SPARK-27406 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1 >Reporter: peng bo >Priority: Major > > ApproxCountDistinctForIntervals holds the UnsafeArrayData data to initialize > endpoints. When the UnsafeArrayData is serialized with Java serialization, > the BYTE_ARRAY_OFFSET in memory can change if two machines have different > pointer width (Oops in JVM). > It's similar to SPARK-10914. > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints$lzycompute(ApproxCountDistinctForIntervals.scala:69) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints(ApproxCountDistinctForIntervals.scala:66) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray$lzycompute(ApproxCountDistinctForIntervals.scala:94) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray(ApproxCountDistinctForIntervals.scala:93) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp$lzycompute(ApproxCountDistinctForIntervals.scala:104) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp(ApproxCountDistinctForIntervals.scala:104) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords$lzycompute(ApproxCountDistinctForIntervals.scala:106) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords(ApproxCountDistinctForIntervals.scala:106) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:110) > at > 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:44) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.initialize(interfaces.scala:528) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.initAggregationBuffer(ObjectAggregationIterator.scala:120) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.org$apache$spark$sql$execution$aggregate$ObjectAggregationIterator$$createNewAggregationBuffer(ObjectAggregationIterator.scala:112) > at > org.apache.spark.sql.execution.aggre
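A hypothetical illustration of the failure mode, not the actual Spark patch: a structure that caches an absolute base offset (such as Platform.BYTE_ARRAY_OFFSET) must not trust that offset after Java serialization, because it differs between JVMs with different Oops sizes. One robust pattern, similar in spirit to the SPARK-10914 fix for UnsafeRow, is to serialize only the logical bytes and recompute the offset on the receiving JVM; the class below is a made-up stand-in, not UnsafeArrayData itself.
{code:scala}
import java.io.{ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for an UnsafeArrayData-like holder; the class name
// and fields are illustrative only.
class OffsetBackedArray(
    private var bytes: Array[Byte],
    private var baseOffset: Long) extends Serializable {

  private def writeObject(out: ObjectOutputStream): Unit = {
    // Only the logical content crosses the wire, never the raw offset.
    out.writeInt(bytes.length)
    out.write(bytes)
  }

  private def readObject(in: ObjectInputStream): Unit = {
    val len = in.readInt()
    bytes = new Array[Byte](len)
    in.readFully(bytes)
    // Recompute the platform-specific offset on this JVM; in Spark this is
    // what Platform.BYTE_ARRAY_OFFSET provides.
    baseOffset = OffsetBackedArray.localByteArrayOffset
  }
}

object OffsetBackedArray {
  // Obtain the local JVM's byte-array base offset via reflection so the
  // sketch runs without bootclasspath tricks.
  val localByteArrayOffset: Long = {
    val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    field.get(null).asInstanceOf[sun.misc.Unsafe]
      .arrayBaseOffset(classOf[Array[Byte]]).toLong
  }
}
{code}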
[jira] [Commented] (SPARK-27348) HeartbeatReceiver doesn't remove lost executors from CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-27348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812396#comment-16812396 ] Sandeep Katta commented on SPARK-27348: --- [~zsxwing] do you have any test code or a scenario that can substantiate your statement? > HeartbeatReceiver doesn't remove lost executors from > CoarseGrainedSchedulerBackend > -- > > Key: SPARK-27348 > URL: https://issues.apache.org/jira/browse/SPARK-27348 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Priority: Major > > When a heartbeat timeout happens in HeartbeatReceiver, it doesn't remove lost > executors from CoarseGrainedSchedulerBackend. When a connection of an > executor is not gracefully shut down, CoarseGrainedSchedulerBackend may not > receive a disconnect event. In this case, CoarseGrainedSchedulerBackend still > thinks a lost executor is still alive. CoarseGrainedSchedulerBackend may ask > TaskScheduler to run tasks on this lost executor. This task will never finish > and the job will hang forever. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27407) File source V2: Invalidate cache data on overwrite/append
Gengliang Wang created SPARK-27407: -- Summary: File source V2: Invalidate cache data on overwrite/append Key: SPARK-27407 URL: https://issues.apache.org/jira/browse/SPARK-27407 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang File source V2 currently incorrectly continues to use cached data even if the underlying data is overwritten. We should follow https://github.com/apache/spark/pull/13566 and fix it by invalidating and refreshing all the cached data (and the associated metadata) for any DataFrame that contains the given data source path. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
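A sketch of the behavior the fix targets, using a made-up local path: before the fix, file source V2 can keep serving the stale cache after the path is overwritten; after invalidation, the second read should reflect the new data.
{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

object CacheInvalidationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val path = "/tmp/fsv2_cache_demo"   // placeholder path

    spark.range(10).write.mode(SaveMode.Overwrite).parquet(path)
    val df = spark.read.parquet(path)
    df.cache()
    println(df.count())                 // 10, and the data is now cached

    // Overwrite the same path through the file source.
    spark.range(100).write.mode(SaveMode.Overwrite).parquet(path)

    // With the fix, cached data for `path` is invalidated and refreshed,
    // so a fresh read sees 100 rows instead of the stale cached 10.
    println(spark.read.parquet(path).count())
    spark.stop()
  }
}
{code}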
[jira] [Resolved] (SPARK-25407) Spark throws a `ParquetDecodingException` when attempting to read a field from a complex type in certain cases of schema merging
[ https://issues.apache.org/jira/browse/SPARK-25407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25407. -- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 3.0.0 Fixed in https://github.com/apache/spark/pull/24307 > Spark throws a `ParquetDecodingException` when attempting to read a field > from a complex type in certain cases of schema merging > > > Key: SPARK-25407 > URL: https://issues.apache.org/jira/browse/SPARK-25407 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Michael Allman >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Spark supports merging schemata across table partitions in which one > partition is missing a subfield that's present in another. However, > attempting to select that missing field with a query that includes a > partition pruning predicate that filters out the partitions that include that > field results in a `ParquetDecodingException` when attempting to get the > query results. > This bug is specifically exercised by the failing (but ignored) test case > [https://github.com/apache/spark/blob/f2d35427eedeacceb6edb8a51974a7e8bbb94bc2/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L125-L131]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data
[ https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812469#comment-16812469 ] Wenchen Fan commented on SPARK-16548: - Do you have a small dataset to reproduce it? > java.io.CharConversionException: Invalid UTF-32 character prevents me from > querying my data > > > Key: SPARK-16548 > URL: https://issues.apache.org/jira/browse/SPARK-16548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Egor Pahomov >Priority: Minor > Fix For: 2.2.0, 2.3.0 > > > Basically, when I query my json data I get > {code} > java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above > 10) at char #192, byte #771) > at > com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) > at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571) > at > org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142) > {code} > I do not like it. If you cannot process one json among 100500, please return > null; do not fail everything. I have a dirty one-line fix, and I understand how > I can make it more reasonable. What is our position - what behaviour do we > want? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812488#comment-16812488 ] Thomas Graves commented on SPARK-27364: --- There are 3 main user-facing impacts: the TaskContext interface to fetch the resources, the user API to specify the gpu count, and how the executor discovers the gpus or is told about them. Below is more detail:

1) How the user gets the resources from the TaskContext and BarrierTaskContext. For the TaskContext interface I propose we add an API like: def getResources(): Map[String, ResourceInformation] where the Map key is the resource type, so examples would be "gpu", "fpga", etc.; "gpu" would be the only one we officially support to start with. ResourceInformation would be a class with a name, units, count, and addresses. The name would be "gpu"; the units for gpu would be empty "", but for other resource types like memory it could be GiB or similar; the count is the number of them, so for gpus it would be the number allocated; and finally the addresses Array of strings could be whatever we want - in the gpu case it would just be the indexes of the gpus allocated to the task, i.e. ["0", "2", "3"]. I made this a string so it is very flexible as to what the address is for different resource types. Now the user has to know how to interpret this, but depending on what you are doing with them, even the same tools have multiple ways to specify devices; for instance with TensorFlow you can specify CUDA_VISIBLE_DEVICES=2,3 or you can specify like: for d in ['/device:GPU:2', '/device:GPU:3']:. The proposed ResourceInformation fields and accessors: private val name: String, private val units: String, private val count: Long, private val addresses: Array[String] = Array.empty; def getName(): String = name; def getUnits(): String = units; def getCount(): Long = count; def getAddresses(): Array[String] = addresses.

2) How the user specifies the gpu resources upon application submission. Here we need multiple configs: a) one for the user to specify the gpus per task. To make that config extensible for other resources, I propose: *spark.task.resource.\{resource type}.count*. This implementation would only support gpu, but it gives us flexibility to add more. This allows for multiple resources as well as multiple configs for that resource: the resource type here would be gpu, but you could add fpga, and it also allows adding more configs besides count - you could add something like type if you want a certain gpu type, for instance. b) The user has to specify how many gpus per executor and driver. This one is a bit more complicated since it has to work with the resource managers to actually acquire those, but I think it makes sense to have common configs like we do for cores and memory. So we can have *spark.executor.resource.\{resource type}.count* and *spark.driver.resource.\{resource type}.count*. This implementation would only support gpu. The tricky thing here is that some of the resource managers already have configs for asking for gpus. Yarn has {{spark.yarn.executor.resource.\{resource-type}}}; although it was added in 3.0 and hasn't shipped yet, we can't just remove it since you could ask yarn for other resource types spark doesn't know about. For Kubernetes you have to request via the pod template, so I think it would be on the user to make sure those match. Mesos has {{spark.mesos.gpus.max}}. So we just need to make sure the new configs map into those, and having the duplicate configs might make it a bit weird for the user.

3) How the executor discovers or is told the gpu resources it has. Here I think we have 2 options for the user/resource manager. a) I propose we add a config *spark.\{executor, driver}.resource.gpu.discoverScript* to allow the user to specify a discovery script. This script gets run when the executor starts and the user requested gpus, to discover what gpus the executor has. A simple example would be a script that simply runs "nvidia-smi --query-gpu=index --format=csv,noheader" to get the gpu indexes for nvidia cards. You could make this script super simple or complicated depending on your setup. b) Also add an option to the executor launch, *--gpuDevices*, that allows the resource manager to specify the indexes of the gpu devices it has. This allows insecure or non-containerized resource managers like standalone mode to allocate gpus per executor without having containers and isolation all implemented
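A self-contained sketch of the ResourceInformation class exactly as proposed above, plus how a task might consume the result of the proposed TaskContext.getResources(); none of these names exist in Spark yet, so the map below is only a stand-in for what the proposed API would return.
{code:scala}
// ResourceInformation as proposed: name, units, count and opaque addresses.
class ResourceInformation(
    private val name: String,
    private val units: String,
    private val count: Long,
    private val addresses: Array[String] = Array.empty) {
  def getName(): String = name
  def getUnits(): String = units
  def getCount(): Long = count
  def getAddresses(): Array[String] = addresses
}

object GpuTaskSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the proposed TaskContext.get().getResources() result for a
    // task that was allocated gpus "0", "2" and "3".
    val resources: Map[String, ResourceInformation] =
      Map("gpu" -> new ResourceInformation("gpu", "", 3, Array("0", "2", "3")))

    // Example consumption: hand the addresses to TensorFlow via
    // CUDA_VISIBLE_DEVICES=0,2,3, as mentioned in the comment above.
    val cudaVisibleDevices = resources.get("gpu")
      .map(_.getAddresses().mkString(","))
      .getOrElse("")
    println(s"CUDA_VISIBLE_DEVICES=$cudaVisibleDevices")
  }
}
{code}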
[jira] [Comment Edited] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812488#comment-16812488 ] Thomas Graves edited comment on SPARK-27364 at 4/8/19 3:10 PM: --- There are 3 main user-facing impacts: the TaskContext interface to fetch the resources, the user API to specify the gpu count, and how the executor discovers the gpus or is told about them. Below is more detail:

1) How the user gets the resources from the TaskContext and BarrierTaskContext. For the TaskContext interface I propose we add an API like: def getResources(): Map[String, ResourceInformation] where the Map key is the resource type, so examples would be "gpu", "fpga", etc.; "gpu" would be the only one we officially support to start with. ResourceInformation would be a class with a name, units, count, and addresses. The name would be "gpu"; the units for gpu would be empty "", but for other resource types like memory it could be GiB or similar; the count is the number of them, so for gpus it would be the number allocated; and finally the addresses Array of strings could be whatever we want - in the gpu case it would just be the indexes of the gpus allocated to the task, i.e. ["0", "2", "3"]. I made this a string so it is very flexible as to what the address is for different resource types. Now the user has to know how to interpret this, but depending on what you are doing with them, even the same tools have multiple ways to specify devices; for instance with TensorFlow you can specify CUDA_VISIBLE_DEVICES=2,3 or you can specify like: for d in ['/device:GPU:2', '/device:GPU:3']:. The proposed ResourceInformation fields and accessors: private val name: String, private val units: String, private val count: Long, private val addresses: Array[String] = Array.empty; def getName(): String = name; def getUnits(): String = units; def getCount(): Long = count; def getAddresses(): Array[String] = addresses.

2) How the user specifies the gpu resources upon application submission. Here we need multiple configs: a) one for the user to specify the gpus per task. To make that config extensible for other resources, I propose: *spark.task.resource.\{resource type}.count*. This implementation would only support gpu, but it gives us flexibility to add more. This allows for multiple resources as well as multiple configs for that resource: the resource type here would be gpu, but you could add fpga, and it also allows adding more configs besides count - you could add something like type if you want a certain gpu type, for instance. b) The user has to specify how many gpus per executor and driver. This one is a bit more complicated since it has to work with the resource managers to actually acquire those, but I think it makes sense to have common configs like we do for cores and memory. So we can have *spark.executor.resource.\{resource type}.count* and *spark.driver.resource.\{resource type}.count*. This implementation would only support gpu. The tricky thing here is that some of the resource managers already have configs for asking for gpus. Yarn has {{spark.yarn.executor.resource.\{resource-type}}}; although it was added in 3.0 and hasn't shipped yet, we can't just remove it since you could ask yarn for other resource types spark doesn't know about. For Kubernetes you have to request via the pod template, so I think it would be on the user to make sure those match. Mesos has {{spark.mesos.gpus.max}}. So we just need to make sure the new configs map into those, and having the duplicate configs might make it a bit weird for the user.

3) How the executor discovers or is told the gpu resources it has. Here I think we have 2 options for the user/resource manager. a) I propose we add a config *spark.\{executor, driver}.resource.gpu.discoverScript* to allow the user to specify a discovery script. This script gets run when the executor starts and the user requested gpus, to discover what gpus the executor has. A simple example would be a script that simply runs "nvidia-smi --query-gpu=index --format=csv,noheader" to get the gpu indexes for nvidia cards. You could make this script super simple or complicated depending on your setup. The API for the script is that it is callable with no parameters and the script returns a string of comma-separated values; normally I would expect indexes like "0,1,2,3". b) Also add an option to the executor launch, *--gpuDevices*, that
[jira] [Created] (SPARK-27408) functions.coalesce working on csv but not on Mongospark
yashwanth created SPARK-27408: - Summary: functions.coalesce working on csv but not on Mongospark Key: SPARK-27408 URL: https://issues.apache.org/jira/browse/SPARK-27408 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.2.0 Reporter: yashwanth
e1.csv
id,code,type
1,,A
2,,
3,123,I
e2.csv
id,code,type
1,456,A
2,789,A1
3,,C
Dataset goldenCopy = e1.as("a").join(e2.as("b")).where("a.id == b.id"); goldenCopy.select(functions.coalesce(e1.col("code"), e2.col("code"))).show();
I am not able to run the above code on a Dataset obtained from mongo-spark; I had imported the same csv files into MongoDB using mongoimport. Refer to Stack Overflow: https://stackoverflow.com/questions/55570984/spark-functions-coalesce-not-working-on-mongodb-collections-but-works-on-csvs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
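For reference, a runnable sketch of the CSV side of the report (file paths are placeholders; the mongo-spark half is not shown since it needs the connector). With the two files quoted above, coalesce(e1.code, e2.code) should yield 456 for id 1, 789 for id 2 and 123 for id 3.
{code:scala}
import org.apache.spark.sql.{SparkSession, functions}

object CoalesceCsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // e1.csv and e2.csv hold the data quoted in the issue above.
    val e1 = spark.read.option("header", "true").csv("/tmp/e1.csv")
    val e2 = spark.read.option("header", "true").csv("/tmp/e2.csv")

    val goldenCopy = e1.as("a").join(e2.as("b")).where("a.id == b.id")

    // Empty CSV fields are read as null, so coalesce falls back to e2.code:
    // id=1 -> 456, id=2 -> 789, id=3 -> 123 (from e1).
    goldenCopy
      .select(functions.coalesce(e1.col("code"), e2.col("code")).as("code"))
      .show()

    spark.stop()
  }
}
{code}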
[jira] [Resolved] (SPARK-27176) Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4
[ https://issues.apache.org/jira/browse/SPARK-27176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27176. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 > Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4 > > > Key: SPARK-27176 > URL: https://issues.apache.org/jira/browse/SPARK-27176 > Project: Spark > Issue Type: Sub-task > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13704) TaskSchedulerImpl.createTaskSetManager can be expensive, and result in lost executors due to blocked heartbeats
[ https://issues.apache.org/jira/browse/SPARK-13704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-13704. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24245 [https://github.com/apache/spark/pull/24245] > TaskSchedulerImpl.createTaskSetManager can be expensive, and result in lost > executors due to blocked heartbeats > --- > > Key: SPARK-13704 > URL: https://issues.apache.org/jira/browse/SPARK-13704 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0 >Reporter: Zhong Wang >Priority: Major > Fix For: 3.0.0 > > > In some cases, TaskSchedulerImpl.createTaskSetManager can be expensive. For > example, in a Yarn cluster, it may call the topology script for rack > awareness. When submit a very large job in a very large Yarn cluster, the > topology script may take signifiant time to run. And this blocks receiving > executors' heartbeats, which may result in lost executors > Stacktraces we observed which is related to this issue: > {code} > "dag-scheduler-event-loop" daemon prio=10 tid=0x7f8392875800 nid=0x26e8 > runnable [0x7f83576f4000] >java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:272) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:273) > at java.io.BufferedInputStream.read(BufferedInputStream.java:334) > - locked <0xf551f460> (a > java.lang.UNIXProcess$ProcessPipeInputStream) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177) > - locked <0xf5529740> (a java.io.InputStreamReader) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:154) > at java.io.BufferedReader.read1(BufferedReader.java:205) > at java.io.BufferedReader.read(BufferedReader.java:279) > - locked <0xf5529740> (a java.io.InputStreamReader) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:728) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:524) > at org.apache.hadoop.util.Shell.run(Shell.java:455) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) > at > org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251) > at > org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188) > at > org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119) > at > org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101) > at > org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81) > at > org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:38) > at > org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:210) > at > org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:189) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:189) > at > 
org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:158) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > at > org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:157) > at > org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:187) > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:161) > - locked <0xea3b8a88> (a > org.apache.spark.scheduler.cluster.YarnScheduler) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:872) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778) > at > org.apache.spark.scheduler.DAGScheduler.handleJ
[jira] [Assigned] (SPARK-13704) TaskSchedulerImpl.createTaskSetManager can be expensive, and result in lost executors due to blocked heartbeats
[ https://issues.apache.org/jira/browse/SPARK-13704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-13704: Assignee: Lantao Jin > TaskSchedulerImpl.createTaskSetManager can be expensive, and result in lost > executors due to blocked heartbeats > --- > > Key: SPARK-13704 > URL: https://issues.apache.org/jira/browse/SPARK-13704 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0 >Reporter: Zhong Wang >Assignee: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > In some cases, TaskSchedulerImpl.createTaskSetManager can be expensive. For > example, in a Yarn cluster, it may call the topology script for rack > awareness. When submit a very large job in a very large Yarn cluster, the > topology script may take signifiant time to run. And this blocks receiving > executors' heartbeats, which may result in lost executors > Stacktraces we observed which is related to this issue: > {code} > "dag-scheduler-event-loop" daemon prio=10 tid=0x7f8392875800 nid=0x26e8 > runnable [0x7f83576f4000] >java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:272) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:273) > at java.io.BufferedInputStream.read(BufferedInputStream.java:334) > - locked <0xf551f460> (a > java.lang.UNIXProcess$ProcessPipeInputStream) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177) > - locked <0xf5529740> (a java.io.InputStreamReader) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:154) > at java.io.BufferedReader.read1(BufferedReader.java:205) > at java.io.BufferedReader.read(BufferedReader.java:279) > - locked <0xf5529740> (a java.io.InputStreamReader) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:728) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:524) > at org.apache.hadoop.util.Shell.run(Shell.java:455) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) > at > org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251) > at > org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188) > at > org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119) > at > org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101) > at > org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81) > at > org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:38) > at > org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:210) > at > org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:189) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:189) > at > org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:158) > 
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > at > org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:157) > at > org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:187) > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:161) > - locked <0xea3b8a88> (a > org.apache.spark.scheduler.cluster.YarnScheduler) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:872) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762) > at > org.apache.spark.
[jira] [Updated] (SPARK-23710) Upgrade the built-in Hive to 2.3.4 for hadoop-3.2
[ https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23710: Target Version/s: 3.0.0 > Upgrade the built-in Hive to 2.3.4 for hadoop-3.2 > - > > Key: SPARK-23710 > URL: https://issues.apache.org/jira/browse/SPARK-23710 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Critical > > Spark fail to run on Hadoop 3.x, because Hive's ShimLoader considers Hadoop > 3.x to be an unknown Hadoop version. see SPARK-18673 and HIVE-16081 for more > details. So we need to upgrade the built-in Hive for Hadoop-3.x. This is an > umbrella JIRA to track this upgrade. > > *Upgrade Plan*: > # SPARK-27054 Remove the Calcite dependency. This can avoid some jar > conflicts. > # SPARK-23749 Replace built-in Hive API (isSub/toKryo) and remove > OrcProto.Type usage > # SPARK-27158, SPARK-27130 Update dev/* to support dynamic change profiles > when testing > # Fix ORC dependency conflict to makes it test passed on Hive 1.2.1 and > compile passed on Hive 2.3.4 > # Add an empty hive-thriftserverV2 module. then we could test all test cases > in next step > # Make Hadoop-3.1 with Hive 2.3.4 test passed > # Adapted hive-thriftserverV2 from hive-thriftserver with Hive 2.3.4's > [TCLIService.thrift|https://github.com/apache/hive/blob/rel/release-2.3.4/service-rpc/if/TCLIService.thrift] > > I have completed the [initial > work|https://github.com/apache/spark/pull/24044] and plan to finish this > upgrade step by step. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812628#comment-16812628 ] shane knapp commented on SPARK-27389: - JDKs haven't changed on the jenkins workers in a while, and neither have the python pytz packages... i'm not really sure what's going on here and why this just started failing. i'll poke around more (later) today, after i get caught up from the latter half of last week. > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: shane knapp >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. > Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > 
(pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27348) HeartbeatReceiver doesn't remove lost executors from CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-27348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812657#comment-16812657 ] Shixiong Zhu commented on SPARK-27348: -- [~sandeep.katta2007] I cannot reproduce this locally. Ideally, when we decide to remove an executor, we should remove it from all places rather than counting on a TCP disconnect event which may not happen sometimes. > HeartbeatReceiver doesn't remove lost executors from > CoarseGrainedSchedulerBackend > -- > > Key: SPARK-27348 > URL: https://issues.apache.org/jira/browse/SPARK-27348 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Priority: Major > > When a heartbeat timeout happens in HeartbeatReceiver, it doesn't remove lost > executors from CoarseGrainedSchedulerBackend. When a connection of an > executor is not gracefully shut down, CoarseGrainedSchedulerBackend may not > receive a disconnect event. In this case, CoarseGrainedSchedulerBackend still > thinks a lost executor is still alive. CoarseGrainedSchedulerBackend may ask > TaskScheduler to run tasks on this lost executor. This task will never finish > and the job will hang forever. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3
Prabhjot Singh Bharaj created SPARK-27409: - Summary: Micro-batch support for Kafka Source in Spark 2.3 Key: SPARK-27409 URL: https://issues.apache.org/jira/browse/SPARK-27409 Project: Spark Issue Type: Question Components: Structured Streaming Affects Versions: 2.3.2 Reporter: Prabhjot Singh Bharaj It seems with this change - [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50] in Spark 2.3 for Kafka Source Provider, a Kafka source can not be run in micro-batch mode but only in continuous mode. Is that understanding correct ? {code:java} E Py4JJavaError: An error occurred while calling o217.load. E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer E at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717) E at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566) E at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549) E at org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62) E at org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314) E at org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78) E at org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) E at org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) E at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) E at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) E at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) E at java.lang.reflect.Method.invoke(Method.java:498) E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) E at py4j.Gateway.invoke(Gateway.java:282) E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) E at py4j.commands.CallCommand.execute(CallCommand.java:79) E at py4j.GatewayConnection.run(GatewayConnection.java:238) E at java.lang.Thread.run(Thread.java:748) E Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: non-existent (No such file or directory) E at org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44) E at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93) E at org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51) E at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84) E at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657) E ... 19 more E Caused by: org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: non-existent (No such file or directory) E at org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121) E at org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41) E ... 
23 more E Caused by: java.io.FileNotFoundException: non-existent (No such file or directory) E at java.io.FileInputStream.open0(Native Method) E at java.io.FileInputStream.open(FileInputStream.java:195) E at java.io.FileInputStream.(FileInputStream.java:138) E at java.io.FileInputStream.(FileInputStream.java:93) E at org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:216) E at org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:201) E at org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:137) E at org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:119) E ... 24 more{code} When running a simple data stream loader for kafka without an SSL cert, it goes through this code block - {code:java} ... ... org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) E at org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) E at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) ... ...{code} Note that I haven't selected `trigger=continuous...` when creating the dataframe, still the code is going through the continuous path. My understanding was that `continuous` is optional and not the default. Please clarify. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.or
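For reference, a sketch of an ordinary micro-batch Kafka query on Spark 2.3 (broker, topic and trigger interval are placeholders; the spark-sql-kafka-0-10 package must be on the classpath). The createContinuousReader call in the stack trace above appears to come from DataStreamReader.load() obtaining the schema, which is separate from the execution mode; micro-batch remains the default unless Trigger.Continuous is requested on the write side.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object KafkaMicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-micro-batch").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
      .option("subscribe", "events")                      // placeholder topic
      .load()

    val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))      // micro-batch trigger
      .start()

    query.awaitTermination()
  }
}
{code}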
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812704#comment-16812704 ] shane knapp commented on SPARK-27389: - is this even really a valid timezone? plus, i really don't think this is a jenkins issue per se. i whipped up some java to check for this timezone, which is there: {code} $ java DisplayZoneAndOffSet|grep Pacific-New US/Pacific-New (UTC-07:00) {code} but it's definitely not a valid pytz timezone: {code} $ python2.7 -c 'import pytz; print "US/Pacific-New" in pytz.all_timezones' False {code} as a work-around... i *could* hack {code}/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py{code} to include US/Pacific-New on all of the workers. ;) > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: shane knapp >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
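One way to make the mismatch described above concrete is to diff the system zoneinfo entries against pytz's bundled database; a small sketch, assuming the standard Linux /usr/share/zoneinfo layout used on the workers:
{code:python}
# Sketch: list US/* zones that exist in the system zoneinfo directory but are
# unknown to pytz's bundled tz database. Assumes the standard Linux layout.
import os
import pytz

system_us = {"US/" + name for name in os.listdir("/usr/share/zoneinfo/US")}
print(sorted(system_us - set(pytz.all_timezones)))
# On the affected workers this would be expected to print ['US/Pacific-New'].
{code}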
[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812704#comment-16812704 ] shane knapp edited comment on SPARK-27389 at 4/8/19 6:56 PM: - is this even really a valid timezone? plus, i really don't think this is a jenkins issue per se. i whipped up some java to check for this timezone, which is there: {code} $ java DisplayZoneAndOffSet|grep Pacific-New US/Pacific-New (UTC-07:00) {code} but it's definitely not a valid pytz timezone: {code} $ python2.7 -c 'import pytz; print "US/Pacific-New" in pytz.all_timezones' False {code} we're also running the latest version of pytz (according to pip at least): {code} $ pip2.7 install -U pytz Requirement already up-to-date: pytz in /home/anaconda/lib/python2.7/site-packages (2018.9) $ pip2.7 show pytz Name: pytz Version: 2018.9 Summary: World timezone definitions, modern and historical Home-page: http://pythonhosted.org/pytz Author: Stuart Bishop Author-email: stu...@stuartbishop.net License: MIT Location: /home/anaconda/lib/python2.7/site-packages Requires: Required-by: pandas {code} as a work-around... i *could* hack {code}/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py{code} to include US/Pacific-New on all of the workers. ;) was (Author: shaneknapp): is this even really a valid timezone? plus, i really don't think this is a jenkins issue per se. i whipped up some java to check for this timezone, which is there: {code} $ java DisplayZoneAndOffSet|grep Pacific-New US/Pacific-New (UTC-07:00) {code} but it's definitely not a valid pytz timezone: {code} $ python2.7 -c 'import pytz; print "US/Pacific-New" in pytz.all_timezones' False {code} as a work-around... i *could* hack {code}/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py{code} to include US/Pacific-New on all of the workers. ;) > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: shane knapp >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilde
[jira] [Assigned] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp reassigned SPARK-27389: --- Assignee: (was: shane knapp) > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. > Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.6
[ https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812756#comment-16812756 ] shane knapp commented on SPARK-25079: - waiting on [~bryanc] to release pyarrow 0.12.1 before merging https://github.com/apache/spark/pull/24266 > [PYTHON] upgrade python 3.4 -> 3.6 > -- > > Key: SPARK-25079 > URL: https://issues.apache.org/jira/browse/SPARK-25079 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > for the impending arrow upgrade > (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python > 3.4 -> 3.5. > i have been testing this here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69] > my methodology: > 1) upgrade python + arrow to 3.5 and 0.10.0 > 2) run python tests > 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and > upgrade centos workers to python3.5 > 4) simultaneously do the following: > - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that > points to python3.5 (this is currently being tested here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)] > - push a change to python/run-tests.py replacing 3.4 with 3.5 > 5) once the python3.5 change to run-tests.py is merged, we will need to > back-port this to all existing branches > 6) then and only then can i remove the python3.4 -> python3.5 symlink -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
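For the symlink in step 4 of the plan above, the change amounts to pointing the old python3.4 name at the new interpreter inside the anaconda env; a sketch in Python (in practice it is a one-line `ln -s` on each worker), using the path from the comment:
{code:python}
# Sketch of the python3.4 -> python3.5 symlink from step 4 of the plan above.
# The env path comes from the comment; run once per worker with sufficient
# permissions.
import os

env_bin = "/home/anaconda/envs/py3k/bin"
link = os.path.join(env_bin, "python3.4")
if not os.path.islink(link) and not os.path.exists(link):
    os.symlink(os.path.join(env_bin, "python3.5"), link)
{code}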
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812765#comment-16812765 ] Sean Owen commented on SPARK-27389: --- On the question of what the heck it is, comically: https://mm.icann.org/pipermail/tz/2009-February/015448.html So.. hm does this suggest it is the OS with something about this installed somewhere? This bug was reported against pytz over a decade ago > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. > Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File 
"/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812767#comment-16812767 ] Sean Owen commented on SPARK-27389: --- What about updating tzdata? > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. > Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - 
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812784#comment-16812784 ] shane knapp commented on SPARK-27389: - well, this started happening ~6am PST on april 2nd as best as i can tell. regarding the tzinfo on the centos workers (where this is failing), nothing has changed for a year: {noformat} $ ls -l /usr/share/zoneinfo/US total 52 -rw-r--r--. 2 root root 2354 Apr 3 2017 Alaska -rw-r--r--. 3 root root 2339 Apr 3 2017 Aleutian -rw-r--r--. 2 root root 327 Apr 3 2017 Arizona -rw-r--r--. 2 root root 3543 Apr 3 2017 Central -rw-r--r--. 3 root root 3519 Apr 3 2017 Eastern -rw-r--r--. 4 root root 1649 Apr 3 2017 East-Indiana -rw-r--r--. 3 root root 250 Apr 3 2017 Hawaii -rw-r--r--. 3 root root 2395 Apr 3 2017 Indiana-Starke -rw-r--r--. 2 root root 2202 Apr 3 2017 Michigan -rw-r--r--. 4 root root 2427 Apr 3 2017 Mountain -rw-r--r--. 3 root root 2819 Apr 3 2017 Pacific -rw-r--r--. 3 root root 2819 Apr 3 2017 Pacific-New -rw-r--r--. 4 root root 174 Apr 3 2017 Samoa {noformat} anyways: i still believe that this is a pyspark problem, not a jenkins worker configuration problem. > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) -
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812785#comment-16812785 ] shane knapp commented on SPARK-27389: - [~srowen] sure, i can update the tzdata package on the centos workers... let's see if that does anything. this will take ~5 mins. > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. > Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > 
UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812789#comment-16812789 ] shane knapp commented on SPARK-27389: - updating tzdata didn't do anything noticeable: {noformat} [sknapp@amp-jenkins-worker-04 ~]$ python2.7 -c 'import pytz; print "US/Pacific-New" in pytz.all_timezones' False [sknapp@amp-jenkins-worker-04 ~]$ which python2.7 /home/anaconda/bin/python2.7 {noformat} > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. > Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File 
"/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812789#comment-16812789 ] shane knapp edited comment on SPARK-27389 at 4/8/19 9:03 PM: - updating tzdata didn't do anything noticeable: {noformat} [sknapp@amp-jenkins-worker-04 ~]$ python2.7 -c 'import pytz; print "US/Pacific-New" in pytz.all_timezones' False [sknapp@amp-jenkins-worker-04 ~]$ which python2.7 /home/anaconda/bin/python2.7 {noformat} this is actually expected as pytz stores it's OWN tzdata (see my earlier comment about hacking anaconda/lib/python2.7/site-packages/pytz/__init__.py). was (Author: shaneknapp): updating tzdata didn't do anything noticeable: {noformat} [sknapp@amp-jenkins-worker-04 ~]$ python2.7 -c 'import pytz; print "US/Pacific-New" in pytz.all_timezones' False [sknapp@amp-jenkins-worker-04 ~]$ which python2.7 /home/anaconda/bin/python2.7 {noformat} > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
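The reason the yum upgrade changes nothing is, as noted above, that pytz carries its own copy of the tz database; a quick sketch to confirm this from the interpreter, assuming pytz exposes the bundled release as OLSON_VERSION (true for the pytz versions I have seen, but treat it as an assumption):
{code:python}
# Sketch: pytz bundles its own copy of the tz database, so upgrading the OS
# tzdata package and upgrading pytz are independent operations.
# OLSON_VERSION as the bundled-release attribute is an assumption here.
import pytz

print("pytz package version:", pytz.__version__)      # e.g. 2018.9
print("bundled tz database :", pytz.OLSON_VERSION)    # e.g. 2018i
print("US/Pacific-New known:", "US/Pacific-New" in pytz.all_timezones)
{code}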
[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812789#comment-16812789 ] shane knapp edited comment on SPARK-27389 at 4/8/19 9:09 PM: - updating tzdata (via pip and yum) didn't do anything noticeable: {noformat} [sknapp@amp-jenkins-worker-04 ~]$ python2.7 -c 'import pytz; print "US/Pacific-New" in pytz.all_timezones' False [sknapp@amp-jenkins-worker-04 ~]$ which python2.7 /home/anaconda/bin/python2.7 {noformat} this is actually expected as pytz stores it's OWN tzdata (see my earlier comment about hacking anaconda/lib/python2.7/site-packages/pytz/__init__.py). was (Author: shaneknapp): updating tzdata didn't do anything noticeable: {noformat} [sknapp@amp-jenkins-worker-04 ~]$ python2.7 -c 'import pytz; print "US/Pacific-New" in pytz.all_timezones' False [sknapp@amp-jenkins-worker-04 ~]$ which python2.7 /home/anaconda/bin/python2.7 {noformat} this is actually expected as pytz stores it's OWN tzdata (see my earlier comment about hacking anaconda/lib/python2.7/site-packages/pytz/__init__.py). > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812802#comment-16812802 ] Sean Owen commented on SPARK-27389: --- I wonder what has created /usr/share/zoneinfo/US/Pacific-New ? AFAICT that shouldn't be there. It was updated at about the same time -- not just that one TZ but the whole thing. Doesn't sound like it's pytz; that's just the Python timezone library. Can't really be Pyspark; this isn't something in the Spark code at all. Here's a complaint about tzdata providing this from a few years ago: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=815200 Removed in 2018d-1? https://launchpad.net/ubuntu/+source/tzdata/+changelog > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812838#comment-16812838 ] shane knapp commented on SPARK-27389: - well, according to [~bryanc]: """ >From the stacktrace, it looks like it's getting this from >"spark.sql.session.timeZone" which defaults to Java.util >TimeZone.getDefault.getID() """ here are the versions of tzdata* installed on the workers having this problem: {noformat} tzdata-2019a-1.el6.noarch tzdata-java-2019a-1.el6.noarch {noformat} looks like we're on the latest, but US/Pacific-New is STILL showing up in /usr/share/zoneinfo/US. when i dig in to the java tzdata package, i am finding the following: {noformat} $ strings /usr/share/javazi/ZoneInfoMappings ...bunch of cruft deleted... US/Pacific America/Los_Angeles US/Pacific-New America/Los_Angeles {noformat} so, it appears to me that: 1) the OS still sees US/Pacific-New via tzdata 2) java still sees US/Pacific-New via tzdata-java 3) python has no idea WTF US/Pacific-New is and (occasionally) barfs during pyspark unit tests so, should i go ahead and manually hack lib/python2.7/site-packages/pytz/__init__.py and add 'US/Pacific-New' which will fix the symptom (w/o fixing the cause)? other than doing that, i'm actually stumped as to why this literally just started failing. > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone
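A small diagnostic/workaround sketch for the behaviour quoted from Bryan above: read back the JVM default zone that seeds spark.sql.session.timeZone, and pin the session time zone to a canonical name pytz does know before calling toPandas(). spark._jvm is an internal PySpark handle, and this only sidesteps the symptom rather than fixing the worker tzdata:
{code:python}
# Diagnostic/workaround sketch, not the project's fix.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The JVM default zone ID is what seeds the spark.sql.session.timeZone default.
# spark._jvm is an internal handle, used here purely for inspection.
print("JVM default zone     :", spark._jvm.java.util.TimeZone.getDefault().getID())
print("session timeZone conf:", spark.conf.get("spark.sql.session.timeZone"))

# Pin the session time zone to a canonical name that pytz also recognises,
# so toPandas() never hands 'US/Pacific-New' to pytz.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
{code}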
[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812841#comment-16812841 ] shane knapp edited comment on SPARK-27389 at 4/8/19 10:06 PM: -- also, java8 appears to believe i'm in the US/Pacific (not Pacific-New) TZ: {noformat} [sknapp@amp-jenkins-worker-04 ~]$ cat tz.java import java.util.TimeZone; public class tz { public static void main(String[] args) { TimeZone tz = TimeZone.getDefault(); System.out.println(tz.getID()); } } [sknapp@amp-jenkins-worker-04 ~]$ javac tz.java [sknapp@amp-jenkins-worker-04 ~]$ java tz US/Pacific [sknapp@amp-jenkins-worker-04 ~]$ java -version java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) {noformat} was (Author: shaneknapp): also, java8 appears to believe i'm in the US/Pacific (not Pacific-New) TZ: {preformat} [sknapp@amp-jenkins-worker-04 ~]$ cat tz.java import java.util.TimeZone; public class tz { public static void main(String[] args) { TimeZone tz = TimeZone.getDefault(); System.out.println(tz.getID()); } } [sknapp@amp-jenkins-worker-04 ~]$ javac tz.java [sknapp@amp-jenkins-worker-04 ~]$ java tz US/Pacific [sknapp@amp-jenkins-worker-04 ~]$ java -version java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) {preformat} > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_t
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812841#comment-16812841 ] shane knapp commented on SPARK-27389: - also, java8 appears to believe i'm in the US/Pacific (not Pacific-New) TZ: {preformat} [sknapp@amp-jenkins-worker-04 ~]$ cat tz.java import java.util.TimeZone; public class tz { public static void main(String[] args) { TimeZone tz = TimeZone.getDefault(); System.out.println(tz.getID()); } } [sknapp@amp-jenkins-worker-04 ~]$ javac tz.java [sknapp@amp-jenkins-worker-04 ~]$ java tz US/Pacific [sknapp@amp-jenkins-worker-04 ~]$ java -version java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) {preformat} > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
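For reference, the mismatch the traceback points at can be probed directly on a worker with a few lines of Python. This is a minimal sketch, assuming pytz and the standard /usr/share/zoneinfo layout mentioned in the description; it compares what the OS and environment advertise with what pytz (the library the pandas tz_convert call above ends up in) actually recognises:

{code:python}
import os
import time

import pytz

# What the environment and the C library report on this worker.
print("TZ env var:      ", os.environ.get("TZ"))
print("time.tzname:     ", time.tzname)

# Whether the system zoneinfo database ships the legacy alias at all.
print("system zoneinfo: ", os.path.exists("/usr/share/zoneinfo/US/Pacific-New"))

# Whether pytz recognises the name; this is the lookup that raises in toPandas().
zone = "US/Pacific-New"
print("known to pytz:   ", zone in pytz.all_timezones_set)
try:
    pytz.timezone(zone)
    print("pytz.timezone(%r) resolved fine" % zone)
except pytz.UnknownTimeZoneError:
    print("pytz raises UnknownTimeZoneError for %r" % zone)
{code}

If the system zoneinfo file exists but pytz reports the zone as unknown, that would explain why only the pandas conversion path inside toPandas() fails while the JVM-side code keeps working.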
[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812841#comment-16812841 ] shane knapp edited comment on SPARK-27389 at 4/8/19 10:07 PM: -- also, java8 appears to believe i'm in the US/Pacific (not Pacific-New) TZ: {noformat} [sknapp@amp-jenkins-worker-04 ~]$ cat tz.java import java.util.TimeZone; public class tz { public static void main(String[] args) { TimeZone tz = TimeZone.getDefault(); System.out.println(tz.getID()); } } [sknapp@amp-jenkins-worker-04 ~]$ javac tz.java [sknapp@amp-jenkins-worker-04 ~]$ java tz US/Pacific [sknapp@amp-jenkins-worker-04 ~]$ java -version java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) {noformat} was (Author: shaneknapp): also, java8 appears to believe i'm in the US/Pacific (not Pacific-New) TZ: {noformat} [sknapp@amp-jenkins-worker-04 ~]$ cat tz.java import java.util.TimeZone; public class tz { public static void main(String[] args) { TimeZone tz = TimeZone.getDefault(); System.out.println(tz.getID()); } } [sknapp@amp-jenkins-worker-04 ~]$ javac tz.java [sknapp@amp-jenkins-worker-04 ~]$ java tz US/Pacific [sknapp@amp-jenkins-worker-04 ~]$ java -version java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) {noormat} > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz
[jira] [Updated] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data
[ https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bijith Kumar updated SPARK-16548: - Attachment: corrupted.json > java.io.CharConversionException: Invalid UTF-32 character prevents me from > querying my data > > > Key: SPARK-16548 > URL: https://issues.apache.org/jira/browse/SPARK-16548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Egor Pahomov >Priority: Minor > Fix For: 2.2.0, 2.3.0 > > Attachments: corrupted.json > > > Basically, when I query my json data I get > {code} > java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above > 10) at char #192, byte #771) > at > com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) > at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571) > at > org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142) > {code} > I do not like it. If you can not process one json among 100500 please return > null, do not fail everything. I have dirty one line fix, and I understand how > I can make it more reasonable. What is our position - what behaviour we wanna > get? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data
[ https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812858#comment-16812858 ] Bijith Kumar commented on SPARK-16548: -- [~cloud_fan], I couldn't find the specific character of the corrupted data that is causing the issue. However, here is the corrupted section from file to reproduce the issue. Please see attached - [^corrupted.json]. > java.io.CharConversionException: Invalid UTF-32 character prevents me from > querying my data > > > Key: SPARK-16548 > URL: https://issues.apache.org/jira/browse/SPARK-16548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Egor Pahomov >Priority: Minor > Fix For: 2.2.0, 2.3.0 > > Attachments: corrupted.json > > > Basically, when I query my json data I get > {code} > java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above > 10) at char #192, byte #771) > at > com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) > at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571) > at > org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142) > {code} > I do not like it. If you can not process one json among 100500 please return > null, do not fail everything. I have dirty one line fix, and I understand how > I can make it more reasonable. What is our position - what behaviour we wanna > get? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
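On the "return null instead of failing" behaviour requested in the description: for the file-based JSON reader, Spark already exposes a permissive mode that diverts unparseable records into a designated column instead of aborting the whole query. The sketch below assumes a Spark 2.x session and uses the attached corrupted.json as input; note that the stack trace above goes through the get_json_object expression rather than the file reader, and that expression path is what the listed fix versions address:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tolerant-json-read").getOrCreate()

# PERMISSIVE mode keeps parsing past malformed records and stores the raw
# offending line in the corrupt-record column instead of failing the job.
df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("corrupted.json"))  # path to the attached sample, adjust as needed

df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)
{code}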
[jira] [Created] (SPARK-27410) Remove deprecated/no-op mllib.Kmeans get/setRuns methods
Sean Owen created SPARK-27410: - Summary: Remove deprecated/no-op mllib.Kmeans get/setRuns methods Key: SPARK-27410 URL: https://issues.apache.org/jira/browse/SPARK-27410 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 3.0.0 Reporter: Sean Owen Assignee: Sean Owen mllib.KMeans has getRuns, setRuns methods which haven't done anything since Spark 2.1. They're deprecated, and no-ops, and should be removed for Spark 3. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812877#comment-16812877 ] Bryan Cutler commented on SPARK-27389: -- [~shaneknapp], I had a couple of successful tests with worker-4. Do you know if the problem consistent on certain workers or just random on all of them? > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. > Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise 
UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25407) Spark throws a `ParquetDecodingException` when attempting to read a field from a complex type in certain cases of schema merging
[ https://issues.apache.org/jira/browse/SPARK-25407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25407: - Assignee: Michael Allman (was: Dongjoon Hyun) > Spark throws a `ParquetDecodingException` when attempting to read a field > from a complex type in certain cases of schema merging > > > Key: SPARK-25407 > URL: https://issues.apache.org/jira/browse/SPARK-25407 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Michael Allman >Assignee: Michael Allman >Priority: Major > Fix For: 3.0.0 > > > Spark supports merging schemata across table partitions in which one > partition is missing a subfield that's present in another. However, > attempting to select that missing field with a query that includes a > partition pruning predicate that filters out the partitions that include that > field results in a `ParquetDecodingException` when attempting to get the > query results. > This bug is specifically exercised by the failing (but ignored) test case > [https://github.com/apache/spark/blob/f2d35427eedeacceb6edb8a51974a7e8bbb94bc2/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L125-L131]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812883#comment-16812883 ] shane knapp commented on SPARK-27389: - no, it appears to be random. [https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/365/] [https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/364/] these two identical builds ran w/the same python/java/whathaveyou setup on the *same physical worker*. one passes, one fails w/the date thing. > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. > Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File 
"pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812883#comment-16812883 ] shane knapp edited comment on SPARK-27389 at 4/9/19 12:05 AM: -- -no, it appears to be random.- -[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/365/]- -[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/364/]- -these two identical builds ran w/the same python/java/whathaveyou setup on the *same physical worker*. one passes, one fails w/the date thing.- bad example, pls hold. was (Author: shaneknapp): no, it appears to be random. [https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/365/] [https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/364/] these two identical builds ran w/the same python/java/whathaveyou setup on the *same physical worker*. one passes, one fails w/the date thing. > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812883#comment-16812883 ] shane knapp edited comment on SPARK-27389 at 4/9/19 12:21 AM: -- -no, it appears to be random.- -[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/365/]- -[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/364/]- -these two identical builds ran w/the same python/java/whathaveyou setup on the *same physical worker*. one passes, one fails w/the date thing.- bad example, pls hold. i need to do some more build archaeology this evening and tomorrow. i'm aware that this is important. :) was (Author: shaneknapp): -no, it appears to be random.- -[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/365/]- -[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-sbt-hadoop-2.7/364/]- -these two identical builds ran w/the same python/java/whathaveyou setup on the *same physical worker*. one passes, one fails w/the date thing.- bad example, pls hold. > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-ma
[jira] [Assigned] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver
[ https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26881: - Assignee: Rafael RENAUDIN-AVINO > Scaling issue with Gramian computation for RowMatrix: too many results sent > to driver > - > > Key: SPARK-26881 > URL: https://issues.apache.org/jira/browse/SPARK-26881 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Rafael RENAUDIN-AVINO >Assignee: Rafael RENAUDIN-AVINO >Priority: Minor > > This issue hit me when running PCA on large dataset (~1Billion rows, ~30k > columns). > Computing Gramian of a big RowMatrix allows to reproduce the issue. > > The problem arises in the treeAggregate phase of the gramian matrix > computation: results sent to driver are enormous. > A potential solution to this could be to replace the hard coded depth (2) of > the tree aggregation by a heuristic computed based on the number of > partitions, driver max result size, and memory size of the dense vectors that > are being aggregated, cf below for more detail: > (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size > I have a potential fix ready (currently testing it at scale), but I'd like to > hear the community opinion about such a fix to know if it's worth investing > my time into a clean pull request. > > Note that I only faced this issue with spark 2.2 but I suspect it affects > later versions aswell. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
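The inequality in the description pins down the smallest tree depth that keeps the final aggregation under the driver's result-size limit: depth >= log(nb_partitions) / log(driver_max_result_size / dense_vector_size). A small sketch of that heuristic follows; the function name and the example numbers are illustrative only, and the actual patch may compute this differently:

{code:python}
import math

def suggested_tree_depth(num_partitions, vector_size_bytes, max_result_size_bytes,
                         min_depth=2):
    """Smallest d with num_partitions ** (1.0 / d) * vector_size <= max_result_size."""
    ratio = max_result_size_bytes / float(vector_size_bytes)
    if ratio <= 1.0:
        raise ValueError("a single aggregated vector already exceeds the "
                         "driver max result size")
    depth = int(math.ceil(math.log(num_partitions) / math.log(ratio)))
    return max(min_depth, depth)

# Illustrative numbers: a 5000-column Gramian is an upper-triangular packed
# array of n * (n + 1) / 2 doubles, i.e. ~100 MB per aggregated vector.
vector_bytes = 5000 * 5001 // 2 * 8
print(suggested_tree_depth(num_partitions=2000,
                           vector_size_bytes=vector_bytes,
                           max_result_size_bytes=1 * 1024 ** 3))  # -> 4
{code}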
[jira] [Resolved] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver
[ https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26881. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23983 [https://github.com/apache/spark/pull/23983] > Scaling issue with Gramian computation for RowMatrix: too many results sent > to driver > - > > Key: SPARK-26881 > URL: https://issues.apache.org/jira/browse/SPARK-26881 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Rafael RENAUDIN-AVINO >Assignee: Rafael RENAUDIN-AVINO >Priority: Minor > Fix For: 3.0.0 > > > This issue hit me when running PCA on large dataset (~1Billion rows, ~30k > columns). > Computing Gramian of a big RowMatrix allows to reproduce the issue. > > The problem arises in the treeAggregate phase of the gramian matrix > computation: results sent to driver are enormous. > A potential solution to this could be to replace the hard coded depth (2) of > the tree aggregation by a heuristic computed based on the number of > partitions, driver max result size, and memory size of the dense vectors that > are being aggregated, cf below for more detail: > (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size > I have a potential fix ready (currently testing it at scale), but I'd like to > hear the community opinion about such a fix to know if it's worth investing > my time into a clean pull request. > > Note that I only faced this issue with spark 2.2 but I suspect it affects > later versions aswell. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812991#comment-16812991 ] Shivu Sondur commented on SPARK-27409: -- i am checking this > Micro-batch support for Kafka Source in Spark 2.3 > - > > Key: SPARK-27409 > URL: https://issues.apache.org/jira/browse/SPARK-27409 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Prabhjot Singh Bharaj >Priority: Major > > It seems with this change - > [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50] > in Spark 2.3 for Kafka Source Provider, a Kafka source can not be run in > micro-batch mode but only in continuous mode. Is that understanding correct ? > {code:java} > E Py4JJavaError: An error occurred while calling o217.load. > E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549) > E at > org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > E at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > E at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > E at java.lang.reflect.Method.invoke(Method.java:498) > E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > E at py4j.Gateway.invoke(Gateway.java:282) > E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > E at py4j.commands.CallCommand.execute(CallCommand.java:79) > E at py4j.GatewayConnection.run(GatewayConnection.java:238) > E at java.lang.Thread.run(Thread.java:748) > E Caused by: org.apache.kafka.common.KafkaException: > org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: > non-existent (No such file or directory) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44) > E at > org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93) > E at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51) > E at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657) > E ... 19 more > E Caused by: org.apache.kafka.common.KafkaException: > java.io.FileNotFoundException: non-existent (No such file or directory) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41) > E ... 
23 more > E Caused by: java.io.FileNotFoundException: non-existent (No such file or > directory) > E at java.io.FileInputStream.open0(Native Method) > E at java.io.FileInputStream.open(FileInputStream.java:195) > E at java.io.FileInputStream.(FileInputStream.java:138) > E at java.io.FileInputStream.(FileInputStream.java:93) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:216) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:201) > E at > org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:137) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:119) > E ... 24 more{code} > When running a simple data stream loader for kafka without an SSL cert, it > goes through this code block - > > {code:java} > ... > ... > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > ... > ...{code} > > Note that I haven't selected `trigger=continu
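For context on where the stack trace starts: in the 2.3 line the Kafka source's schema is resolved when {{readStream ... load()}} is called, and as the trace shows that goes through {{createContinuousReader}} (constructing a KafkaConsumer, which appears to be why a bad SSL store path already fails there) regardless of which trigger the query later uses. Below is a minimal sketch of that call sequence; the broker address, topic name and console sink are placeholders:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-microbatch-check").getOrCreate()

# load() is where DataStreamReader reaches createContinuousReader in the
# trace above; no trigger has been chosen yet at this point.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder broker
      .option("subscribe", "some-topic")                   # placeholder topic
      .load())

# The trigger is only picked on the write side: processingTime requests the
# usual micro-batch execution, while trigger(continuous=...) would request
# continuous processing.
query = (df.writeStream
         .format("console")
         .trigger(processingTime="10 seconds")
         .start())
{code}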
[jira] [Assigned] (SPARK-27328) Create 'deprecate' property in ExpressionDescription for SQL functions documentation
[ https://issues.apache.org/jira/browse/SPARK-27328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27328: --- Assignee: Hyukjin Kwon > Create 'deprecate' property in ExpressionDescription for SQL functions > documentation > > > Key: SPARK-27328 > URL: https://issues.apache.org/jira/browse/SPARK-27328 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Currently, there looks no way to show SQL functions are deprecated. See > https://spark.apache.org/docs/2.3.0/api/sql/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27328) Create 'deprecate' property in ExpressionDescription for SQL functions documentation
[ https://issues.apache.org/jira/browse/SPARK-27328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27328. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24259 [https://github.com/apache/spark/pull/24259] > Create 'deprecate' property in ExpressionDescription for SQL functions > documentation > > > Key: SPARK-27328 > URL: https://issues.apache.org/jira/browse/SPARK-27328 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > Currently, there looks no way to show SQL functions are deprecated. See > https://spark.apache.org/docs/2.3.0/api/sql/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27411) DataSourceV2Strategy should not eliminate subquery
Mingcong Han created SPARK-27411: Summary: DataSourceV2Strategy should not eliminate subquery Key: SPARK-27411 URL: https://issues.apache.org/jira/browse/SPARK-27411 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Mingcong Han Fix For: 3.0.0 In DataSourceV2Strategy, it seems we eliminate the subqueries by mistake after normalizing filters. Here is an example: We have an sql with a scalar subquery: {code:scala} val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)") plan.explain(true) {code} And we get the log info of DataSourceV2Strategy: {noformat} Pushing operators to csv:examples/src/main/resources/t2.txt Pushed Filters: Post-Scan Filters: isnotnull(t2a#30) Output: t2a#30, t2b#31 {noformat} The `Post-Scan Filters` should contain the scalar subquery, but we eliminate it by mistake. {noformat} == Parsed Logical Plan == 'Project [*] +- 'Filter ('t2a > scalar-subquery#56 []) : +- 'Project [unresolvedalias('max('t1a), None)] : +- 'UnresolvedRelation `t1` +- 'UnresolvedRelation `t2` == Analyzed Logical Plan == t2a: string, t2b: string Project [t2a#30, t2b#31] +- Filter (t2a#30 > scalar-subquery#56 []) : +- Aggregate [max(t1a#13) AS max(t1a)#63] : +- SubqueryAlias `t1` :+- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt +- SubqueryAlias `t2` +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt == Optimized Logical Plan == Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 [])) : +- Aggregate [max(t1a#13) AS max(t1a)#63] : +- Project [t1a#13] :+- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt == Physical Plan == *(1) Project [t2a#30, t2b#31] +- *(1) Filter isnotnull(t2a#30) +- *(1) BatchScan[t2a#30, t2b#31] class org.apache.spark.sql.execution.datasources.v2.csv.CSVScan {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
Chendi.Xue created SPARK-27412: -- Summary: Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage Key: SPARK-27412 URL: https://issues.apache.org/jira/browse/SPARK-27412 Project: Spark Issue Type: New Feature Components: Shuffle, Spark Core Affects Versions: 3.0.0 Reporter: Chendi.Xue Add a new shuffle manager called "PmemShuffleManager", by using which, we can use Persistent Memory Device as storage for shuffle and external sorter spilling. In this implementation, we leveraged Persistent Memory Development Kit(PMDK) to support transaction write with high performance. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
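Since {{spark.shuffle.manager}} takes a fully qualified class name, a manager like this would presumably be switched on through configuration along these lines. This is a sketch only: the class name and the device option are assumptions based on the description above, not a published interface; the real option names would come from the pull request.

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pmem-shuffle-sketch")
         # Assumed fully qualified class name for the proposed manager.
         .config("spark.shuffle.manager",
                 "org.apache.spark.shuffle.pmem.PmemShuffleManager")
         # Hypothetical persistent-memory device list; the option name is made
         # up here purely for illustration.
         .config("spark.shuffle.pmem.devices", "/dev/dax0.0,/dev/dax1.0")
         .getOrCreate())
{code}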
[jira] [Updated] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
[ https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chendi.Xue updated SPARK-27412: --- External issue URL: https://github.com/apache/spark/pull/24322 > Add a new shuffle manager to use Persistent Memory as shuffle and spilling > storage > -- > > Key: SPARK-27412 > URL: https://issues.apache.org/jira/browse/SPARK-27412 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Chendi.Xue >Priority: Minor > Labels: shuffle > > Add a new shuffle manager called "PmemShuffleManager", by using which, we can > use Persistent Memory Device as storage for shuffle and external sorter > spilling. > In this implementation, we leveraged Persistent Memory Development Kit(PMDK) > to support transaction write with high performance. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
[ https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chendi.Xue updated SPARK-27412: --- External issue URL: (was: https://github.com/apache/spark/pull/24322) > Add a new shuffle manager to use Persistent Memory as shuffle and spilling > storage > -- > > Key: SPARK-27412 > URL: https://issues.apache.org/jira/browse/SPARK-27412 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Chendi.Xue >Priority: Minor > Labels: shuffle > > Add a new shuffle manager called "PmemShuffleManager", by using which, we can > use Persistent Memory Device as storage for shuffle and external sorter > spilling. > In this implementation, we leveraged Persistent Memory Development Kit(PMDK) > to support transaction write with high performance. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org