[jira] [Commented] (SPARK-17071) Fetch Parquet schema within driver-side when there is single file to touch without another Spark job
[ https://issues.apache.org/jira/browse/SPARK-17071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422241#comment-15422241 ] Apache Spark commented on SPARK-17071: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/14660 > Fetch Parquet schema within driver-side when there is single file to touch > without another Spark job > > > Key: SPARK-17071 > URL: https://issues.apache.org/jira/browse/SPARK-17071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently, it seems Spark always launches a distributed job to fetch and merge > Parquet schemas. > It seems we don't actually have to run another Spark job when there is > only a single file to touch (i.e. without {{mergeSchema}}); we could instead fetch > the schema on the driver side, just as the ORC data source does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17071) Fetch Parquet schema within driver-side when there is single file to touch without another Spark job
[ https://issues.apache.org/jira/browse/SPARK-17071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17071: Assignee: (was: Apache Spark) > Fetch Parquet schema within driver-side when there is single file to touch > without another Spark job > > > Key: SPARK-17071 > URL: https://issues.apache.org/jira/browse/SPARK-17071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently, it seems Spark always launches a distributed job to fetch and merge > Parquet schemas. > It seems we don't actually have to run another Spark job when there is > only a single file to touch (i.e. without {{mergeSchema}}); we could instead fetch > the schema on the driver side, just as the ORC data source does.
[jira] [Assigned] (SPARK-17071) Fetch Parquet schema within driver-side when there is single file to touch without another Spark job
[ https://issues.apache.org/jira/browse/SPARK-17071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17071: Assignee: Apache Spark > Fetch Parquet schema within driver-side when there is single file to touch > without another Spark job > > > Key: SPARK-17071 > URL: https://issues.apache.org/jira/browse/SPARK-17071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark > > Currently, it seems Spark always launches a distributed job to fetch and merge > Parquet schemas. > It seems we don't actually have to run another Spark job when there is > only a single file to touch (i.e. without {{mergeSchema}}); we could instead fetch > the schema on the driver side, just as the ORC data source does.
[jira] [Created] (SPARK-17071) Fetch Parquet schema within driver-side when there is single file to touch without another Spark job
Hyukjin Kwon created SPARK-17071: Summary: Fetch Parquet schema within driver-side when there is single file to touch without another Spark job Key: SPARK-17071 URL: https://issues.apache.org/jira/browse/SPARK-17071 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Hyukjin Kwon Currently, it seems Spark always launches a distributed job to fetch and merge Parquet schemas. It seems we don't actually have to run another Spark job when there is only a single file to touch (i.e. without {{mergeSchema}}); we could instead fetch the schema on the driver side, just as the ORC data source does.
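The proposed change can be sketched as a small decision function. This is illustrative Python, with hypothetical `read_footer` and `run_spark_job` callbacks standing in for Spark's real footer reader and distributed schema-merging job; the actual implementation would be Scala inside the Parquet data source.

```python
def resolve_parquet_schema(files, merge_schema, read_footer, run_spark_job):
    """Decide where to read Parquet schemas.

    `read_footer` and `run_spark_job` are hypothetical callbacks, not Spark
    APIs: the first reads one file's footer on the driver, the second runs a
    distributed job that reads and merges footers across executors.
    """
    if len(files) == 1 and not merge_schema:
        # Single file, no schema merging: read the footer on the driver,
        # as the ORC data source does, avoiding a distributed job.
        return read_footer(files[0])
    # Multiple files, or mergeSchema=true: fall back to a Spark job
    # that fetches and merges footers in parallel.
    return run_spark_job(files)
```

The point of the sketch is only the branch condition: the distributed job is skipped exactly when there is one file and no merging requested.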
[jira] [Closed] (SPARK-16760) Pass 'jobId' to Task
[ https://issues.apache.org/jira/browse/SPARK-16760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiqing Yang closed SPARK-16760. Resolution: Duplicate The code for this JIRA has been folded into the PR for SPARK-16757 > Pass 'jobId' to Task > > > Key: SPARK-16760 > URL: https://issues.apache.org/jira/browse/SPARK-16760 > Project: Spark > Issue Type: Sub-task >Reporter: Weiqing Yang > > Ultimately, the Spark caller context written into the HDFS log will be associated > with the task ID, stage ID, job ID, app ID, etc., but Task currently does not know any > job information, so the patch for this JIRA passes the job ID to Task. > That helps Spark users identify tasks, especially if Spark supports > a multi-tenant environment in the future.
[jira] [Resolved] (SPARK-16916) serde/storage properties should not have limitations
[ https://issues.apache.org/jira/browse/SPARK-16916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16916. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14506 [https://github.com/apache/spark/pull/14506] > serde/storage properties should not have limitations > > > Key: SPARK-16916 > URL: https://issues.apache.org/jira/browse/SPARK-16916 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > >
[jira] [Assigned] (SPARK-16757) Set up caller context to HDFS
[ https://issues.apache.org/jira/browse/SPARK-16757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16757: Assignee: (was: Apache Spark) > Set up caller context to HDFS > - > > Key: SPARK-16757 > URL: https://issues.apache.org/jira/browse/SPARK-16757 > Project: Spark > Issue Type: Sub-task >Reporter: Weiqing Yang > > In this JIRA, Spark will invoke the Hadoop caller context API to set up its > caller context in HDFS.
[jira] [Commented] (SPARK-16757) Set up caller context to HDFS
[ https://issues.apache.org/jira/browse/SPARK-16757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422160#comment-15422160 ] Apache Spark commented on SPARK-16757: -- User 'Sherry302' has created a pull request for this issue: https://github.com/apache/spark/pull/14659 > Set up caller context to HDFS > - > > Key: SPARK-16757 > URL: https://issues.apache.org/jira/browse/SPARK-16757 > Project: Spark > Issue Type: Sub-task >Reporter: Weiqing Yang > > In this JIRA, Spark will invoke the Hadoop caller context API to set up its > caller context in HDFS.
[jira] [Assigned] (SPARK-16757) Set up caller context to HDFS
[ https://issues.apache.org/jira/browse/SPARK-16757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16757: Assignee: Apache Spark > Set up caller context to HDFS > - > > Key: SPARK-16757 > URL: https://issues.apache.org/jira/browse/SPARK-16757 > Project: Spark > Issue Type: Sub-task >Reporter: Weiqing Yang >Assignee: Apache Spark > > In this JIRA, Spark will invoke the Hadoop caller context API to set up its > caller context in HDFS.
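As a rough illustration of what a caller context carrying the identifiers discussed in these sub-tasks (app, job, stage, task) might look like, here is a Python sketch. The `SPARK_appId_...` format and the function name are hypothetical, invented for illustration; the real change calls Hadoop's CallerContext API from Scala.

```python
def build_caller_context(app_id, job_id=None, stage_id=None, task_id=None):
    """Assemble a compact caller-context string from Spark identifiers.

    The "SPARK_appId_..." layout is illustrative only, not Spark's actual
    format; optional fields are appended only when known, mirroring how a
    Task could carry job/stage identifiers once they are passed down.
    """
    parts = [f"SPARK_appId_{app_id}"]
    if job_id is not None:
        parts.append(f"jobId_{job_id}")
    if stage_id is not None:
        parts.append(f"stageId_{stage_id}")
    if task_id is not None:
        parts.append(f"taskId_{task_id}")
    return "_".join(parts)
```

Such a string, recorded in the HDFS audit log, would let an operator trace a NameNode operation back to the Spark task that issued it.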
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422141#comment-15422141 ] Apache Spark commented on SPARK-5928: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/14658 > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Imran Rashid > > If a shuffle block is over 2GB, the shuffle fails with an uninformative > exception. The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > //need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode; it only happens on > remote fetches.
I triggered these exceptions running with > {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} > {noformat} > 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, > imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, > imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds > 2147483647: 3021252889 - discarded > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: 
io.netty.handler.codec.TooLongFrameException: Adjusted frame > length exceeds 2147483647: 3021252889 - discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at
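The "Adjusted frame length exceeds 2147483647" message comes from Netty's LengthFieldBasedFrameDecoder, whose maximum frame length is a signed 32-bit integer, so any single block of 2 GB or more cannot be framed. A quick check of the numbers in the trace above (illustrative Python, just arithmetic):

```python
INT_MAX = 2**31 - 1            # 2147483647, the decoder's frame-length ceiling
frame_length = 3_021_252_889   # the "adjusted frame length" from the stack trace

assert frame_length > INT_MAX  # hence TooLongFrameException, frame discarded

# The example program shuffles ~1e6 rows of ~3e3 incompressible bytes each
# into a single partition, i.e. roughly a 3 GB shuffle block.
approx_block_bytes = int(1e6) * int(3e3)
print(approx_block_bytes)  # 3000000000
```

This is why the failure only appears on remote fetches: local reads never go through the Netty frame decoder.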
[jira] [Assigned] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5928: --- Assignee: (was: Apache Spark) > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Imran Rashid > > If a shuffle block is over 2GB, the shuffle fails with an uninformative > exception. The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > //need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode; it only happens on > remote fetches.
I triggered these exceptions running with > {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} > {noformat} > 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, > imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, > imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds > 2147483647: 3021252889 - discarded > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: 
io.netty.handler.codec.TooLongFrameException: Adjusted frame > length exceeds 2147483647: 3021252889 - discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at >
[jira] [Assigned] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5928: --- Assignee: Apache Spark > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Imran Rashid >Assignee: Apache Spark > > If a shuffle block is over 2GB, the shuffle fails with an uninformative > exception. The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > //need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode; it only happens on > remote fetches.
I triggered these exceptions running with > {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} > {noformat} > 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, > imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, > imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds > 2147483647: 3021252889 - discarded > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: 
io.netty.handler.codec.TooLongFrameException: Adjusted frame > length exceeds 2147483647: 3021252889 - discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at >
[jira] [Updated] (SPARK-17070) Zookeeper server refused to accept the client (mesos-master)
[ https://issues.apache.org/jira/browse/SPARK-17070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anh Nguyen updated SPARK-17070: --- Description: I started the ZooKeeper server: ./bin/zkServer.sh start-foreground conf/cnf_zoo.cfg and this is the ZooKeeper server log: ZooKeeper JMX enabled by default Using config: conf/cnf_zoo.cfg 2016-08-16 02:59:20,767 [myid:] - INFO [main:QuorumPeerConfig@103] - Reading configuration from: conf/cnf_zoo.cfg 2016-08-16 02:59:20,778 [myid:] - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3 2016-08-16 02:59:20,779 [myid:] - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 0 2016-08-16 02:59:20,780 [myid:] - INFO [main:DatadirCleanupManager@101] - Purge task is not scheduled. 2016-08-16 02:59:20,781 [myid:] - WARN [main:QuorumPeerMain@113] - Either no config or no quorum defined in config, running in standalone mode 2016-08-16 02:59:20,803 [myid:] - INFO [main:QuorumPeerConfig@103] - Reading configuration from: conf/cnf_zoo.cfg 2016-08-16 02:59:20,804 [myid:] - INFO [main:ZooKeeperServerMain@95] - Starting server 2016-08-16 02:59:20,826 [myid:] - INFO [main:Environment@100] - Server environment:zookeeper.version=3.4.8--1, built on 02/06/2016 03:18 GMT 2016-08-16 02:59:20,828 [myid:] - INFO [main:Environment@100] - Server environment:host.name=zookeeper 2016-08-16 02:59:20,829 [myid:] - INFO [main:Environment@100] - Server environment:java.version=1.8.0_91 2016-08-16 02:59:20,830 [myid:] - INFO [main:Environment@100] - Server environment:java.vendor=Oracle Corporation 2016-08-16 02:59:20,831 [myid:] - INFO [main:Environment@100] - Server environment:java.home=/usr/lib/jvm/java-8-oracle/jre 2016-08-16 02:59:20,834 [myid:] - INFO [main:Environment@100] - Server
environment:java.class.path=/home/vagrant/mesos/zookeeper-3.4.8/bin/../build/classes:/home/vagrant/mesos/zookeeper-3.4.8/bin/../build/lib/*.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/slf4j-log4j12-1.6.1.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/slf4j-api-1.6.1.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/netty-3.7.0.Final.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/log4j-1.2.16.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/jline-0.9.94.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../zookeeper-3.4.8.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../src/java/lib/*.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../conf: 2016-08-16 02:59:20,835 [myid:] - INFO [main:Environment@100] - Server environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib 2016-08-16 02:59:20,835 [myid:] - INFO [main:Environment@100] - Server environment:java.io.tmpdir=/tmp 2016-08-16 02:59:20,836 [myid:] - INFO [main:Environment@100] - Server environment:java.compiler= 2016-08-16 02:59:20,842 [myid:] - INFO [main:Environment@100] - Server environment:os.name=Linux 2016-08-16 02:59:20,842 [myid:] - INFO [main:Environment@100] - Server environment:os.arch=amd64 2016-08-16 02:59:20,844 [myid:] - INFO [main:Environment@100] - Server environment:os.version=3.13.0-92-generic 2016-08-16 02:59:20,845 [myid:] - INFO [main:Environment@100] - Server environment:user.name=vagrant 2016-08-16 02:59:20,846 [myid:] - INFO [main:Environment@100] - Server environment:user.home=/home/vagrant 2016-08-16 02:59:20,846 [myid:] - INFO [main:Environment@100] - Server environment:user.dir=/home/vagrant/mesos/zookeeper-3.4.8 2016-08-16 02:59:20,861 [myid:] - INFO [main:ZooKeeperServer@787] - tickTime set to 2000 2016-08-16 02:59:20,863 [myid:] - INFO [main:ZooKeeperServer@796] - minSessionTimeout set to -1 2016-08-16 02:59:20,864 [myid:] - INFO [main:ZooKeeperServer@805] - maxSessionTimeout set to -1 2016-08-16 02:59:20,883 [myid:] - INFO 
[main:NIOServerCnxnFactory@89] - binding to port 0.0.0.0/0.0.0.0:2181 But when I start mesos-master: mesos-master --zk=zk://192.168.33.10:2128/home/vagrant/mesos/zookeeper-3.4.8 --port=5050 --ip=192.168.33.20 --quorum=1 --work_dir=/home/vagrant/mesos/master1/work --log_dir=/home/vagrant/mesos/master1/logs --advertise_ip=192.168.33.20 --advertise_port=5050 --cluster=mesut-ozil --logging_level=ERROR This is the mesos-master log: ZooKeeper refused the connection from the mesos-master. 2016-08-16 03:29:09,052:2107(0x7f93e2de4700):ZOO_INFO@log_env@747: Client environment:user.name=vagrant 2016-08-16 03:29:09,055:2107(0x7f93e2de4700):ZOO_INFO@log_env@755: Client environment:user.home=/home/vagrant 2016-08-16 03:29:09,055:2107(0x7f93e2de4700):ZOO_INFO@log_env@767: Client environment:user.dir=/home/vagrant 2016-08-16 03:29:09,055:2107(0x7f93e2de4700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=192.168.33.10:2128 sessionTimeout=1 watcher=0x7f93ec335340 sessionId=0 sessionPasswd= context=0x7f93c8000930 flags=0 2016-08-16 03:29:09,052:2107(0x7f93e3de6700):ZOO_INFO@log_env@747: Client
[jira] [Created] (SPARK-17070) Zookeeper server refused to accept the client (mesos-master)
Anh Nguyen created SPARK-17070: -- Summary: Zookeeper server refused to accept the client (mesos-master) Key: SPARK-17070 URL: https://issues.apache.org/jira/browse/SPARK-17070 Project: Spark Issue Type: Bug Reporter: Anh Nguyen I started the ZooKeeper server: ./bin/zkServer.sh start-foreground conf/cnf_zoo.cfg and this is the ZooKeeper server log: ZooKeeper JMX enabled by default Using config: conf/cnf_zoo.cfg 2016-08-16 02:59:20,767 [myid:] - INFO [main:QuorumPeerConfig@103] - Reading configuration from: conf/cnf_zoo.cfg 2016-08-16 02:59:20,778 [myid:] - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3 2016-08-16 02:59:20,779 [myid:] - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 0 2016-08-16 02:59:20,780 [myid:] - INFO [main:DatadirCleanupManager@101] - Purge task is not scheduled. 2016-08-16 02:59:20,781 [myid:] - WARN [main:QuorumPeerMain@113] - Either no config or no quorum defined in config, running in standalone mode 2016-08-16 02:59:20,803 [myid:] - INFO [main:QuorumPeerConfig@103] - Reading configuration from: conf/cnf_zoo.cfg 2016-08-16 02:59:20,804 [myid:] - INFO [main:ZooKeeperServerMain@95] - Starting server 2016-08-16 02:59:20,826 [myid:] - INFO [main:Environment@100] - Server environment:zookeeper.version=3.4.8--1, built on 02/06/2016 03:18 GMT 2016-08-16 02:59:20,828 [myid:] - INFO [main:Environment@100] - Server environment:host.name=zookeeper 2016-08-16 02:59:20,829 [myid:] - INFO [main:Environment@100] - Server environment:java.version=1.8.0_91 2016-08-16 02:59:20,830 [myid:] - INFO [main:Environment@100] - Server environment:java.vendor=Oracle Corporation 2016-08-16 02:59:20,831 [myid:] - INFO [main:Environment@100] - Server environment:java.home=/usr/lib/jvm/java-8-oracle/jre 2016-08-16 02:59:20,834 [myid:] - INFO [main:Environment@100] - Server
environment:java.class.path=/home/vagrant/mesos/zookeeper-3.4.8/bin/../build/classes:/home/vagrant/mesos/zookeeper-3.4.8/bin/../build/lib/*.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/slf4j-log4j12-1.6.1.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/slf4j-api-1.6.1.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/netty-3.7.0.Final.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/log4j-1.2.16.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/jline-0.9.94.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../zookeeper-3.4.8.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../src/java/lib/*.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../conf: 2016-08-16 02:59:20,835 [myid:] - INFO [main:Environment@100] - Server environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib 2016-08-16 02:59:20,835 [myid:] - INFO [main:Environment@100] - Server environment:java.io.tmpdir=/tmp 2016-08-16 02:59:20,836 [myid:] - INFO [main:Environment@100] - Server environment:java.compiler= 2016-08-16 02:59:20,842 [myid:] - INFO [main:Environment@100] - Server environment:os.name=Linux 2016-08-16 02:59:20,842 [myid:] - INFO [main:Environment@100] - Server environment:os.arch=amd64 2016-08-16 02:59:20,844 [myid:] - INFO [main:Environment@100] - Server environment:os.version=3.13.0-92-generic 2016-08-16 02:59:20,845 [myid:] - INFO [main:Environment@100] - Server environment:user.name=vagrant 2016-08-16 02:59:20,846 [myid:] - INFO [main:Environment@100] - Server environment:user.home=/home/vagrant 2016-08-16 02:59:20,846 [myid:] - INFO [main:Environment@100] - Server environment:user.dir=/home/vagrant/mesos/zookeeper-3.4.8 2016-08-16 02:59:20,861 [myid:] - INFO [main:ZooKeeperServer@787] - tickTime set to 2000 2016-08-16 02:59:20,863 [myid:] - INFO [main:ZooKeeperServer@796] - minSessionTimeout set to -1 2016-08-16 02:59:20,864 [myid:] - INFO [main:ZooKeeperServer@805] - maxSessionTimeout set to -1 2016-08-16 02:59:20,883 [myid:] - INFO 
[main:NIOServerCnxnFactory@89] - binding to port 0.0.0.0/0.0.0.0:2181 But, when I start mesos-master: mesos-master --zk=zk://192.168.33.10:2128/home/vagrant/mesos/zookeeper-3.4.8 --port=5050 --ip=192.168.33.20 --quorum=1 --work_dir=/home/vagrant/mesos/master1/work --log_dir=/home/vagrant/mesos/master1/logs --advertise_ip=192.168.33.20 --advertise_port=5050 --cluster=mesut-ozil --logging_level=ERROR 2016-08-16 03:29:09,053:2107(0x7f93e25e3700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=192.168.33.10:2128 sessionTimeout=1 watcher=0x7f93ec335340 sessionId=0 sessionPasswd= context=0x7f93d80047c0 flags=0 2016-08-16 03:29:09,051:2107(0x7f93e35e5700):ZOO_INFO@log_env@747: Client environment:user.name=vagrant 2016-08-16 03:29:09,054:2107(0x7f93e35e5700):ZOO_INFO@log_env@755: Client environment:user.home=/home/vagrant 2016-08-16 03:29:09,054:2107(0x7f93e35e5700):ZOO_INFO@log_env@767: Client environment:user.dir=/home/vagrant 2016-08-16
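One detail worth checking in the logs above (an observation, not a confirmed diagnosis): the ZooKeeper server binds to port 2181, while the mesos-master's {{--zk}} URL targets port 2128, so the refusal may simply be a port mismatch. A quick check with the values from the logs (illustrative Python):

```python
from urllib.parse import urlparse

# The --zk URL passed to mesos-master, as shown in the description.
zk_url = "zk://192.168.33.10:2128/home/vagrant/mesos/zookeeper-3.4.8"
# From "binding to port 0.0.0.0/0.0.0.0:2181" in the ZooKeeper server log.
server_port = 2181

client_port = urlparse(zk_url).port
print(client_port, server_port, client_port == server_port)  # 2128 2181 False
```

If the ports indeed differ, pointing {{--zk}} at port 2181 (or configuring ZooKeeper's clientPort to 2128) would be the first thing to try.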
[jira] [Commented] (SPARK-16990) Define the data structure to hold the statistics for CBO
[ https://issues.apache.org/jira/browse/SPARK-16990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422105#comment-15422105 ] Ron Hu commented on SPARK-16990: We will follow the agreed design spec and add the following information to the Statistics class: - totalSize: Total uncompressed output data size from a LogicalPlan node. - rowCount: Number of output rows from a LogicalPlan node. - fileNumber: Number of files (or number of HDFS data blocks). Currently only useful when a LogicalPlan is a relation. - basicStats: Basic column statistics, i.e., max, min, number of nulls, number of distinct values, max column length, average column length. - histograms: Histograms of columns, i.e., equi-width histogram (for numeric and string types) and equi-height histogram (only for numeric types). - histogramMins: Minimum values maintained for numeric columns. > Define the data structure to hold the statistics for CBO > > > Key: SPARK-16990 > URL: https://issues.apache.org/jira/browse/SPARK-16990 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Ron Hu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
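The fields listed above could be held in a structure roughly along these lines. This is only an illustrative sketch derived from the comment; all names and types here are assumptions, not Spark's actual Statistics implementation:

```scala
// Illustrative sketch only: field names and types are assumptions drawn
// from the comment above, not Spark's actual catalyst Statistics class.
case class ColumnStat(
  max: Option[Any],          // max value
  min: Option[Any],          // min value
  numNulls: Long,            // number of nulls
  numDistinct: Long,         // number of distinct values
  maxColLen: Option[Long],   // max column length
  avgColLen: Option[Double]) // average column length

case class Statistics(
  totalSize: BigInt,                        // total uncompressed output size of a LogicalPlan node
  rowCount: Option[BigInt] = None,          // number of output rows
  fileNumber: Option[Long] = None,          // file (or HDFS block) count; relations only
  basicStats: Map[String, ColumnStat] = Map.empty,          // per-column basic statistics
  histograms: Map[String, Seq[(Double, Long)]] = Map.empty) // bucket upper bound -> row count
```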
[jira] [Commented] (SPARK-17054) SparkR can not run in yarn-cluster mode on mac os
[ https://issues.apache.org/jira/browse/SPARK-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422058#comment-15422058 ] Jeff Zhang commented on SPARK-17054: I have a single-node Hadoop cluster on my laptop, and I run the R script in yarn-cluster mode. > SparkR can not run in yarn-cluster mode on mac os > - > > Key: SPARK-17054 > URL: https://issues.apache.org/jira/browse/SPARK-17054 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > This is due to it download sparkR to the wrong place. > {noformat} > Warning message: > 'sparkR.init' is deprecated. > Use 'sparkR.session' instead. > See help("Deprecated") > Spark not found in SPARK_HOME: . > To search in the cache directory. Installation will start if not found. > Mirror site not provided. > Looking for site suggested from apache website... > Preferred mirror site found: http://apache.mirror.cdnetworks.com/spark > Downloading Spark spark-2.0.0 for Hadoop 2.7 from: > - > http://apache.mirror.cdnetworks.com/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz > Fetch failed from http://apache.mirror.cdnetworks.com/spark > open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', > reason 'No such file or directory'> > To use backup site... > Downloading Spark spark-2.0.0 for Hadoop 2.7 from: > - > http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz > Fetch failed from http://www-us.apache.org/dist/spark > open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', > reason 'No such file or directory'> > Error in robust_download_tar(mirrorUrl, version, hadoopVersion, packageName, > : > Unable to download Spark spark-2.0.0 for Hadoop 2.7. Please check network > connection, Hadoop version, or provide other mirror sites. > Calls: sparkRSQL.init ... sparkR.session -> install.spark -> > robust_download_tar > In addition: Warning messages: > 1: 'sparkRSQL.init' is deprecated. 
> Use 'sparkR.session' instead. > See help("Deprecated") > 2: In dir.create(localDir, recursive = TRUE) : > cannot create dir '/home//Library', reason 'Operation not supported' > Execution halted > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100
[ https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422051#comment-15422051 ] Liwei Lin commented on SPARK-17061: --- This can be reproduced against the master branch; let me look into this. Thanks. > Incorrect results returned following a join of two datasets and a map step > where total number of columns >100 > - > > Key: SPARK-17061 > URL: https://issues.apache.org/jira/browse/SPARK-17061 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.0.1 >Reporter: Jamie Hutton >Priority: Critical > > We have hit a consistent bug where we have a dataset with more than 100 > columns. I am raising this as a blocker because Spark is returning the WRONG > results rather than erroring, leading to data integrity issues. > I have put together the following test case which will show the issue (it > will run in spark-shell). In this example I am joining a dataset with lots of > fields onto another dataset. > The join works fine and if you show the dataset you will get the expected > result. However if you run a map step over the dataset you end up with a > strange error where the sequence that is in the right dataset now only > contains the last value. > Whilst this test may seem a rather contrived example, what we are doing here > is a very standard analytical pattern. My original code was designed to: > - take a dataset of child records > - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children]) > - join the children onto the parent by parentID: giving > ((Parent),(ParentID,Seq[Children])) > - map over the result to give a tuple of (Parent,Seq[Children]) > Notes: > - The issue is resolved by having fewer fields - as soon as we go <= 100 the > integrity issue goes away. Try removing one of the fields from BigCaseClass > below > - The issue will arise based on the total number of fields in the resulting > dataset. 
Below I have a small case class and a big case class, but two case > classes of 50 variables would give the same issue > - the issue occurs where the case class being joined on (on the right) has a > case class type. It doesn't occur if you have a Seq[String] > - If I go back to an RDD for the map step after the join I can work around the > issue, but I lose all the benefits of datasets > Scala code test case: > case class Name(name: String) > case class SmallCaseClass (joinkey: Integer, names: Seq[Name]) > case class BigCaseClass (field1: Integer,field2: Integer,field3: > Integer,field4: Integer,field5: Integer,field6: Integer,field7: > Integer,field8: Integer,field9: Integer,field10: Integer,field11: > Integer,field12: Integer,field13: Integer,field14: Integer,field15: > Integer,field16: Integer,field17: Integer,field18: Integer,field19: > Integer,field20: Integer,field21: Integer,field22: Integer,field23: > Integer,field24: Integer,field25: Integer,field26: Integer,field27: > Integer,field28: Integer,field29: Integer,field30: Integer,field31: > Integer,field32: Integer,field33: Integer,field34: Integer,field35: > Integer,field36: Integer,field37: Integer,field38: Integer,field39: > Integer,field40: Integer,field41: Integer,field42: Integer,field43: > Integer,field44: Integer,field45: Integer,field46: Integer,field47: > Integer,field48: Integer,field49: Integer,field50: Integer,field51: > Integer,field52: Integer,field53: Integer,field54: Integer,field55: > Integer,field56: Integer,field57: Integer,field58: Integer,field59: > Integer,field60: Integer,field61: Integer,field62: Integer,field63: > Integer,field64: Integer,field65: Integer,field66: Integer,field67: > Integer,field68: Integer,field69: Integer,field70: Integer,field71: > Integer,field72: Integer,field73: Integer,field74: Integer,field75: > Integer,field76: Integer,field77: Integer,field78: Integer,field79: > Integer,field80: Integer,field81: Integer,field82: Integer,field83: > Integer,field84: 
Integer,field85: Integer,field86: Integer,field87: > Integer,field88: Integer,field89: Integer,field90: Integer,field91: > Integer,field92: Integer,field93: Integer,field94: Integer,field95: > Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer) > > val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, > 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, > 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, > 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, > 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, > 91, 92, 93, 94, 95, 96, 97, 98, 99)) > > val smallCC=Seq(SmallCaseClass(1,Seq(
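For reference, the analytical pattern the reporter describes (groupByKey up to the parent, join back onto the parent, then map to a tuple) can be sketched minimally as follows. This is a hedged sketch for spark-shell: Parent and Child are illustrative stand-ins, not the truncated case classes above, and it does not reproduce the >100-column condition by itself:

```scala
// Minimal sketch of the groupByKey/join/map pattern (spark-shell, Spark 2.x);
// Parent/Child are illustrative stand-ins for the original case classes.
case class Child(parentId: Int, name: String)
case class Parent(parentId: Int, label: String)

val children = Seq(Child(1, "a"), Child(1, "b")).toDS()
val parents  = Seq(Parent(1, "p")).toDS()

// group children up to the parent: Dataset[(Int, Seq[Child])]
val grouped = children.groupByKey(_.parentId).mapGroups((k, it) => (k, it.toSeq))

// join the grouped children onto the parent by parent id
val joined = parents.joinWith(grouped, parents("parentId") === grouped("_1"))

// map to (Parent, Seq[Child]) -- per the report, the corruption appears at
// this step once the joined row carries more than 100 columns
val result = joined.map { case (p, (_, cs)) => (p, cs) }
```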
[jira] [Commented] (SPARK-17054) SparkR can not run in yarn-cluster mode on mac os
[ https://issues.apache.org/jira/browse/SPARK-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422020#comment-15422020 ] Shivaram Venkataraman commented on SPARK-17054: --- And is this connecting to a remote YARN master or is there a way to run a YARN cluster locally on a laptop ? > SparkR can not run in yarn-cluster mode on mac os > - > > Key: SPARK-17054 > URL: https://issues.apache.org/jira/browse/SPARK-17054 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > This is due to it download sparkR to the wrong place. > {noformat} > Warning message: > 'sparkR.init' is deprecated. > Use 'sparkR.session' instead. > See help("Deprecated") > Spark not found in SPARK_HOME: . > To search in the cache directory. Installation will start if not found. > Mirror site not provided. > Looking for site suggested from apache website... > Preferred mirror site found: http://apache.mirror.cdnetworks.com/spark > Downloading Spark spark-2.0.0 for Hadoop 2.7 from: > - > http://apache.mirror.cdnetworks.com/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz > Fetch failed from http://apache.mirror.cdnetworks.com/spark > open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', > reason 'No such file or directory'> > To use backup site... > Downloading Spark spark-2.0.0 for Hadoop 2.7 from: > - > http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz > Fetch failed from http://www-us.apache.org/dist/spark > open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', > reason 'No such file or directory'> > Error in robust_download_tar(mirrorUrl, version, hadoopVersion, packageName, > : > Unable to download Spark spark-2.0.0 for Hadoop 2.7. Please check network > connection, Hadoop version, or provide other mirror sites. > Calls: sparkRSQL.init ... 
sparkR.session -> install.spark -> > robust_download_tar > In addition: Warning messages: > 1: 'sparkRSQL.init' is deprecated. > Use 'sparkR.session' instead. > See help("Deprecated") > 2: In dir.create(localDir, recursive = TRUE) : > cannot create dir '/home//Library', reason 'Operation not supported' > Execution halted > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17068) Retain view visibility information through out Analysis
[ https://issues.apache.org/jira/browse/SPARK-17068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17068: Assignee: Herman van Hovell (was: Apache Spark) > Retain view visibility information through out Analysis > --- > > Key: SPARK-17068 > URL: https://issues.apache.org/jira/browse/SPARK-17068 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.1.0 > > > Views in Spark SQL are replaced by their backing {{LogicalPlan}} during > analysis. This can be confusing when dealing with and debugging large > {{LogicalPlan}}s. I propose to add an identifier to the subquery alias in > order to improve this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17068) Retain view visibility information through out Analysis
[ https://issues.apache.org/jira/browse/SPARK-17068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421917#comment-15421917 ] Apache Spark commented on SPARK-17068: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/14657 > Retain view visibility information through out Analysis > --- > > Key: SPARK-17068 > URL: https://issues.apache.org/jira/browse/SPARK-17068 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.1.0 > > > Views in Spark SQL are replaced by their backing {{LogicalPlan}} during > analysis. This can be confusing when dealing with and debugging large > {{LogicalPlan}}s. I propose to add an identifier to the subquery alias in > order to improve this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17068) Retain view visibility information through out Analysis
[ https://issues.apache.org/jira/browse/SPARK-17068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17068: Assignee: Apache Spark (was: Herman van Hovell) > Retain view visibility information through out Analysis > --- > > Key: SPARK-17068 > URL: https://issues.apache.org/jira/browse/SPARK-17068 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell >Assignee: Apache Spark > Fix For: 2.1.0 > > > Views in Spark SQL are replaced by their backing {{LogicalPlan}} during > analysis. This can be confusing when dealing with and debugging large > {{LogicalPlan}}s. I propose to add an identifier to the subquery alias in > order to improve this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421878#comment-15421878 ] Jeff Zhang commented on SPARK-16578: I think this feature can also be applied in pyspark. > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17069) Expose spark.range() as table-valued function in SQL
[ https://issues.apache.org/jira/browse/SPARK-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17069: Assignee: (was: Apache Spark) > Expose spark.range() as table-valued function in SQL > > > Key: SPARK-17069 > URL: https://issues.apache.org/jira/browse/SPARK-17069 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Eric Liang >Priority: Minor > > The idea here is to create the spark.range( x ) equivalent in SQL, so we can > do something like > {noformat} > select count(*) from range(1) > {noformat} > This would be useful for sql-only testing and benchmarks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17069) Expose spark.range() as table-valued function in SQL
[ https://issues.apache.org/jira/browse/SPARK-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421853#comment-15421853 ] Apache Spark commented on SPARK-17069: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/14656 > Expose spark.range() as table-valued function in SQL > > > Key: SPARK-17069 > URL: https://issues.apache.org/jira/browse/SPARK-17069 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Eric Liang >Priority: Minor > > The idea here is to create the spark.range( x ) equivalent in SQL, so we can > do something like > {noformat} > select count(*) from range(1) > {noformat} > This would be useful for sql-only testing and benchmarks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17069) Expose spark.range() as table-valued function in SQL
[ https://issues.apache.org/jira/browse/SPARK-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17069: Assignee: Apache Spark > Expose spark.range() as table-valued function in SQL > > > Key: SPARK-17069 > URL: https://issues.apache.org/jira/browse/SPARK-17069 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Minor > > The idea here is to create the spark.range( x ) equivalent in SQL, so we can > do something like > {noformat} > select count(*) from range(1) > {noformat} > This would be useful for sql-only testing and benchmarks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17054) SparkR can not run in yarn-cluster mode on mac os
[ https://issues.apache.org/jira/browse/SPARK-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421851#comment-15421851 ] Jeff Zhang commented on SPARK-17054: Here's the command I run. {code} bin/spark-submit --master yarn-cluster --conf spark.r.command=/usr/local/bin/Rscript ~/github/spark_examples/spark.R {code} > SparkR can not run in yarn-cluster mode on mac os > - > > Key: SPARK-17054 > URL: https://issues.apache.org/jira/browse/SPARK-17054 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > This is due to it download sparkR to the wrong place. > {noformat} > Warning message: > 'sparkR.init' is deprecated. > Use 'sparkR.session' instead. > See help("Deprecated") > Spark not found in SPARK_HOME: . > To search in the cache directory. Installation will start if not found. > Mirror site not provided. > Looking for site suggested from apache website... > Preferred mirror site found: http://apache.mirror.cdnetworks.com/spark > Downloading Spark spark-2.0.0 for Hadoop 2.7 from: > - > http://apache.mirror.cdnetworks.com/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz > Fetch failed from http://apache.mirror.cdnetworks.com/spark > open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', > reason 'No such file or directory'> > To use backup site... > Downloading Spark spark-2.0.0 for Hadoop 2.7 from: > - > http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz > Fetch failed from http://www-us.apache.org/dist/spark > open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', > reason 'No such file or directory'> > Error in robust_download_tar(mirrorUrl, version, hadoopVersion, packageName, > : > Unable to download Spark spark-2.0.0 for Hadoop 2.7. Please check network > connection, Hadoop version, or provide other mirror sites. > Calls: sparkRSQL.init ... 
sparkR.session -> install.spark -> > robust_download_tar > In addition: Warning messages: > 1: 'sparkRSQL.init' is deprecated. > Use 'sparkR.session' instead. > See help("Deprecated") > 2: In dir.create(localDir, recursive = TRUE) : > cannot create dir '/home//Library', reason 'Operation not supported' > Execution halted > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17054) SparkR can not run in yarn-cluster mode on mac os
[ https://issues.apache.org/jira/browse/SPARK-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421846#comment-15421846 ] Jeff Zhang commented on SPARK-17054: Do you run it as yarn-cluster mode ? > SparkR can not run in yarn-cluster mode on mac os > - > > Key: SPARK-17054 > URL: https://issues.apache.org/jira/browse/SPARK-17054 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > This is due to it download sparkR to the wrong place. > {noformat} > Warning message: > 'sparkR.init' is deprecated. > Use 'sparkR.session' instead. > See help("Deprecated") > Spark not found in SPARK_HOME: . > To search in the cache directory. Installation will start if not found. > Mirror site not provided. > Looking for site suggested from apache website... > Preferred mirror site found: http://apache.mirror.cdnetworks.com/spark > Downloading Spark spark-2.0.0 for Hadoop 2.7 from: > - > http://apache.mirror.cdnetworks.com/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz > Fetch failed from http://apache.mirror.cdnetworks.com/spark > open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', > reason 'No such file or directory'> > To use backup site... > Downloading Spark spark-2.0.0 for Hadoop 2.7 from: > - > http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz > Fetch failed from http://www-us.apache.org/dist/spark > open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', > reason 'No such file or directory'> > Error in robust_download_tar(mirrorUrl, version, hadoopVersion, packageName, > : > Unable to download Spark spark-2.0.0 for Hadoop 2.7. Please check network > connection, Hadoop version, or provide other mirror sites. > Calls: sparkRSQL.init ... sparkR.session -> install.spark -> > robust_download_tar > In addition: Warning messages: > 1: 'sparkRSQL.init' is deprecated. > Use 'sparkR.session' instead. 
> See help("Deprecated") > 2: In dir.create(localDir, recursive = TRUE) : > cannot create dir '/home//Library', reason 'Operation not supported' > Execution halted > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17069) Expose spark.range() as table-valued function in SQL
Eric Liang created SPARK-17069: -- Summary: Expose spark.range() as table-valued function in SQL Key: SPARK-17069 URL: https://issues.apache.org/jira/browse/SPARK-17069 Project: Spark Issue Type: New Feature Components: SQL Reporter: Eric Liang Priority: Minor The idea here is to create the spark.range( x ) equivalent in SQL, so we can do something like {noformat} select count(*) from range(1) {noformat} This would be useful for sql-only testing and benchmarks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
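For context, the DataFrame-side API that this SQL table-valued function would mirror already exists; a quick sketch follows. The SQL form shown in the comment is the proposal, not something available at the time of this issue:

```scala
// Existing DataFrame API (spark-shell): a single-column Dataset of ids
val df = spark.range(1000)   // one column named "id", values 0 to 999
df.count()

// The proposal would allow the equivalent directly in SQL, e.g.:
// spark.sql("select count(*) from range(1000)")
```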
[jira] [Resolved] (SPARK-17065) Improve the error message when encountering an incompatible DataSourceRegister
[ https://issues.apache.org/jira/browse/SPARK-17065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-17065. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14651 [https://github.com/apache/spark/pull/14651] > Improve the error message when encountering an incompatible DataSourceRegister > -- > > Key: SPARK-17065 > URL: https://issues.apache.org/jira/browse/SPARK-17065 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.1, 2.1.0 > > > When encountering an incompatible DataSourceRegister, it's better to add > instructions to remove or upgrade it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16158) Support pluggable dynamic allocation heuristics
[ https://issues.apache.org/jira/browse/SPARK-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421803#comment-15421803 ] Thomas Graves commented on SPARK-16158: --- Seems like an OK idea to me, but you have to make sure to do it right. For instance, I would expect different policies to possibly have different configs; we need to think about how those work, are named, are documented, etc. Make sure things aren't relying on implicit behavior, and make sure the communication with the resource managers (Mesos/YARN) is well defined. I also think the current policy could be improved a lot. Is there a reason not to just improve that? Obviously, having the user define the policy for different jobs makes things more complex for the user, so it would be nice to have at least one generic policy that works well in many situations, while still allowing highly optimized ones. One thing you mention is executors between stages. The min executor setting should help with that, but you could have a largely varying number of tasks between different stages, which makes that not ideal. I could see some other policy having a different config to handle this. > Support pluggable dynamic allocation heuristics > --- > > Key: SPARK-16158 > URL: https://issues.apache.org/jira/browse/SPARK-16158 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nezih Yigitbasi > > It would be nice if Spark supports plugging in custom dynamic allocation > heuristics. This feature would be useful for experimenting with new > heuristics and also useful for plugging in different heuristics per job etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
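To make the discussion concrete, a pluggable policy might be shaped roughly like the trait below. This is purely a hypothetical sketch for illustration; the trait, method names, and the sample policy are assumptions, not an existing Spark API:

```scala
// Hypothetical sketch only: none of these names exist in Spark.
trait ExecutorAllocationPolicy {
  /** Desired executor count, derived from current scheduler state. */
  def targetExecutors(pendingTasks: Int, runningTasks: Int, currentExecutors: Int): Int
}

// One possible implementation; each policy could carry its own configs,
// which is where the naming/documentation questions above come in.
class TasksPerExecutorPolicy(tasksPerExecutor: Int, minExecutors: Int)
    extends ExecutorAllocationPolicy {
  override def targetExecutors(pending: Int, running: Int, current: Int): Int =
    math.max(minExecutors,
      math.ceil((pending + running).toDouble / tasksPerExecutor).toInt)
}
```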
[jira] [Commented] (SPARK-17038) StreamingSource reports metrics for lastCompletedBatch instead of lastReceivedBatch
[ https://issues.apache.org/jira/browse/SPARK-17038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421789#comment-15421789 ] Xin Ren commented on SPARK-17038: - Hi [~ozzieba], if you don't have time, I can just submit a quick patch for this :) > StreamingSource reports metrics for lastCompletedBatch instead of > lastReceivedBatch > --- > > Key: SPARK-17038 > URL: https://issues.apache.org/jira/browse/SPARK-17038 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.2, 2.0.0 >Reporter: Oz Ben-Ami >Priority: Minor > Labels: metrics > > StreamingSource's lastReceivedBatch_submissionTime, > lastReceivedBatch_processingTimeStart, and > lastReceivedBatch_processingTimeEnd all use data from lastCompletedBatch > instead of lastReceivedBatch. In particular, this makes it impossible to > match lastReceivedBatch_records with a batchID/submission time. > This is apparent when looking at StreamingSource.scala, lines 89-94. > I would guess that just replacing Completed->Received in those lines would > fix the issue, but I haven't tested it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package
[ https://issues.apache.org/jira/browse/SPARK-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421783#comment-15421783 ] Apache Spark commented on SPARK-16964: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/14654 > Remove private[sql] and private[spark] from sql.execution package > - > > Key: SPARK-16964 > URL: https://issues.apache.org/jira/browse/SPARK-16964 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > The execution package is meant to be internal, and as a result it does not > make sense to mark things as private[sql] or private[spark]. It simply makes > debugging harder when Spark developers need to inspect the plans at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16669) Partition pruning for metastore relation size estimates for better join selection.
[ https://issues.apache.org/jira/browse/SPARK-16669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421782#comment-15421782 ] Apache Spark commented on SPARK-16669: -- User 'Parth-Brahmbhatt' has created a pull request for this issue: https://github.com/apache/spark/pull/14655 > Partition pruning for metastore relation size estimates for better join > selection. > -- > > Key: SPARK-16669 > URL: https://issues.apache.org/jira/browse/SPARK-16669 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Parth Brahmbhatt > > Currently the metastore statistics return the size of the entire table, which > causes the join selection strategy to not use broadcast joins even when only > a single partition from a large table is selected. We should optimize the > statistics calculation at the table level to apply partition pruning and only > get the size of the partitions that are valid for the query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17068) Retain view visibility information through out Analysis
Herman van Hovell created SPARK-17068: - Summary: Retain view visibility information through out Analysis Key: SPARK-17068 URL: https://issues.apache.org/jira/browse/SPARK-17068 Project: Spark Issue Type: Improvement Components: SQL Reporter: Herman van Hovell Assignee: Herman van Hovell Fix For: 2.1.0 Views in Spark SQL are replaced by their backing {{LogicalPlan}} during analysis. This can be confusing when dealing with and debugging large {{LogicalPlan}}s. I propose to add an identifier to the subquery alias in order to improve this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17066) dateFormat should be used when writing dataframes as csv files
[ https://issues.apache.org/jira/browse/SPARK-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421758#comment-15421758 ] Barry Becker commented on SPARK-17066: -- Yes, I think it's fair to say this is a subset of SPARK-16216. Probably this can be duped to that one. > dateFormat should be used when writing dataframes as csv files > -- > > Key: SPARK-17066 > URL: https://issues.apache.org/jira/browse/SPARK-17066 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > I noticed this when running tests after pulling and building @lw-lin 's PR > (https://github.com/apache/spark/pull/14118). I don't think there is anything > wrong with his PR, just that the fix that was made to spark-csv for this > issue was never moved to spark 2.x when Databricks' spark-csv was merged into > spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 > was fixed in spark-csv after that merge. > The problem is that if I try to write a dataframe that contains a date column > out to a csv using something like this > repartitionDf.write.format("csv") //.format(DATABRICKS_CSV) > .option("delimiter", "\t") > .option("header", "false") > .option("nullValue", "?") > .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss") > .option("escape", "\\") > .save(tempFileName) > Then my unit test (which passed under spark 1.6.2) fails using the spark > 2.1.0 snapshot build that I made today. The dataframe contained 3 values in a > date column. > Expected "[2012-01-03T09:12:00 > ? > 2015-02-23T18:00:]00", > but got > "[132561072000 > ? > 1424743200]00" > This means that while the null value is being correctly exported, the > specified dateFormat is not being used to format the date. Instead it looks > like the number of seconds since the epoch is being used. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421753#comment-15421753 ] Apache Spark commented on SPARK-10931: -- User 'evanyc15' has created a pull request for this issue: https://github.com/apache/spark/pull/14653 > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
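The proposed behavior — Estimator.fit copying Param values onto the Model it returns — can be illustrated with a minimal plain-Python sketch. The class and method names below are simplified stand-ins, not the real pyspark.ml API:

```python
class Params:
    """Minimal stand-in for pyspark.ml.param.Params (illustrative only)."""
    def __init__(self):
        self._paramMap = {}

    def set(self, name, value):
        self._paramMap[name] = value
        return self

    def get(self, name):
        return self._paramMap.get(name)

    def copy_params_to(self, other):
        # Copy every explicitly set Param value onto `other`.
        other._paramMap.update(self._paramMap)
        return other


class Model(Params):
    pass


class Estimator(Params):
    def fit(self, dataset):
        model = self._fit(dataset)
        # The proposed change: propagate this Estimator's Param values
        # to the fitted model before returning it.
        return self.copy_params_to(model)

    def _fit(self, dataset):
        return Model()  # training logic elided


est = Estimator().set("maxIter", 10).set("regParam", 0.01)
model = est.fit(dataset=None)
print(model.get("maxIter"))  # -> 10
```

With this shape, the Param values become inspectable on the model itself, which is what the unit tests mentioned above would assert.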
[jira] [Commented] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package
[ https://issues.apache.org/jira/browse/SPARK-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421746#comment-15421746 ] Apache Spark commented on SPARK-16964: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/14652 > Remove private[sql] and private[spark] from sql.execution package > - > > Key: SPARK-16964 > URL: https://issues.apache.org/jira/browse/SPARK-16964 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > The execution package is meant to be internal, and as a result it does not > make sense to mark things as private[sql] or private[spark]. It simply makes > debugging harder when Spark developers need to inspect the plans at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421731#comment-15421731 ] Sean Owen commented on SPARK-16320: --- I think that was the problem being solved there though, right? there's also an arguably bigger downside to setting Xms, and only an indirect way to control the region size if that's what's desired. I'd favor pointing out G1HeapRegionSize as a possible tuning parameter to increase. That is, to solve the problem you identify, Xms isn't the best solution even. > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421728#comment-15421728 ] Maciej Bryński edited comment on SPARK-16320 at 8/15/16 9:42 PM: - Maybe we can change SPARK-12384 a little bit and set default value to -Xms (as in Spark 1.6) ? (user will still have an option to change it) was (Author: maver1ck): Maybe we can change SPARK-12384 a little bit and set default value to -Xms (as in Spark 1.6) ? (user still will have an option to change it) > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421728#comment-15421728 ] Maciej Bryński commented on SPARK-16320: Maybe we can change SPARK-12384 a little bit and set default value to -Xms (as in Spark 1.6) ? (user still will have an option to change it) > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421719#comment-15421719 ] Sean Owen commented on SPARK-16320: --- I see, I wonder if this deserves a bit of documentation in the tuning section? you'd be welcome to collect some of the wisdom here into a note in the docs. > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17066) dateFormat should be used when writing dataframes as csv files
[ https://issues.apache.org/jira/browse/SPARK-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421723#comment-15421723 ] Sean Owen commented on SPARK-17066: --- I think this is a subset of https://issues.apache.org/jira/browse/SPARK-16216 ? > dateFormat should be used when writing dataframes as csv files > -- > > Key: SPARK-17066 > URL: https://issues.apache.org/jira/browse/SPARK-17066 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > I noticed this when running tests after pulling and building @lw-lin 's PR > (https://github.com/apache/spark/pull/14118). I don't think it is anything > wrong with his PR, just that the fix that was made to spark-csv for this > issue was never moved to spark 2.x when databrick's spark-csv was merged into > spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 > was fixed in spark-csv after that merge. > The problem is that if I try to write a dataframe that contains a date column > out to a csv using something like this > repartitionDf.write.format("csv") //.format(DATABRICKS_CSV) > .option("delimiter", "\t") > .option("header", "false") > .option("nullValue", "?") > .option("dateFormat", "-MM-dd'T'HH:mm:ss") > .option("escape", "\\") > .save(tempFileName) > Then my unit test (which passed under spark 1.6.2) fails using the spark > 2.1.0 snapshot build that I made today. The dataframe contained 3 values in a > date column. > Expected "[2012-01-03T09:12:00 > ? > 2015-02-23T18:00:]00", > but got > "[132561072000 > ? > 1424743200]00" > This means that while the null value is being correctly exported, the > specified dateFormat is not being used to format the date. Instead it looks > like number of seconds from epoch is being used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17066) dateFormat should be used when writing dataframes as csv files
[ https://issues.apache.org/jira/browse/SPARK-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-17066: - Description: I noticed this when running tests after pulling and building @lw-lin 's PR (https://github.com/apache/spark/pull/14118). I don't think it is anything wrong with his PR, just that the fix that was made to spark-csv for this issue was never moved to spark 2.x when databrick's spark-csv was merged into spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 was fixed in spark-csv after that merge. The problem is that if I try to write a dataframe that contains a date column out to a csv using something like this repartitionDf.write.format("csv") //.format(DATABRICKS_CSV) .option("delimiter", "\t") .option("header", "false") .option("nullValue", "?") .option("dateFormat", "-MM-dd'T'HH:mm:ss") .option("escape", "\\") .save(tempFileName) Then my unit test (which passed under spark 1.6.2) fails using the spark 2.1.0 snapshot build that I made today. The dataframe contained 3 values in a date column. Expected "[2012-01-03T09:12:00 ? 2015-02-23T18:00:]00", but got "[132561072000 ? 1424743200]00" This means that while the null value is being correctly exported, the specified dateFormat is not being used to format the date. Instead it looks like number of seconds from epoch is being used. was: I noticed this when running tests after pulling and building @lw-lin 's PR (https://github.com/apache/spark/pull/14118). I don't think it is anything wrong with his PR, just that the fix that was made to spark-csv for this issue was never moved to spark 2.x when databrick's spark-csv was merged into spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 was fixed in spark-csv after that merge. 
The problem is that if I try to write a dataframe that contains a date column out to a csv using something like this repartitionDf.write.format("csv") //.format(DATABRICKS_CSV) .option("delimiter", "\t") .option("header", "false") .option("nullValue", "?") .option("dateFormat", "-MM-dd'T'HH:mm:ss") .option("escape", "\\") .save(tempFileName) Then my unit test (which passed under spark 1.6.2 fails using the spark 2.1.0 snapshot build that I made today. Expected "[2012-01-03T09:12:00 ? 2015-02-23T18:00:]00", but got "[132561072000 ? 1424743200]00" This means that while the null value is being correctly exported, the specified dateFormat is not being used to format the date. Instead it looks like number of seconds from epoch is being used. > dateFormat should be used when writing dataframes as csv files > -- > > Key: SPARK-17066 > URL: https://issues.apache.org/jira/browse/SPARK-17066 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > I noticed this when running tests after pulling and building @lw-lin 's PR > (https://github.com/apache/spark/pull/14118). I don't think it is anything > wrong with his PR, just that the fix that was made to spark-csv for this > issue was never moved to spark 2.x when databrick's spark-csv was merged into > spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 > was fixed in spark-csv after that merge. > The problem is that if I try to write a dataframe that contains a date column > out to a csv using something like this > repartitionDf.write.format("csv") //.format(DATABRICKS_CSV) > .option("delimiter", "\t") > .option("header", "false") > .option("nullValue", "?") > .option("dateFormat", "-MM-dd'T'HH:mm:ss") > .option("escape", "\\") > .save(tempFileName) > Then my unit test (which passed under spark 1.6.2) fails using the spark > 2.1.0 snapshot build that I made today. The dataframe contained 3 values in a > date column. 
> Expected "[2012-01-03T09:12:00 > ? > 2015-02-23T18:00:]00", > but got > "[132561072000 > ? > 1424743200]00" > This means that while the null value is being correctly exported, the > specified dateFormat is not being used to format the date. Instead it looks > like number of seconds from epoch is being used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16158) Support pluggable dynamic allocation heuristics
[ https://issues.apache.org/jira/browse/SPARK-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421707#comment-15421707 ] Reynold Xin commented on SPARK-16158: - Actually seems like a great idea, if just for modularity of software. > Support pluggable dynamic allocation heuristics > --- > > Key: SPARK-16158 > URL: https://issues.apache.org/jira/browse/SPARK-16158 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nezih Yigitbasi > > It would be nice if Spark supports plugging in custom dynamic allocation > heuristics. This feature would be useful for experimenting with new > heuristics and also useful for plugging in different heuristics per job etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17067) Revocable resource support
[ https://issues.apache.org/jira/browse/SPARK-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421698#comment-15421698 ] Michael Gummelt commented on SPARK-17067: - Add revocable resource support: http://mesos.apache.org/documentation/latest/oversubscription/ This will allow higher-priority jobs (or other, non-Spark services) to preempt lower-priority jobs. > Revocable resource support > -- > > Key: SPARK-17067 > URL: https://issues.apache.org/jira/browse/SPARK-17067 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Michael Gummelt > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421699#comment-15421699 ] Maciej Bryński commented on SPARK-16320: [~srowen], [~michael], I found the reason why G1GC with the default configuration is so slow on Spark 2.0. That's the reason: https://issues.apache.org/jira/browse/SPARK-12384 G1GC calculates the region size based on the initial heap size, which is 2G, so the region size is 2G/2048 = 1M. When the heap grows up to 30G (my settings) we have 30k regions, which is too much. So we have to configure one of the following options: 1) Set -Xms30g 2) Set -XX:G1HeapRegionSize=16m Either solves the problem. > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem.
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
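The two G1 workarounds suggested in this thread can be applied through SparkConf, mirroring the reporter's own setup script. A sketch (requires pyspark; Spark forbids heap-max flags like -Xmx in extraJavaOptions, and whether -Xms is accepted may depend on the Spark version, so the region-size flag is likely the more portable of the two):

```python
# Sketch: applying either G1 workaround from the comment above via SparkConf.
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set('spark.executor.memory', '30g')

# Workaround 1: pin the initial heap to the maximum, so G1 derives its
# region size from the full 30g heap rather than a small initial heap.
# (-Xmx is disallowed here; -Xms may be, too, on some versions.)
conf.set('spark.executor.extraJavaOptions', '-XX:+UseG1GC -Xms30g')

# Workaround 2 (alternative): set the G1 region size explicitly instead.
# conf.set('spark.executor.extraJavaOptions',
#          '-XX:+UseG1GC -XX:G1HeapRegionSize=16m')

sc = SparkContext(conf=conf)
```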
[jira] [Created] (SPARK-17067) Revocable resource support
Michael Gummelt created SPARK-17067: --- Summary: Revocable resource support Key: SPARK-17067 URL: https://issues.apache.org/jira/browse/SPARK-17067 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Michael Gummelt -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17066) dateFormat should be used when writing dataframes as csv files
Barry Becker created SPARK-17066: Summary: dateFormat should be used when writing dataframes as csv files Key: SPARK-17066 URL: https://issues.apache.org/jira/browse/SPARK-17066 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.0.0 Reporter: Barry Becker I noticed this when running tests after pulling and building @lw-lin 's PR (https://github.com/apache/spark/pull/14118). I don't think it is anything wrong with his PR, just that the fix that was made to spark-csv for this issue was never moved to spark 2.x when Databricks' spark-csv was merged into spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 was fixed in spark-csv after that merge. The problem is that if I try to write a dataframe that contains a date column out to a csv using something like this repartitionDf.write.format("csv") //.format(DATABRICKS_CSV) .option("delimiter", "\t") .option("header", "false") .option("nullValue", "?") .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss") .option("escape", "\\") .save(tempFileName) Then my unit test (which passed under spark 1.6.2) fails using the spark 2.1.0 snapshot build that I made today. Expected "[2012-01-03T09:12:00 ? 2015-02-23T18:00:]00", but got "[132561072000 ? 1424743200]00" This means that while the null value is being correctly exported, the specified dateFormat is not being used to format the date. Instead it looks like the number of seconds since the epoch is being used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
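Decoding the two longs from the failing assertion supports the reporter's reading: they are seconds since the Unix epoch, shown here shifted by an assumed UTC-8 (US Pacific Standard Time) offset to match the expected local timestamps. The first value is reconstructed as 1325610720 from the diff, whose shared trailing "00" makes the raw digits ambiguous; the second is read off directly:

```python
from datetime import datetime, timedelta, timezone

# Longs from the failing test above: 1325610720 reconstructed from the
# diff's ambiguous "132561072000", 1424743200 read off directly.
raw = [1325610720, 1424743200]

# Interpret each as seconds since the Unix epoch and shift by UTC-8,
# assuming the reporter's machine ran in US Pacific Standard Time.
decoded = [
    (datetime.fromtimestamp(s, tz=timezone.utc) - timedelta(hours=8))
    .strftime("%Y-%m-%dT%H:%M:%S")
    for s in raw
]
print(decoded)  # ['2012-01-03T09:12:00', '2015-02-23T18:00:00']
```

That the raw epoch seconds round-trip exactly to the expected formatted values shows the date data itself is intact and only the dateFormat option is being ignored on write.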
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421688#comment-15421688 ] Barry Becker commented on SPARK-17039: -- I did notice that https://github.com/databricks/spark-csv/issues/308 was not ported to spark 2.x. IOW, when you specify the dateFormat when writing with format("csv") the dates get written as longs instead of dates with the specified dateFormat. I will open a separate issue for it. > cannot read null dates from csv file > > > Key: SPARK-17039 > URL: https://issues.apache.org/jira/browse/SPARK-17039 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > I see this exact same bug as reported in this [stack overflow > post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column] > using Spark 2.0.0 (released version). > In scala, I read a csv using > sqlContext.read > .format("csv") > .option("header", "false") > .option("inferSchema", "false") > .option("nullValue", "?") > .option("dateFormat", "-MM-dd'T'HH:mm:ss") > .schema(dfSchema) > .csv(dataFile) > The data contains some null dates (represented with ?). > The error I get is: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 > (TID 10, localhost): java.text.ParseException: Unparseable date: "?" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17065) Improve the error message when encountering an incompatible DataSourceRegister
[ https://issues.apache.org/jira/browse/SPARK-17065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17065: Assignee: Apache Spark (was: Shixiong Zhu) > Improve the error message when encountering an incompatible DataSourceRegister > -- > > Key: SPARK-17065 > URL: https://issues.apache.org/jira/browse/SPARK-17065 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > When encountering an incompatible DataSourceRegister, it's better to add > instructions to remove or upgrade it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17065) Improve the error message when encountering an incompatible DataSourceRegister
[ https://issues.apache.org/jira/browse/SPARK-17065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17065: Assignee: Shixiong Zhu (was: Apache Spark) > Improve the error message when encountering an incompatible DataSourceRegister > -- > > Key: SPARK-17065 > URL: https://issues.apache.org/jira/browse/SPARK-17065 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > When encountering an incompatible DataSourceRegister, it's better to add > instructions to remove or upgrade it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17065) Improve the error message when encountering an incompatible DataSourceRegister
[ https://issues.apache.org/jira/browse/SPARK-17065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421671#comment-15421671 ] Apache Spark commented on SPARK-17065: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/14651 > Improve the error message when encountering an incompatible DataSourceRegister > -- > > Key: SPARK-17065 > URL: https://issues.apache.org/jira/browse/SPARK-17065 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > When encountering an incompatible DataSourceRegister, it's better to add > instructions to remove or upgrade it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17065) Improve the error message when encountering an incompatible DataSourceRegister
Shixiong Zhu created SPARK-17065: Summary: Improve the error message when encountering an incompatible DataSourceRegister Key: SPARK-17065 URL: https://issues.apache.org/jira/browse/SPARK-17065 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu When encountering an incompatible DataSourceRegister, it's better to add instructions to remove or upgrade it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
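As a sketch of what the improved message could look like, here is an illustrative Python helper; the function name and wording are hypothetical and not taken from the actual pull request:

```python
def incompatible_source_message(class_name, spark_version):
    # Hypothetical: name the offending DataSourceRegister class and tell
    # the user how to recover, instead of failing with a bare
    # service-loading error.
    return (
        "Detected an incompatible DataSourceRegister: %s. "
        "Please remove the incompatible library from the classpath or "
        "upgrade it to a version compatible with Spark %s."
        % (class_name, spark_version)
    )

print(incompatible_source_message("com.example.OldSource", "2.0.0"))
```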
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421611#comment-15421611 ] Barry Becker commented on SPARK-17039: -- I was able to pull the patch (https://github.com/apache/spark/pull/14118) and verify that it fixes my issue with reading null values. I hope that this patch will make it in for 2.0.1. Thanks. > cannot read null dates from csv file > > > Key: SPARK-17039 > URL: https://issues.apache.org/jira/browse/SPARK-17039 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > I see this exact same bug as reported in this [stack overflow > post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column] > using Spark 2.0.0 (released version). > In Scala, I read a csv using > sqlContext.read > .format("csv") > .option("header", "false") > .option("inferSchema", "false") > .option("nullValue", "?") > .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss") > .schema(dfSchema) > .csv(dataFile) > The data contains some null dates (represented with ?). > The error I get is: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 > (TID 10, localhost): java.text.ParseException: Unparseable date: "?" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
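The failure mode in SPARK-17039 is that the nullValue sentinel string reaches the date parser unchecked. A minimal Python sketch of the kind of fix involved (names are illustrative; the actual change lives in Spark's CSV parsing code):

```python
from datetime import datetime

def parse_date(token, null_value="?", fmt="%Y-%m-%dT%H:%M:%S"):
    # Check the nullValue sentinel first, so "?" becomes a null instead
    # of being handed to the date parser (which would raise, as in the
    # "Unparseable date" stack trace above).
    if token == null_value:
        return None
    return datetime.strptime(token, fmt)

print(parse_date("?"))                    # -> None
print(parse_date("2012-01-03T09:12:00"))
```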
[jira] [Commented] (SPARK-17064) Reconsider spark.job.interruptOnCancel
[ https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421610#comment-15421610 ] Reynold Xin commented on SPARK-17064: - Yea so my worry is that even if Hadoop has addressed this (HDFS-1208 actually hasn't been resolved), the problem exists outside Hadoop and a lot of underlying storage clients are not handling interrupts well and taking this away will make users' life very difficult when working with other storage systems. > Reconsider spark.job.interruptOnCancel > -- > > Key: SPARK-17064 > URL: https://issues.apache.org/jira/browse/SPARK-17064 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Reporter: Mark Hamstra > > There is a frequent need or desire in Spark to cancel already running Tasks. > This has been recognized for a very long time (see, e.g., the ancient TODO > comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've > never had more than an incomplete solution. Killing running Tasks at the > Executor level has been implemented by interrupting the threads running the > Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since > https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 > addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting > Task threads in this way has only been possible if interruptThread is true, > and that typically comes from the setting of the interruptOnCancel property > in the JobGroup, which in turn typically comes from the setting of > spark.job.interruptOnCancel. Because of concerns over > https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes > being marked dead when a Task thread is interrupted, the default value of the > boolean has been "false" -- i.e. 
by default we do not interrupt Tasks already > running on Executor even when the Task has been canceled in the DAGScheduler, > or the Stage has been abort, or the Job has been killed, etc. > There are several issues resulting from this current state of affairs, and > they each probably need to spawn their own JIRA issue and PR once we decide > on an overall strategy here. Among those issues: > * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS > versions that Spark now supports so that we can set the default value of > spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely? > * Even if interrupting Task threads is no longer an issue for HDFS, is it > still enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still > need protection similar to what the current default value of > spark.job.interruptOnCancel provides? > * If interrupting Task threads isn't safe enough, what should we do instead? > * Once we have a safe mechanism to stop and clean up after already executing > Tasks, there is still the question of whether we _should_ end executing > Tasks. While that is likely a good thing to do in cases where individual > Tasks are lightweight in terms of resource usage, at least in some cases not > all running Tasks should be ended: https://github.com/apache/spark/pull/12436 > That means that we probably need to continue to make allowing Task > interruption configurable at the Job or JobGroup level (and we need better > documentation explaining how and when to allow interruption or not.) > * There is one place in the current code > (TaskSetManager#handleSuccessfulTask) that hard codes interruptThread to > "true". This should be fixed, and similar misuses of killTask be denied in > pull requests until this issue is adequately resolved. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-16700. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14469 [https://github.com/apache/spark/pull/14469] > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer >Assignee: Davies Liu > Fix For: 2.1.0 > > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5. > StructType in Spark 1.6.2 accepts the Python {{dict}} type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
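The behaviour being restored can be sketched in plain Python: a dict is conformed to the schema by field name, a tuple or list by position, and anything else is rejected with the error seen in the ticket. This is an illustrative helper, not PySpark's actual implementation:

```python
def conform_row(obj, field_names):
    # Illustrative sketch of the accepted-input check: dicts are matched
    # to schema fields by name, tuples/lists by position; other objects
    # are rejected with the TypeError reported in SPARK-16700.
    if isinstance(obj, dict):
        return tuple(obj.get(name) for name in field_names)
    if isinstance(obj, (tuple, list)):
        return tuple(obj)
    raise TypeError("StructType can not accept object %r" % (obj,))

print(conform_row({"id": 0}, ["id"]))  # -> (0,)
```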
[jira] [Comment Edited] (SPARK-16321) [Spark 2.0] Performance regression when reading parquet and using PPD and non-vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421468#comment-15421468 ] Maciej Bryński edited comment on SPARK-16321 at 8/15/16 7:34 PM: - [~davies] I think you can mark this one as resolved. was (Author: maver1ck): [~davies] I think you mark this one as resolved. > [Spark 2.0] Performance regression when reading parquet and using PPD and > non-vectorized reader > --- > > Key: SPARK-16321 > URL: https://issues.apache.org/jira/browse/SPARK-16321 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Liang-Chi Hsieh >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, > spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, > spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, > visualvm_spark2_G1GC.png > > > *UPDATE* > Please start with this comment > https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785 > I assume that problem results from the performance problem with reading > parquet files > *Original Issue description* > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is 2x slower. > {code} > df = sqlctx.read.parquet(path) > df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 > else []).collect() > {code} > Spark 1.6 -> 2.3 min > Spark 2.0 -> 4.6 min (2x slower) > I used BasicProfiler for this task and cumulative time was: > Spark 1.6 - 4300 sec > Spark 2.0 - 5800 sec > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16321) [Spark 2.0] Performance regression when reading parquet and using PPD and non-vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16321. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.1.0 2.0.1 Resolved by https://github.com/apache/spark/pull/13701 > [Spark 2.0] Performance regression when reading parquet and using PPD and > non-vectorized reader > --- > > Key: SPARK-16321 > URL: https://issues.apache.org/jira/browse/SPARK-16321 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Liang-Chi Hsieh >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, > spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, > spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, > visualvm_spark2_G1GC.png > > > *UPDATE* > Please start with this comment > https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785 > I assume that problem results from the performance problem with reading > parquet files > *Original Issue description* > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is 2x slower. > {code} > df = sqlctx.read.parquet(path) > df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 > else []).collect() > {code} > Spark 1.6 -> 2.3 min > Spark 2.0 -> 4.6 min (2x slower) > I used BasicProfiler for this task and cumulative time was: > Spark 1.6 - 4300 sec > Spark 2.0 - 5800 sec > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421524#comment-15421524 ] Sital Kedia commented on SPARK-16922: - Yes, I have the above mentioned PR as well. > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17064) Reconsider spark.job.interruptOnCancel
[ https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra updated SPARK-17064: - Description: There is a frequent need or desire in Spark to cancel already running Tasks. This has been recognized for a very long time (see, e.g., the ancient TODO comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've never had more than an incomplete solution. Killing running Tasks at the Executor level has been implemented by interrupting the threads running the Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting Task threads in this way has only been possible if interruptThread is true, and that typically comes from the setting of the interruptOnCancel property in the JobGroup, which in turn typically comes from the setting of spark.job.interruptOnCancel. Because of concerns over https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes being marked dead when a Task thread is interrupted, the default value of the boolean has been "false" -- i.e. by default we do not interrupt Tasks already running on Executor even when the Task has been canceled in the DAGScheduler, or the Stage has been abort, or the Job has been killed, etc. There are several issues resulting from this current state of affairs, and they each probably need to spawn their own JIRA issue and PR once we decide on an overall strategy here. Among those issues: * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS versions that Spark now supports so that we can set the default value of spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely? * Even if interrupting Task threads is no longer an issue for HDFS, is it still enough of an issue for non-HDFS usage (e.g. 
Cassandra) so that we still need protection similar to what the current default value of spark.job.interruptOnCancel provides? * If interrupting Task threads isn't safe enough, what should we do instead? * Once we have a safe mechanism to stop and clean up after already executing Tasks, there is still the question of whether we _should_ end executing Tasks. While that is likely a good thing to do in cases where individual Tasks are lightweight in terms of resource usage, at least in some cases not all running Tasks should be ended: https://github.com/apache/spark/pull/12436 That means that we probably need to continue to make allowing Task interruption configurable at the Job or JobGroup level (and we need better documentation explaining how and when to allow interruption or not.) * There is one place in the current code (TaskSetManager#handleSuccessfulTask) that hard codes interruptThread to "true". This should be fixed, and similar misuses of killTask be denied in pull requests until this issue is adequately resolved. was: There is a frequent need or desire in Spark to cancel already running Tasks. This has been recognized for a very long time (see, e.g., the ancient TODO comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've never had more than an incomplete solution. Killing running Tasks at the Executor level has been implemented by interrupting the threads running the Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting Task threads in this way has only been possible if interruptThread is true, and that typically comes from the setting of the interruptOnCancel property in the JobGroup, which in turn typically comes from the setting of spark.job.interruptOnCancel. 
Because of concerns over https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes being marked dead when a Task thread is interrupted, the default value of the boolean has been "false" -- i.e. by default we do not interrupt Tasks already running on Executor even when the Task has been canceled in the DAGScheduler, or the Stage has been abort, or the Job has been killed, etc. There are several issues resulting from this current state of affairs, and they each probably need to spawn their own JIRA issue and PR once we decide on an overall strategy here. Among those issues: * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS versions that Spark now supports so that we set the default value of spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely? * Even if interrupting Task threads is no longer an issue for HDFS, is it still enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still need protection similar to what the current
[jira] [Commented] (SPARK-17064) Reconsider spark.job.interruptOnCancel
[ https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421503#comment-15421503 ] Mark Hamstra commented on SPARK-17064: -- [~kayousterhout] [~r...@databricks.com] [~imranr] > Reconsider spark.job.interruptOnCancel > -- > > Key: SPARK-17064 > URL: https://issues.apache.org/jira/browse/SPARK-17064 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Reporter: Mark Hamstra > > There is a frequent need or desire in Spark to cancel already running Tasks. > This has been recognized for a very long time (see, e.g., the ancient TODO > comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've > never had more than an incomplete solution. Killing running Tasks at the > Executor level has been implemented by interrupting the threads running the > Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since > https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 > addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting > Task threads in this way has only been possible if interruptThread is true, > and that typically comes from the setting of the interruptOnCancel property > in the JobGroup, which in turn typically comes from the setting of > spark.job.interruptOnCancel. Because of concerns over > https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes > being marked dead when a Task thread is interrupted, the default value of the > boolean has been "false" -- i.e. by default we do not interrupt Tasks already > running on Executor even when the Task has been canceled in the DAGScheduler, > or the Stage has been abort, or the Job has been killed, etc. > There are several issues resulting from this current state of affairs, and > they each probably need to spawn their own JIRA issue and PR once we decide > on an overall strategy here. 
Among those issues: > * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS > versions that Spark now supports so that we set the default value of > spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely? > * Even if interrupting Task threads is no longer an issue for HDFS, is it > still enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still > need protection similar to what the current default value of > spark.job.interruptOnCancel provides? > * If interrupting Task threads isn't safe enough, what should we do instead? > * Once we have a safe mechanism to stop and clean up after already executing > Tasks, there is still the question of whether we _should_ end executing > Tasks. While that is likely a good thing to do in cases where individual > Tasks are lightweight in terms of resource usage, at least in some cases not > all running Tasks should be ended: https://github.com/apache/spark/pull/12436 > That means that we probably need to continue to make allowing Task > interruption configurable at the Job or JobGroup level (and we need better > documentation explaining how and when to allow interruption or not.) > * There is one place in the current code > (TaskSetManager#handleSuccessfulTask) that hard codes interruptThread to > "true". This should be fixed, and similar misuses of killTask be denied in > pull requests until this issue is adequately resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17064) Reconsider spark.job.interruptOnCancel
Mark Hamstra created SPARK-17064: Summary: Reconsider spark.job.interruptOnCancel Key: SPARK-17064 URL: https://issues.apache.org/jira/browse/SPARK-17064 Project: Spark Issue Type: Improvement Components: Scheduler, Spark Core Reporter: Mark Hamstra There is a frequent need or desire in Spark to cancel already running Tasks. This has been recognized for a very long time (see, e.g., the ancient TODO comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've never had more than an incomplete solution. Killing running Tasks at the Executor level has been implemented by interrupting the threads running the Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting Task threads in this way has only been possible if interruptThread is true, and that typically comes from the setting of the interruptOnCancel property in the JobGroup, which in turn typically comes from the setting of spark.job.interruptOnCancel. Because of concerns over https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes being marked dead when a Task thread is interrupted, the default value of the boolean has been "false" -- i.e. by default we do not interrupt Tasks already running on Executor even when the Task has been canceled in the DAGScheduler, or the Stage has been abort, or the Job has been killed, etc. There are several issues resulting from this current state of affairs, and they each probably need to spawn their own JIRA issue and PR once we decide on an overall strategy here. Among those issues: * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS versions that Spark now supports so that we set the default value of spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely? 
* Even if interrupting Task threads is no longer an issue for HDFS, is it still enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still need protection similar to what the current default value of spark.job.interruptOnCancel provides? * If interrupting Task threads isn't safe enough, what should we do instead? * Once we have a safe mechanism to stop and clean up after already executing Tasks, there is still the question of whether we _should_ end executing Tasks. While that is likely a good thing to do in cases where individual Tasks are lightweight in terms of resource usage, at least in some cases not all running Tasks should be ended: https://github.com/apache/spark/pull/12436 That means that we probably need to continue to make allowing Task interruption configurable at the Job or JobGroup level (and we need better documentation explaining how and when to allow interruption or not.) * There is one place in the current code (TaskSetManager#handleSuccessfulTask) that hard codes interruptThread to "true". This should be fixed, and similar misuses of killTask be denied in pull requests until this issue is adequately resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
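One alternative raised by the "what should we do instead?" question above is cooperative cancellation: the task polls a cancellation flag at safe points rather than having its thread interrupted mid-I/O. A minimal Python sketch of the idea (illustrative only; Spark tasks are JVM threads and this is not Spark code):

```python
import threading
import time

def run_task(cancel, results):
    # The task checks the cancellation flag between units of work, so it
    # is never interrupted in the middle of an I/O call -- avoiding the
    # HDFS-1208-style hazard of interrupting a storage client.
    for _ in range(100):
        if cancel.is_set():
            results.append("cancelled")
            return
        time.sleep(0.001)  # stand-in for one unit of work

    results.append("done")

cancel = threading.Event()
results = []
worker = threading.Thread(target=run_task, args=(cancel, results))
worker.start()
cancel.set()   # "kill" the task: request cancellation cooperatively
worker.join()
print(results)  # -> ['cancelled']
```

The trade-off, as the issue notes, is that cooperative checks require the task body to poll, whereas Thread.interrupt works on unmodified code.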
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421484#comment-15421484 ] Davies Liu commented on SPARK-16922: Have you also have this one? https://github.com/apache/spark/pull/14373 > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11714) Make Spark on Mesos honor port restrictions
[ https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421478#comment-15421478 ] Charles Allen commented on SPARK-11714: --- Awesome! Thanks guys! > Make Spark on Mesos honor port restrictions > --- > > Key: SPARK-11714 > URL: https://issues.apache.org/jira/browse/SPARK-11714 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Charles Allen >Assignee: Stavros Kontopoulos > Fix For: 2.1.0 > > > Currently the MesosSchedulerBackend does not make any effort to honor "ports" > as a resource offer in Mesos. This ask is to have the ports which the > executor binds to honor the limits of the "ports" resource of an offer.
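For context, honoring the Mesos "ports" resource means every port an executor binds must fall inside one of the offered port ranges. A minimal Python sketch of that selection step (the function name and the fallback policy are illustrative assumptions, not Spark's actual Mesos scheduler code):

```python
def select_port(offered_ranges, requested=0):
    # offered_ranges: list of (lo, hi) pairs from the Mesos "ports" resource.
    # A requested port is honored only when some offered range contains it;
    # with no preference we simply take the first offered port (a policy
    # chosen for this sketch, not necessarily what Spark does).
    for lo, hi in offered_ranges:
        if requested and lo <= requested <= hi:
            return requested
    if requested:
        raise ValueError("port %d is outside the resource offer" % requested)
    return offered_ranges[0][0]

offer = [(31000, 31099), (32000, 32099)]
print(select_port(offer, 31050))  # requested port lies inside the offer
print(select_port(offer))         # no preference: first offered port
```

Binding to a port outside the offer is exactly what gets a task rejected when port restrictions are enforced, which is why the raise-rather-than-bind behaviour is sketched here.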
[jira] [Commented] (SPARK-16321) [Spark 2.0] Performance regression when reading parquet and using PPD and non-vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421468#comment-15421468 ] Maciej Bryński commented on SPARK-16321: [~davies] I think you can mark this one as resolved. > [Spark 2.0] Performance regression when reading parquet and using PPD and > non-vectorized reader > --- > > Key: SPARK-16321 > URL: https://issues.apache.org/jira/browse/SPARK-16321 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, > spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, > spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, > visualvm_spark2_G1GC.png > > > *UPDATE* > Please start with this comment > https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785 > I assume the problem results from a performance problem with reading > Parquet files. > *Original Issue description* > I did some tests on a Parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is 2x slower. > {code} > df = sqlctx.read.parquet(path) > df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 > else []).collect() > {code} > Spark 1.6 -> 2.3 min > Spark 2.0 -> 4.6 min (2x slower) > I used BasicProfiler for this task and cumulative time was: > Spark 1.6 - 4300 sec > Spark 2.0 - 5800 sec > Should I expect such a drop in performance? > I don't know how to prepare sample data to show the problem. > Any ideas? Or public data with many nested columns?
[jira] [Commented] (SPARK-16087) Spark Hangs When Using Union With Persisted Hadoop RDD
[ https://issues.apache.org/jira/browse/SPARK-16087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421457#comment-15421457 ] Nick Sakovich commented on SPARK-16087: --- [~kevinconaway], [~srowen] today I hit the same issue: my app works correctly with a locally installed Redis, but when I try to use a remote Redis instance the app just gets stuck. It also happens only when I use the "union" feature. I use Spark 1.6.2 and just run the app locally without Hadoop. Do you have any updates on what could be wrong here? > Spark Hangs When Using Union With Persisted Hadoop RDD > -- > > Key: SPARK-16087 > URL: https://issues.apache.org/jira/browse/SPARK-16087 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1, 1.6.1 >Reporter: Kevin Conaway >Priority: Critical > Attachments: SPARK-16087.dump.log, SPARK-16087.log, Screen Shot > 2016-06-21 at 4.27.26 PM.png, Screen Shot 2016-06-21 at 4.27.35 PM.png, > part-0, part-1, spark-16087.tar.gz > > > Spark hangs when materializing a persisted RDD that was built from a Hadoop > sequence file and then union-ed with a similar RDD. 
> Below is a small file that exhibits the issue: > {code:java} > import org.apache.hadoop.io.BytesWritable; > import org.apache.hadoop.io.LongWritable; > import org.apache.spark.SparkConf; > import org.apache.spark.api.java.JavaPairRDD; > import org.apache.spark.api.java.JavaSparkContext; > import org.apache.spark.api.java.function.PairFunction; > import org.apache.spark.serializer.KryoSerializer; > import org.apache.spark.storage.StorageLevel; > import scala.Tuple2; > public class SparkBug { > public static void main(String [] args) throws Exception { > JavaSparkContext sc = new JavaSparkContext( > new SparkConf() > .set("spark.serializer", KryoSerializer.class.getName()) > .set("spark.master", "local[*]") > .setAppName(SparkBug.class.getName()) > ); > JavaPairRDD<LongWritable, BytesWritable> rdd1 = sc.sequenceFile( > "hdfs://localhost:9000/part-0", > LongWritable.class, > BytesWritable.class > ).mapToPair(new PairFunction<Tuple2<LongWritable, BytesWritable>, > LongWritable, BytesWritable>() { > @Override > public Tuple2<LongWritable, BytesWritable> > call(Tuple2<LongWritable, BytesWritable> tuple) throws Exception { > return new Tuple2<>( > new LongWritable(tuple._1.get()), > new BytesWritable(tuple._2.copyBytes()) > ); > } > }).persist( > StorageLevel.MEMORY_ONLY() > ); > System.out.println("Before union: " + rdd1.count()); > JavaPairRDD<LongWritable, BytesWritable> rdd2 = sc.sequenceFile( > "hdfs://localhost:9000/part-1", > LongWritable.class, > BytesWritable.class > ); > JavaPairRDD<LongWritable, BytesWritable> joined = rdd1.union(rdd2); > System.out.println("After union: " + joined.count()); > } > } > {code} > You'll need to upload the attached part-0 and part-1 to a local hdfs > instance (I'm just using a dummy [Single Node > Cluster|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html] > locally). > Some things to note: > - It does not hang if rdd1 is not persisted > - It does not hang if rdd1 is not materialized (via calling rdd1.count()) > before the union-ed RDD is materialized > - It does not hang if the mapToPair() transformation is removed. 
[jira] [Resolved] (SPARK-16671) Merge variable substitution code in core and SQL
[ https://issues.apache.org/jira/browse/SPARK-16671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-16671. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.1.0 > Merge variable substitution code in core and SQL > > > Key: SPARK-16671 > URL: https://issues.apache.org/jira/browse/SPARK-16671 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.1.0 > > > SPARK-16272 added support for variable substitution in configs in the core > Spark configuration. That code has a lot of similarities with SQL's > {{VariableSubstitution}}, and we should make both use the same code as much > as possible.
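For reference, the logic being unified resolves {{${...}}} references inside configuration values. A minimal Python sketch of the idea (illustrative only; Spark's actual code is Scala and also handles prefixed references such as env/system variables):

```python
import re

def substitute(value, conf, depth=0):
    # Replace ${key} references with values looked up in conf.
    # Unresolvable keys are left as-is; the depth cap guards against
    # self-referential configs looping forever.
    if depth > 10:
        return value

    def repl(match):
        key = match.group(1)
        return substitute(conf.get(key, match.group(0)), conf, depth + 1)

    return re.sub(r"\$\{([^}]+)\}", repl, value)

conf = {"spark.home": "/opt/spark", "spark.log.dir": "${spark.home}/logs"}
print(substitute("${spark.log.dir}/app.log", conf))  # /opt/spark/logs/app.log
```

Both the core-config and SQL variants need essentially this resolve-then-recurse step, which is why sharing one implementation is attractive.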
[jira] [Commented] (SPARK-16508) Fix documentation warnings found by R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421393#comment-15421393 ] Shivaram Venkataraman commented on SPARK-16508: --- We merged https://github.com/apache/spark/pull/14522 but I'm keeping the JIRA open till we merge PR 14558 > Fix documentation warnings found by R CMD check > --- > > Key: SPARK-16508 > URL: https://issues.apache.org/jira/browse/SPARK-16508 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > A full list of warnings after the fixes in SPARK-16507 is at > https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb
[jira] [Commented] (SPARK-17054) SparkR can not run in yarn-cluster mode on mac os
[ https://issues.apache.org/jira/browse/SPARK-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421351#comment-15421351 ] Shivaram Venkataraman commented on SPARK-17054: --- [~zjffdu] We added this new feature as a part of SPARK-15799 -- The main idea here is that if users only download the R package from CRAN and dont have Spark installed on their machines then we will auto download spark. Note that this should only happen when the Spark jars cannot be found using SPARK_HOME -- If you launch SparkR using bin/sparkR or bin/spark-submit it should find the jars that are a part of Spark. Could you give us more details on how you launched SparkR and any other configuration used ? > SparkR can not run in yarn-cluster mode on mac os > - > > Key: SPARK-17054 > URL: https://issues.apache.org/jira/browse/SPARK-17054 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > This is due to it download sparkR to the wrong place. > {noformat} > Warning message: > 'sparkR.init' is deprecated. > Use 'sparkR.session' instead. > See help("Deprecated") > Spark not found in SPARK_HOME: . > To search in the cache directory. Installation will start if not found. > Mirror site not provided. > Looking for site suggested from apache website... > Preferred mirror site found: http://apache.mirror.cdnetworks.com/spark > Downloading Spark spark-2.0.0 for Hadoop 2.7 from: > - > http://apache.mirror.cdnetworks.com/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz > Fetch failed from http://apache.mirror.cdnetworks.com/spark > open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', > reason 'No such file or directory'> > To use backup site... 
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from: > - > http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz > Fetch failed from http://www-us.apache.org/dist/spark > open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', > reason 'No such file or directory'> > Error in robust_download_tar(mirrorUrl, version, hadoopVersion, packageName, > : > Unable to download Spark spark-2.0.0 for Hadoop 2.7. Please check network > connection, Hadoop version, or provide other mirror sites. > Calls: sparkRSQL.init ... sparkR.session -> install.spark -> > robust_download_tar > In addition: Warning messages: > 1: 'sparkRSQL.init' is deprecated. > Use 'sparkR.session' instead. > See help("Deprecated") > 2: In dir.create(localDir, recursive = TRUE) : > cannot create dir '/home//Library', reason 'Operation not supported' > Execution halted > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
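The log above points at the cache-directory choice as the root cause: a macOS-style {{Library/Caches}} path is being created under a Linux-style {{/home/...}} home directory. A Python sketch of platform-dependent cache-dir selection (the function name and exact defaults are illustrative assumptions about what {{install.spark}} intends, not its actual R code):

```python
import os
import sys

def spark_cache_dir(platform=None, home=None, env=None):
    # macOS keeps caches under ~/Library/Caches, while Linux uses
    # XDG_CACHE_HOME (default ~/.cache). Mixing the two layouts yields
    # paths like '/home/<user>/Library/Caches/spark' that cannot be
    # created, which is the failure shown in the log above.
    platform = platform or sys.platform
    home = home or os.path.expanduser("~")
    env = env if env is not None else os.environ
    if platform == "darwin":
        return os.path.join(home, "Library", "Caches", "spark")
    base = env.get("XDG_CACHE_HOME", os.path.join(home, ".cache"))
    return os.path.join(base, "spark")

print(spark_cache_dir("darwin", "/Users/jeff", {}))  # /Users/jeff/Library/Caches/spark
print(spark_cache_dir("linux", "/home/jeff", {}))    # /home/jeff/.cache/spark
```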
[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421336#comment-15421336 ] Shivaram Venkataraman commented on SPARK-16578: --- [~zjffdu] The main goal I had for this ticket was to enable running the R frontend on one machine (say a user's laptop) while running the SparkContext JVM on a different machine (say on a cluster). It will be interesting to see if we can support more of the use cases used by downstream projects like zeppelin / livy etc. I think [~junyangq] has been doing some work on this and can give us some more details on the changes. > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what the > pros / cons of it are
[jira] [Assigned] (SPARK-17063) MSCK REPAIR TABLE is super slow with Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17063: Assignee: Apache Spark (was: Davies Liu) > MSCK REPAIR TABLE is super slow with Hive metastore > --- > > Key: SPARK-17063 > URL: https://issues.apache.org/jira/browse/SPARK-17063 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > When repairing a table with thousands of partitions, it can take hundreds of > seconds: the Hive metastore can only add a few partitions per second, because > it lists all the files for each partition to gather the fast stats > (number of files, total size of files). > We could improve this by listing the files in Spark in parallel, then sending > the fast stats to the Hive metastore to avoid this sequential listing.
[jira] [Assigned] (SPARK-17063) MSCK REPAIR TABLE is super slow with Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17063: Assignee: Davies Liu (was: Apache Spark) > MSCK REPAIR TABLE is super slow with Hive metastore > --- > > Key: SPARK-17063 > URL: https://issues.apache.org/jira/browse/SPARK-17063 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > When repairing a table with thousands of partitions, it can take hundreds of > seconds: the Hive metastore can only add a few partitions per second, because > it lists all the files for each partition to gather the fast stats > (number of files, total size of files). > We could improve this by listing the files in Spark in parallel, then sending > the fast stats to the Hive metastore to avoid this sequential listing.
[jira] [Commented] (SPARK-17063) MSCK REPAIR TABLE is super slow with Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421335#comment-15421335 ] Apache Spark commented on SPARK-17063: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/14607 > MSCK REPAIR TABLE is super slow with Hive metastore > --- > > Key: SPARK-17063 > URL: https://issues.apache.org/jira/browse/SPARK-17063 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > When repairing a table with thousands of partitions, it can take hundreds of > seconds: the Hive metastore can only add a few partitions per second, because > it lists all the files for each partition to gather the fast stats > (number of files, total size of files). > We could improve this by listing the files in Spark in parallel, then sending > the fast stats to the Hive metastore to avoid this sequential listing.
[jira] [Created] (SPARK-17063) MSCK REPAIR TABLE is super slow with Hive metastore
Davies Liu created SPARK-17063: -- Summary: MSCK REPAIR TABLE is super slow with Hive metastore Key: SPARK-17063 URL: https://issues.apache.org/jira/browse/SPARK-17063 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu When repairing a table with thousands of partitions, it can take hundreds of seconds: the Hive metastore can only add a few partitions per second, because it lists all the files for each partition to gather the fast stats (number of files, total size of files). We could improve this by listing the files in Spark in parallel, then sending the fast stats to the Hive metastore to avoid this sequential listing.
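The proposed fix, conceptually, is to gather the per-partition "fast stats" (file count, total size) with a parallel listing before talking to the metastore, instead of letting the metastore list each partition sequentially. A minimal Python sketch of that parallel-listing step using local directories (illustrative only; the real change runs the listing as Spark tasks against a distributed filesystem):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def fast_stats(partition_dir):
    # The "fast stats" Hive wants per partition: file count and total size.
    files = [os.path.join(partition_dir, f) for f in os.listdir(partition_dir)]
    sizes = [os.path.getsize(f) for f in files if os.path.isfile(f)]
    return partition_dir, len(sizes), sum(sizes)

def repair_stats(partition_dirs, parallelism=8):
    # List all partitions in parallel rather than one at a time,
    # then the pre-computed stats can be sent to the metastore in bulk.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(fast_stats, partition_dirs))

# Demo with two fake partition directories, 10 bytes of data each.
root = tempfile.mkdtemp()
for p in ("ds=2016-08-01", "ds=2016-08-02"):
    d = os.path.join(root, p)
    os.makedirs(d)
    with open(os.path.join(d, "part-0"), "wb") as f:
        f.write(b"x" * 10)

stats = repair_stats(sorted(os.path.join(root, p) for p in os.listdir(root)))
for d, n, size in stats:
    print(os.path.basename(d), n, size)
```

The win comes from doing the slow part (the listing) with high parallelism so the metastore only has to record results, not walk the filesystem itself.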
[jira] [Commented] (SPARK-16158) Support pluggable dynamic allocation heuristics
[ https://issues.apache.org/jira/browse/SPARK-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421327#comment-15421327 ] Nezih Yigitbasi commented on SPARK-16158: - [~andrewor14] [~rxin] how do you guys feel about this? > Support pluggable dynamic allocation heuristics > --- > > Key: SPARK-16158 > URL: https://issues.apache.org/jira/browse/SPARK-16158 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nezih Yigitbasi > > It would be nice if Spark supported plugging in custom dynamic allocation > heuristics. This feature would be useful for experimenting with new > heuristics and also for plugging in different heuristics per job, etc.
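To make the "pluggable heuristic" idea concrete, the plugin point would be the function that maps scheduler state to a target executor count. A small Python sketch of such an interface (the class and method names are hypothetical, not Spark's API; Spark's actual default heuristic also factors in timers and backlog thresholds):

```python
class AllocationHeuristic:
    """Interface a pluggable heuristic would implement (illustrative)."""
    def target_executors(self, pending_tasks, tasks_per_executor):
        raise NotImplementedError

class DefaultHeuristic(AllocationHeuristic):
    # Simplified default behaviour: request enough executors to run
    # every pending task at once (ceiling division).
    def target_executors(self, pending_tasks, tasks_per_executor):
        return -(-pending_tasks // tasks_per_executor)

class ConservativeHeuristic(AllocationHeuristic):
    # Example custom policy: only scale for half of the backlog.
    def target_executors(self, pending_tasks, tasks_per_executor):
        return -(-(pending_tasks // 2) // tasks_per_executor)

for h in (DefaultHeuristic(), ConservativeHeuristic()):
    print(type(h).__name__, h.target_executors(100, 4))
```

Swapping heuristics per job would then just mean instantiating a different implementation from a config value, without touching the allocation manager itself.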
[jira] [Assigned] (SPARK-17062) Add --conf to mesos dispatcher process
[ https://issues.apache.org/jira/browse/SPARK-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17062: Assignee: Apache Spark > Add --conf to mesos dispatcher process > -- > > Key: SPARK-17062 > URL: https://issues.apache.org/jira/browse/SPARK-17062 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Stavros Kontopoulos >Assignee: Apache Spark > > Sometimes we simply need to add a property to the Spark config for the Mesos > Dispatcher. The only option right now is to create a property file.
[jira] [Assigned] (SPARK-17062) Add --conf to mesos dispatcher process
[ https://issues.apache.org/jira/browse/SPARK-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17062: Assignee: (was: Apache Spark) > Add --conf to mesos dispatcher process > -- > > Key: SPARK-17062 > URL: https://issues.apache.org/jira/browse/SPARK-17062 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Stavros Kontopoulos > > Sometimes we simply need to add a property to the Spark config for the Mesos > Dispatcher. The only option right now is to create a property file.
[jira] [Commented] (SPARK-17062) Add --conf to mesos dispatcher process
[ https://issues.apache.org/jira/browse/SPARK-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421315#comment-15421315 ] Apache Spark commented on SPARK-17062: -- User 'skonto' has created a pull request for this issue: https://github.com/apache/spark/pull/14650 > Add --conf to mesos dispatcher process > -- > > Key: SPARK-17062 > URL: https://issues.apache.org/jira/browse/SPARK-17062 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Stavros Kontopoulos > > Sometimes we simply need to add a property to the Spark config for the Mesos > Dispatcher. The only option right now is to create a property file.
[jira] [Created] (SPARK-17062) Add --conf to mesos dispatcher process
Stavros Kontopoulos created SPARK-17062: --- Summary: Add --conf to mesos dispatcher process Key: SPARK-17062 URL: https://issues.apache.org/jira/browse/SPARK-17062 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 2.0.0 Reporter: Stavros Kontopoulos Sometimes we simply need to add a property to the Spark config for the Mesos Dispatcher. The only option right now is to create a property file.
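Conceptually, the requested change makes the dispatcher accept repeated {{--conf key=value}} pairs on its command line, the way spark-submit does, instead of requiring a properties file. A Python sketch of that argument handling (a hypothetical helper for illustration; the real change lives in the dispatcher's Scala argument parser):

```python
def parse_conf_args(argv):
    # Collect --conf key=value pairs into a config dict; the value may
    # itself contain '=' characters, so split only on the first one.
    conf = {}
    it = iter(argv)
    for arg in it:
        if arg == "--conf":
            key, _, value = next(it).partition("=")
            conf[key] = value
    return conf

args = ["--conf", "spark.mesos.dispatcher.webui.url=http://host:8081",
        "--conf", "spark.cores.max=4"]
print(parse_conf_args(args))
```

Each parsed pair would then be applied to the dispatcher's SparkConf before startup, exactly as a property file entry would be.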
[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result
[ https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421306#comment-15421306 ] Jamie Hutton commented on SPARK-15002: -- Hi Sean, Looking at the stack trace on the executors, quite a few of them have the following stack trace - which mentions "fillInStackTrace" - it seems like kryo is going wrong to me. java.lang.Throwable.fillInStackTrace(Native Method) java.lang.Throwable.fillInStackTrace(Throwable.java:783) java.lang.Throwable.(Throwable.java:265) java.lang.Exception.(Exception.java:66) java.lang.ReflectiveOperationException.(ReflectiveOperationException.java:56) java.lang.NoSuchMethodException.(NoSuchMethodException.java:51) java.lang.Class.getConstructor0(Class.java:3082) java.lang.Class.getConstructor(Class.java:1825) com.esotericsoftware.kryo.factories.ReflectionSerializerFactory.makeSerializer(ReflectionSerializerFactory.java:60) com.esotericsoftware.kryo.factories.ReflectionSerializerFactory.makeSerializer(ReflectionSerializerFactory.java:45) com.esotericsoftware.kryo.Kryo.getDefaultSerializer(Kryo.java:359) com.esotericsoftware.kryo.Kryo.register(Kryo.java:394) org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$1.apply(KryoSerializer.scala:97) org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$1.apply(KryoSerializer.scala:96) scala.collection.immutable.List.foreach(List.scala:381) org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:96) org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:274) org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:259) org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:175) org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:59) org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) org.apache.spark.rdd.RDD.iterator(RDD.scala:283) org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) (the last 3 lines above repeat a number of times in the trace so I truncated them) The main thread is clearly locked: sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623) org.apache.spark.SparkContext.runJob(SparkContext.scala:1871) > Calling unpersist can cause spark to hang indefinitely when writing out a > result > > > Key: SPARK-15002 > URL: https://issues.apache.org/jira/browse/SPARK-15002 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.2, 1.6.0, 2.0.0 > Environment: AWS and Linux VM, both in spark-shell and spark-submit. > tested in 1.5.2 and 1.6. Tested in 2.0 >Reporter: Jamie Hutton > > The following code will cause spark to hang indefinitely. It happens when you > have an unpersist which is followed by some futher processing of that data > (in my case writing it out). > I first experienced this issue with graphX so my example below involved > graphX code, however i suspect it might be more of a core issue than a graphx > one. 
I have raised another bug with similar results (indefinite hanging) but > in different circumstances, so these may well be linked > Code to reproduce (can be run in spark-shell): > import org.apache.spark.graphx._ > import org.apache.spark.rdd.RDD > import org.apache.spark.sql.types._ > import org.apache.spark.sql._ > val r = scala.util.Random > val list = (0L to 500L).map(i=>(i,r.nextInt(500).asInstanceOf[Long],"LABEL")) > val distData = sc.parallelize(list) > val edgesRDD = distData.map(x => Edge(x._1, x._2, x._3)) > val distinctNodes = distData.flatMap{row => Iterable((row._1, ("A")),(row._2, > ("B")))}.distinct() > val nodesRDD: RDD[(VertexId, (String))] = distinctNodes > val graph = Graph(nodesRDD, edgesRDD) > graph.persist() > val ccGraph = graph.connectedComponents() > ccGraph.cache > > val schema =
[jira] [Commented] (SPARK-17054) SparkR can not run in yarn-cluster mode on mac os
[ https://issues.apache.org/jira/browse/SPARK-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421288#comment-15421288 ] Miao Wang commented on SPARK-17054: --- I use Mac and build from source. sparkR works fine. How to reproduce this issue? Thanks! > SparkR can not run in yarn-cluster mode on mac os > - > > Key: SPARK-17054 > URL: https://issues.apache.org/jira/browse/SPARK-17054 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > This is due to it download sparkR to the wrong place. > {noformat} > Warning message: > 'sparkR.init' is deprecated. > Use 'sparkR.session' instead. > See help("Deprecated") > Spark not found in SPARK_HOME: . > To search in the cache directory. Installation will start if not found. > Mirror site not provided. > Looking for site suggested from apache website... > Preferred mirror site found: http://apache.mirror.cdnetworks.com/spark > Downloading Spark spark-2.0.0 for Hadoop 2.7 from: > - > http://apache.mirror.cdnetworks.com/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz > Fetch failed from http://apache.mirror.cdnetworks.com/spark > open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', > reason 'No such file or directory'> > To use backup site... > Downloading Spark spark-2.0.0 for Hadoop 2.7 from: > - > http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz > Fetch failed from http://www-us.apache.org/dist/spark > open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', > reason 'No such file or directory'> > Error in robust_download_tar(mirrorUrl, version, hadoopVersion, packageName, > : > Unable to download Spark spark-2.0.0 for Hadoop 2.7. Please check network > connection, Hadoop version, or provide other mirror sites. > Calls: sparkRSQL.init ... sparkR.session -> install.spark -> > robust_download_tar > In addition: Warning messages: > 1: 'sparkRSQL.init' is deprecated. 
> Use 'sparkR.session' instead. > See help("Deprecated") > 2: In dir.create(localDir, recursive = TRUE) : > cannot create dir '/home//Library', reason 'Operation not supported' > Execution halted > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100
[ https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jamie Hutton updated SPARK-17061: - Affects Version/s: 2.0.1 > Incorrect results returned following a join of two datasets and a map step > where total number of columns >100 > - > > Key: SPARK-17061 > URL: https://issues.apache.org/jira/browse/SPARK-17061 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.0.1 >Reporter: Jamie Hutton >Priority: Critical > > We have hit a consistent bug where we have a dataset with more than 100 > columns. I am raising this as a blocker because Spark is returning the WRONG > results rather than erroring, leading to data integrity issues. > I have put together the following test case which will show the issue (it > will run in spark-shell). In this example I am joining a dataset with lots of > fields onto another dataset. > The join works fine and if you show the dataset you will get the expected > result. However if you run a map step over the dataset you end up with a > strange error where the sequence that is in the right dataset now only > contains the last value. > Whilst this test may seem a rather contrived example, what we are doing here > is a very standard analytical pattern. My original code was designed to: > - take a dataset of child records > - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children]) > - join the children onto the parent by parentID: giving > ((Parent),(ParentID,Seq[Children])) > - map over the result to give a tuple of (Parent,Seq[Children]) > Notes: > - The issue is resolved by having fewer fields - as soon as we go <= 100 the > integrity issue goes away. Try removing one of the fields from BigCaseClass > below > - The issue will arise based on the total number of fields in the resulting > dataset. 
Below I have a small case class and a big case class, but two case > classes of 50 variables would give the same issue. > - The issue occurs where the case class being joined on (on the right) has a > case class type. It doesn't occur if you have a Seq[String]. > - If I go back to an RDD for the map step after the join I can work around the > issue, but I lose all the benefits of datasets. > Scala code test case: > case class Name(name: String) > case class SmallCaseClass (joinkey: Integer, names: Seq[Name]) > case class BigCaseClass (field1: Integer,field2: Integer,field3: > Integer,field4: Integer,field5: Integer,field6: Integer,field7: > Integer,field8: Integer,field9: Integer,field10: Integer,field11: > Integer,field12: Integer,field13: Integer,field14: Integer,field15: > Integer,field16: Integer,field17: Integer,field18: Integer,field19: > Integer,field20: Integer,field21: Integer,field22: Integer,field23: > Integer,field24: Integer,field25: Integer,field26: Integer,field27: > Integer,field28: Integer,field29: Integer,field30: Integer,field31: > Integer,field32: Integer,field33: Integer,field34: Integer,field35: > Integer,field36: Integer,field37: Integer,field38: Integer,field39: > Integer,field40: Integer,field41: Integer,field42: Integer,field43: > Integer,field44: Integer,field45: Integer,field46: Integer,field47: > Integer,field48: Integer,field49: Integer,field50: Integer,field51: > Integer,field52: Integer,field53: Integer,field54: Integer,field55: > Integer,field56: Integer,field57: Integer,field58: Integer,field59: > Integer,field60: Integer,field61: Integer,field62: Integer,field63: > Integer,field64: Integer,field65: Integer,field66: Integer,field67: > Integer,field68: Integer,field69: Integer,field70: Integer,field71: > Integer,field72: Integer,field73: Integer,field74: Integer,field75: > Integer,field76: Integer,field77: Integer,field78: Integer,field79: > Integer,field80: Integer,field81: Integer,field82: Integer,field83: > Integer,field84: 
Integer,field85: Integer,field86: Integer,field87: > Integer,field88: Integer,field89: Integer,field90: Integer,field91: > Integer,field92: Integer,field93: Integer,field94: Integer,field95: > Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer) > > val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, > 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, > 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, > 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, > 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, > 91, 92, 93, 94, 95, 96, 97, 98, 99)) > > val smallCC=Seq(SmallCaseClass(1,Seq( > Name("Jamie"), > Name("Ian"), > Name("Dave"), >
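The parent/child pattern the reporter describes (group children up to the parent, join onto the parent, then map to a (Parent, Seq[Children]) tuple) can be sketched with plain Scala collections. This is a minimal, Spark-free analogue of that pattern for illustration only; the `Child`/`Parent` names are hypothetical, and it is not the Dataset code that actually triggers the bug:

```scala
// Hypothetical stand-ins for the reporter's child/parent records
case class Child(parentId: Int, name: String)
case class Parent(id: Int, label: String)

object JoinPattern {
  def main(args: Array[String]): Unit = {
    val children = Seq(Child(1, "Jamie"), Child(1, "Ian"), Child(2, "Dave"))
    val parents  = Seq(Parent(1, "A"), Parent(2, "B"))

    // "groupByKey up to the parent": (parentId, Seq[Child])
    val grouped: Map[Int, Seq[Child]] = children.groupBy(_.parentId)

    // "join onto the parent, then map": (Parent, Seq[Child])
    val joined: Seq[(Parent, Seq[Child])] =
      parents.flatMap(p => grouped.get(p.id).map(cs => (p, cs)))

    joined.foreach { case (p, cs) => println(s"${p.id} -> ${cs.map(_.name)}") }
  }
}
```

In the reported bug, the equivalent Dataset version of the final map step corrupts the right-hand Seq (every element becomes the last value) once the joined row exceeds 100 fields.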
[jira] [Reopened] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100
[ https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jamie Hutton reopened SPARK-17061: -- Tested in 2.0.1 nightly snapshot and still not resolved so this appears not to be a dupe > Incorrect results returned following a join of two datasets and a map step > where total number of columns >100 > - > > Key: SPARK-17061 > URL: https://issues.apache.org/jira/browse/SPARK-17061 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jamie Hutton >Priority: Critical
[jira] [Commented] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100
[ https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421277#comment-15421277 ] Jamie Hutton commented on SPARK-17061: -- I have just downloaded the 2.0.1 nightly build from here: http://people.apache.org/~pwendell/spark-nightly/spark-branch-2.0-bin/spark-2.0.1-SNAPSHOT-2016_08_15_00_23-e02d0d0-bin/ Unfortunately the issue is still present. I am going to downgrade the issue to critical as you suggested and re-open, if that's ok. Please can you guys take a look at it? Thanks > Incorrect results returned following a join of two datasets and a map step > where total number of columns >100 > - > > Key: SPARK-17061 > URL: https://issues.apache.org/jira/browse/SPARK-17061 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jamie Hutton >Priority: Critical
[jira] [Updated] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100
[ https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jamie Hutton updated SPARK-17061: - Priority: Critical (was: Blocker) > Incorrect results returned following a join of two datasets and a map step > where total number of columns >100 > - > > Key: SPARK-17061 > URL: https://issues.apache.org/jira/browse/SPARK-17061 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jamie Hutton >Priority: Critical
[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result
[ https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421234#comment-15421234 ] Sean Owen commented on SPARK-15002: --- Yeah, I just mean it ought to be fine and there's no obvious reason something would fail. If nothing is RUNNING ... can you search for "deadlock"? The JVM can identify some deadlocks by itself. It's sometimes worth browsing any "waiting on lock" messages in the trace to see if something is suspicious. I'm wondering, however, how you can have 100% CPU but no running threads... are we talking about the same thing in both cases? An executor? > Calling unpersist can cause spark to hang indefinitely when writing out a > result > > > Key: SPARK-15002 > URL: https://issues.apache.org/jira/browse/SPARK-15002 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.2, 1.6.0, 2.0.0 > Environment: AWS and Linux VM, both in spark-shell and spark-submit. > tested in 1.5.2 and 1.6. Tested in 2.0 >Reporter: Jamie Hutton > > The following code will cause spark to hang indefinitely. It happens when you > have an unpersist which is followed by some further processing of that data > (in my case writing it out). > I first experienced this issue with graphX, so my example below involves > graphX code; however, I suspect it might be more of a core issue than a graphX > one. 
I have raised another bug with similar results (indefinite hanging) but > in different circumstances, so these may well be linked. > Code to reproduce (can be run in spark-shell): > import org.apache.spark.graphx._ > import org.apache.spark.rdd.RDD > import org.apache.spark.sql.types._ > import org.apache.spark.sql._ > val r = scala.util.Random > val list = (0L to 500L).map(i=>(i,r.nextInt(500).asInstanceOf[Long],"LABEL")) > val distData = sc.parallelize(list) > val edgesRDD = distData.map(x => Edge(x._1, x._2, x._3)) > val distinctNodes = distData.flatMap{row => Iterable((row._1, ("A")),(row._2, > ("B")))}.distinct() > val nodesRDD: RDD[(VertexId, (String))] = distinctNodes > val graph = Graph(nodesRDD, edgesRDD) > graph.persist() > val ccGraph = graph.connectedComponents() > ccGraph.cache > > val schema = StructType(Seq(StructField("id", LongType, false), > StructField("netid", LongType, false))) > val > rdd=ccGraph.vertices.map(row=>(Row(row._1.asInstanceOf[Long],row._2.asInstanceOf[Long]))) > val builtEntityDF = sqlContext.createDataFrame(rdd, schema) > > /*this unpersist step causes the issue*/ > ccGraph.unpersist() > > /*write step hangs forever*/ > builtEntityDF.write.format("parquet").mode("overwrite").save("/user/root/writetest.parquet") > > If you take out the ccGraph.unpersist() step the write step completes > successfully
[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result
[ https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421192#comment-15421192 ] Jamie Hutton commented on SPARK-15002: -- I took a look at the executors and there is nothing in a RUNNING state. I do agree that there is no major reason to call unpersist in the above code, but I put that example together as the simplest possible test case I could find that would replicate it. In the actual code we have written, we do wish to use unpersist, and we ended up in this hanging state. > Calling unpersist can cause spark to hang indefinitely when writing out a > result > > > Key: SPARK-15002 > URL: https://issues.apache.org/jira/browse/SPARK-15002 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.2, 1.6.0, 2.0.0 > Environment: AWS and Linux VM, both in spark-shell and spark-submit. > tested in 1.5.2 and 1.6. Tested in 2.0 >Reporter: Jamie Hutton
[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result
[ https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421186#comment-15421186 ] Sean Owen commented on SPARK-15002: --- It would be the workers. Oh, and I meant RUNNING rather than RUNNABLE. The idea is: where are the running tasks working when you snapshot them? A couple of looks at a couple of executors should reveal if something is stuck or just busy in a particular piece of code. That's the easiest first step to figure out if it's just that something is busy. It may not be. There's no particularly good reason that removing unpersist() there should help. All it does is make it like you never persisted it to begin with, and it's only touched once anyway. > Calling unpersist can cause spark to hang indefinitely when writing out a > result > > > Key: SPARK-15002 > URL: https://issues.apache.org/jira/browse/SPARK-15002 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.2, 1.6.0, 2.0.0 > Environment: AWS and Linux VM, both in spark-shell and spark-submit. > tested in 1.5.2 and 1.6. Tested in 2.0 >Reporter: Jamie Hutton
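Sean's suggestion of searching thread dumps for "deadlock" can also be checked programmatically on a stuck executor or driver JVM. The sketch below uses the standard JDK `ThreadMXBean` API (plain Java management, nothing Spark-specific) to ask the JVM for monitor-deadlocked threads; it is an illustrative sketch of the technique, not part of the reproduction:

```scala
import java.lang.management.ManagementFactory

object DeadlockCheck {
  def main(args: Array[String]): Unit = {
    val mx = ManagementFactory.getThreadMXBean
    // findDeadlockedThreads() returns null when no threads are deadlocked
    Option(mx.findDeadlockedThreads()) match {
      case Some(ids) =>
        // Report each deadlocked thread and the lock it is blocked on
        mx.getThreadInfo(ids).foreach { info =>
          println(s"Deadlocked: ${info.getThreadName} waiting on ${info.getLockName}")
        }
      case None =>
        println("No deadlocked threads detected")
    }
  }
}
```

A `jstack <pid>` dump reports the same information in its deadlock section; running a check like this inside the hung process (or attaching with jstack) distinguishes a true deadlock from threads that are merely busy.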
[jira] [Commented] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100
[ https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421178#comment-15421178 ] Sean Owen commented on SPARK-17061: --- It is likely to be -- see also SPARK-17043. At least, I'd try a version with this fix before reopening this one. > Incorrect results returned following a join of two datasets and a map step > where total number of columns >100 > - > > Key: SPARK-17061 > URL: https://issues.apache.org/jira/browse/SPARK-17061 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jamie Hutton >Priority: Blocker
[jira] [Assigned] (SPARK-17059) Allow FileFormat to specify partition pruning strategy
[ https://issues.apache.org/jira/browse/SPARK-17059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17059: Assignee: Apache Spark > Allow FileFormat to specify partition pruning strategy > -- > > Key: SPARK-17059 > URL: https://issues.apache.org/jira/browse/SPARK-17059 > Project: Spark > Issue Type: Bug >Reporter: Andrew Duffy >Assignee: Apache Spark > > Allow Spark to have pluggable pruning of input files for FileSourceScanExec > by allowing FileFormats to specify a format-specific filterPartitions method. > This is especially useful for Parquet, as Spark does not currently make use of > the summary metadata, instead reading the footer of all part files for a > Parquet data source. This can lead to massive speedups when reading a > filtered chunk of a dataset, especially when using remote storage (S3).
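To illustrate the idea behind a format-specific filterPartitions hook, here is a hedged, Spark-free sketch: the `PruningFormat` trait, `PartitionFile` type, and min/max statistics are hypothetical stand-ins for the proposal, not Spark's actual FileFormat API. Footer or summary metadata gives per-file value ranges, so files whose range cannot match the predicate are never opened:

```scala
// Hypothetical file-level metadata: each file carries min/max stats for a key column
case class PartitionFile(path: String, minKey: Int, maxKey: Int)

// Hypothetical pruning hook a file format could implement
trait PruningFormat {
  // Keep only files whose [minKey, maxKey] range can overlap [lower, upper]
  def filterPartitions(files: Seq[PartitionFile], lower: Int, upper: Int): Seq[PartitionFile] =
    files.filter(f => f.maxKey >= lower && f.minKey <= upper)
}

object ParquetLikeFormat extends PruningFormat

object Demo {
  def main(args: Array[String]): Unit = {
    val files = Seq(
      PartitionFile("part-0", 0, 9),
      PartitionFile("part-1", 10, 19),
      PartitionFile("part-2", 20, 29))
    // A filter like "key BETWEEN 12 AND 25" should touch only two of the three files
    val kept = ParquetLikeFormat.filterPartitions(files, 12, 25)
    println(kept.map(_.path))
  }
}
```

On remote storage such as S3, skipping even one footer read per pruned file is where the speedup described above comes from.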
[jira] [Commented] (SPARK-17059) Allow FileFormat to specify partition pruning strategy
[ https://issues.apache.org/jira/browse/SPARK-17059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421183#comment-15421183 ] Apache Spark commented on SPARK-17059: -- User 'andreweduffy' has created a pull request for this issue: https://github.com/apache/spark/pull/14649
[jira] [Assigned] (SPARK-17059) Allow FileFormat to specify partition pruning strategy
[ https://issues.apache.org/jira/browse/SPARK-17059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17059: Assignee: (was: Apache Spark)
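As a rough illustration of the pluggable pruning idea in SPARK-17059, a format could expose a hook that drops whole input files from the scan using per-file statistics. This is a hedged sketch only: the trait name, `FileStats`, and the predicate type below are invented for illustration and are not Spark's actual internal API.

```scala
// Sketch of the proposed idea: let a file format prune input files up front
// using per-file min/max statistics, instead of opening every footer.
// All names here (PrunableFormat, FileStats, GreaterThan) are hypothetical.
case class FileStats(path: String, min: Long, max: Long)
case class GreaterThan(column: String, value: Long)

trait PrunableFormat {
  def filterPartitions(filter: GreaterThan, files: Seq[FileStats]): Seq[FileStats]
}

object MinMaxPruning extends PrunableFormat {
  // A file can only match "column > value" if its max exceeds the value.
  override def filterPartitions(filter: GreaterThan, files: Seq[FileStats]): Seq[FileStats] =
    files.filter(_.max > filter.value)
}
```

With remote storage such as S3, skipping a file this way saves a round trip per footer, which is where the reported speedups would come from.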
[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result
[ https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421182#comment-15421182 ] Jamie Hutton commented on SPARK-15002: -- GC time is 1.5 seconds and does not increase whilst in the hanging state. The runnable tasks are: Signal Dispatcher shuffle-server-0 (x2) executor task launch worker (x8) ResponseProcessor for block BP-267552868-172.16.137.143-1457691099567:blk_1073937571_196881 Does any of the above help? > Calling unpersist can cause spark to hang indefinitely when writing out a > result > > > Key: SPARK-15002 > URL: https://issues.apache.org/jira/browse/SPARK-15002 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.2, 1.6.0, 2.0.0 > Environment: AWS and Linux VM, both in spark-shell and spark-submit. > Tested in 1.5.2, 1.6 and 2.0. >Reporter: Jamie Hutton > > The following code will cause Spark to hang indefinitely. It happens when you > have an unpersist which is followed by some further processing of that data > (in my case writing it out). > I first experienced this issue with GraphX, so my example below involves > GraphX code; however, I suspect it may be more of a core issue than a GraphX > one. I have raised another bug with similar results (indefinite hanging) but > in different circumstances, so these may well be linked. > Code to reproduce (can be run in spark-shell): > import org.apache.spark.graphx._ > import org.apache.spark.rdd.RDD > import org.apache.spark.sql.types._ > import org.apache.spark.sql._ > val r = scala.util.Random > val list = (0L to 500L).map(i=>(i,r.nextInt(500).asInstanceOf[Long],"LABEL")) > val distData = sc.parallelize(list) > val edgesRDD = distData.map(x => Edge(x._1, x._2, x._3)) > val distinctNodes = distData.flatMap{row => Iterable((row._1, ("A")),(row._2, > ("B")))}.distinct() > val nodesRDD: RDD[(VertexId, (String))] = distinctNodes > val graph = Graph(nodesRDD, edgesRDD) > graph.persist() > val ccGraph = graph.connectedComponents() > ccGraph.cache > > val schema = StructType(Seq(StructField("id", LongType, false), > StructField("netid", LongType, false))) > val rdd=ccGraph.vertices.map(row=>(Row(row._1.asInstanceOf[Long],row._2.asInstanceOf[Long]))) > val builtEntityDF = sqlContext.createDataFrame(rdd, schema) > > /*this unpersist step causes the issue*/ > ccGraph.unpersist() > > /*write step hangs forever*/ > builtEntityDF.write.format("parquet").mode("overwrite").save("/user/root/writetest.parquet") > > If you take out the ccGraph.unpersist() step the write step completes > successfully
[jira] [Commented] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100
[ https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421169#comment-15421169 ] Jamie Hutton commented on SPARK-17061: -- Apologies for setting blocker. I won't use that again. Is the above definitely the same issue as the one you marked as a duplicate? That one does seem to be slightly different, as it relates to persist and 200 columns. Did you manage to run the above test case on a 2.0.1 codebase, and did it work? > Incorrect results returned following a join of two datasets and a map step > where total number of columns >100 > - > > Key: SPARK-17061 > URL: https://issues.apache.org/jira/browse/SPARK-17061 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jamie Hutton >Priority: Blocker > > We have hit a consistent bug where we have a dataset with more than 100 > columns. I am raising this as a blocker because Spark is returning the WRONG > results rather than erroring, leading to data integrity issues. > I have put together the following test case which will show the issue (it > will run in spark-shell). In this example I am joining a dataset with lots of > fields onto another dataset. > The join works fine, and if you show the dataset you will get the expected > result. However, if you run a map step over the dataset you end up with a > strange error where the sequence that is in the right dataset now only > contains the last value. > Whilst this test may seem a rather contrived example, what we are doing here > is a very standard analytical pattern. 
My original code was designed to: > - take a dataset of child records > - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children]) > - join the children onto the parent by parentID: giving > ((Parent),(ParentID,Seq[Children])) > - map over the result to give a tuple of (Parent,Seq[Children]) > Notes: > - The issue is resolved by having fewer fields - as soon as we go <= 100 the > integrity issue goes away. Try removing one of the fields from BigCaseClass > below > - The issue will arise based on the total number of fields in the resulting > dataset. Below I have a small case class and a big case class, but two case > classes of 50 variables would give the same issue > - the issue occurs where the case class being joined on (on the right) has a > case class type. It doesn't occur if you have a Seq[String] > - If I go back to an RDD for the map step after the join I can work around the > issue, but I lose all the benefits of datasets > Scala code test case: > case class Name(name: String) > case class SmallCaseClass (joinkey: Integer, names: Seq[Name]) > case class BigCaseClass (field1: Integer,field2: Integer,field3: > Integer,field4: Integer,field5: Integer,field6: Integer,field7: > Integer,field8: Integer,field9: Integer,field10: Integer,field11: > Integer,field12: Integer,field13: Integer,field14: Integer,field15: > Integer,field16: Integer,field17: Integer,field18: Integer,field19: > Integer,field20: Integer,field21: Integer,field22: Integer,field23: > Integer,field24: Integer,field25: Integer,field26: Integer,field27: > Integer,field28: Integer,field29: Integer,field30: Integer,field31: > Integer,field32: Integer,field33: Integer,field34: Integer,field35: > Integer,field36: Integer,field37: Integer,field38: Integer,field39: > Integer,field40: Integer,field41: Integer,field42: Integer,field43: > Integer,field44: Integer,field45: Integer,field46: Integer,field47: > Integer,field48: Integer,field49: Integer,field50: Integer,field51: > Integer,field52: Integer,field53: Integer,field54: Integer,field55: > Integer,field56: Integer,field57: Integer,field58: Integer,field59: > Integer,field60: Integer,field61: Integer,field62: Integer,field63: > Integer,field64: Integer,field65: Integer,field66: Integer,field67: > Integer,field68: Integer,field69: Integer,field70: Integer,field71: > Integer,field72: Integer,field73: Integer,field74: Integer,field75: > Integer,field76: Integer,field77: Integer,field78: Integer,field79: > Integer,field80: Integer,field81: Integer,field82: Integer,field83: > Integer,field84: 
Integer,field53: Integer,field54: Integer,field55: > Integer,field56: Integer,field57: Integer,field58: Integer,field59: > Integer,field60: Integer,field61: Integer,field62: Integer,field63: > Integer,field64: Integer,field65: Integer,field66: Integer,field67: > Integer,field68: Integer,field69: Integer,field70: Integer,field71: > Integer,field72: Integer,field73: Integer,field74: Integer,field75: > Integer,field76: Integer,field77: Integer,field78: Integer,field79: > Integer,field80: Integer,field81: Integer,field82: Integer,field83: > Integer,field84: Integer,field85: Integer,field86: Integer,field87: > Integer,field88: Integer,field89: Integer,field90: Integer,field91: > Integer,field92: Integer,field93: Integer,field94: Integer,field95: > Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer) > > val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, > 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, > 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, > 53, 54, 55, 56, 57, 58, 59, 60,
[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result
[ https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421164#comment-15421164 ] Sean Owen commented on SPARK-15002: --- In the UI, go look at a heap dump of the pegged executor. It should show you RUNNABLE threads and it ought to be sort of clear where they're busy. Also check GC time. If it's a non-trivial fraction of execution time, at least it's clear that somehow it's memory pressure.
[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result
[ https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421160#comment-15421160 ] Jamie Hutton commented on SPARK-15002: -- Hi Sean. There are no errors. When I run the code above, the progress bar appears but never moves. The CPU is pegged at 100%. I have done a thread dump but I am not really sure how to interpret it - if you run the code above it is very simple to reproduce. If you remove the "unpersist" step in the code, the entire script runs in a couple of seconds, so it is not an issue with a complex task.
[jira] [Resolved] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100
[ https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17061. --- Resolution: Duplicate Search JIRA please, and don't set blocker
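The RDD workaround mentioned in the SPARK-17061 report (dropping out of the Dataset API for the map step after the join) can be sketched as follows. This assumes `joined` is the Dataset of (SmallCaseClass, BigCaseClass) pairs produced by the join in the test case above; the exact shape of the join output is an assumption here, not taken from the report.

```scala
import spark.implicits._

// Workaround sketch: do the post-join map on the underlying RDD, where plain
// Scala objects are manipulated and the >100-column encoder path is never
// exercised, then convert back to a Dataset. `joined` is assumed to come
// from the join in the test case above.
val mapped = joined.rdd.map { case (small, big) => (big, small.names) }
val backToDs = mapped.toDS()  // gives up Dataset optimizations for this one step
backToDs.show()
```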
[jira] [Assigned] (SPARK-16995) TreeNodeException when flat mapping RelationalGroupedDataset created from DataFrame containing a column created with lit/expr
[ https://issues.apache.org/jira/browse/SPARK-16995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16995: Assignee: (was: Apache Spark) > TreeNodeException when flat mapping RelationalGroupedDataset created from > DataFrame containing a column created with lit/expr > - > > Key: SPARK-16995 > URL: https://issues.apache.org/jira/browse/SPARK-16995 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cédric Perriard > > A TreeNodeException is thrown when executing the following minimal example in > Spark 2.0. The crucial point is that the column {{q}} is generated with lit/expr. > {code} > import spark.implicits._ > case class test (x: Int, q: Int) > val d = Seq(1).toDF("x") > d.withColumn("q", lit(0)).as[test].groupByKey(_.x).flatMapGroups{case (x, > iter) => List()}.show > d.withColumn("q", expr("0")).as[test].groupByKey(_.x).flatMapGroups{case (x, > iter) => List()}.show > // this works fine > d.withColumn("q", lit(0)).as[test].groupByKey(_.x).count() > {code} > The exception is: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: x#5 > A possible workaround is to write the dataframe to disk before grouping and > mapping.
[jira] [Assigned] (SPARK-16995) TreeNodeException when flat mapping RelationalGroupedDataset created from DataFrame containing a column created with lit/expr
[ https://issues.apache.org/jira/browse/SPARK-16995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16995: Assignee: Apache Spark
[jira] [Commented] (SPARK-16995) TreeNodeException when flat mapping RelationalGroupedDataset created from DataFrame containing a column created with lit/expr
[ https://issues.apache.org/jira/browse/SPARK-16995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421146#comment-15421146 ] Apache Spark commented on SPARK-16995: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/14648
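The workaround named in the SPARK-16995 description, writing the DataFrame to disk before grouping, amounts to breaking the problematic plan with a round trip through a file. A sketch, with the temporary path an assumed placeholder:

```scala
import org.apache.spark.sql.functions.lit
import spark.implicits._

case class test(x: Int, q: Int)

// Same setup as the minimal example above.
val d = Seq(1).toDF("x").withColumn("q", lit(0))

// Workaround: materialise to disk and read back before grouping, so the
// grouped Dataset is built on fresh attributes rather than on the lit(0) plan.
d.write.mode("overwrite").parquet("/tmp/spark-16995-roundtrip")
val reloaded = spark.read.parquet("/tmp/spark-16995-roundtrip").as[test]

reloaded.groupByKey(_.x).flatMapGroups { case (x, iter) => List.empty[Int] }.show()
```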