[jira] [Commented] (SPARK-17071) Fetch Parquet schema within driver-side when there is single file to touch without another Spark job

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422241#comment-15422241
 ] 

Apache Spark commented on SPARK-17071:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14660

> Fetch Parquet schema within driver-side when there is single file to touch 
> without another Spark job
> 
>
> Key: SPARK-17071
> URL: https://issues.apache.org/jira/browse/SPARK-17071
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, it seems Spark always launches a distributed job to fetch and merge 
> Parquet schemas.
> We don't actually have to run another Spark job when there is only a single 
> file to touch (meaning without {{mergeSchema}}); we can just fetch the schema 
> on the driver side, as the ORC data source already does.






[jira] [Assigned] (SPARK-17071) Fetch Parquet schema within driver-side when there is single file to touch without another Spark job

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17071:


Assignee: (was: Apache Spark)

> Fetch Parquet schema within driver-side when there is single file to touch 
> without another Spark job
> 
>
> Key: SPARK-17071
> URL: https://issues.apache.org/jira/browse/SPARK-17071
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, it seems Spark always launches a distributed job to fetch and merge 
> Parquet schemas.
> We don't actually have to run another Spark job when there is only a single 
> file to touch (meaning without {{mergeSchema}}); we can just fetch the schema 
> on the driver side, as the ORC data source already does.






[jira] [Assigned] (SPARK-17071) Fetch Parquet schema within driver-side when there is single file to touch without another Spark job

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17071:


Assignee: Apache Spark

> Fetch Parquet schema within driver-side when there is single file to touch 
> without another Spark job
> 
>
> Key: SPARK-17071
> URL: https://issues.apache.org/jira/browse/SPARK-17071
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> Currently, it seems Spark always launches a distributed job to fetch and merge 
> Parquet schemas.
> We don't actually have to run another Spark job when there is only a single 
> file to touch (meaning without {{mergeSchema}}); we can just fetch the schema 
> on the driver side, as the ORC data source already does.






[jira] [Created] (SPARK-17071) Fetch Parquet schema within driver-side when there is single file to touch without another Spark job

2016-08-15 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-17071:


 Summary: Fetch Parquet schema within driver-side when there is 
single file to touch without another Spark job
 Key: SPARK-17071
 URL: https://issues.apache.org/jira/browse/SPARK-17071
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon


Currently, it seems Spark always launches a distributed job to fetch and merge 
Parquet schemas.

We don't actually have to run another Spark job when there is only a single file 
to touch (meaning without {{mergeSchema}}); we can just fetch the schema on the 
driver side, as the ORC data source already does.
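
As a rough illustration of the idea (not the actual patch in the linked PR), a single 
file's footer can be read on the driver with plain parquet-hadoop, without launching 
any Spark job; the wrapper object below is only for illustration:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.schema.MessageType

object DriverSideParquetSchema {
  // Read only the file footer on the driver; no executors or Spark job are involved.
  def readSchema(path: String, conf: Configuration = new Configuration()): MessageType = {
    val footer = ParquetFileReader.readFooter(conf, new Path(path), ParquetMetadataConverter.NO_FILTER)
    footer.getFileMetaData.getSchema
  }
}
{code}

The real change would still need to convert the Parquet schema to a Spark SQL 
StructType and respect the merge settings; this only shows that the footer read 
itself is a cheap, driver-local operation.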







[jira] [Closed] (SPARK-16760) Pass 'jobId' to Task

2016-08-15 Thread Weiqing Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiqing Yang closed SPARK-16760.

Resolution: Duplicate

The code for this JIRA has been included in the pull request for SPARK-16757.

> Pass 'jobId' to Task
> 
>
> Key: SPARK-16760
> URL: https://issues.apache.org/jira/browse/SPARK-16760
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Weiqing Yang
>
> In the end, the Spark caller context written into the HDFS log will be 
> associated with the task id, stage id, job id, app id, etc. However, Task 
> currently does not know any job information, so the job id will be passed to 
> Task in the patch for this JIRA. That will help Spark users identify tasks, 
> especially if Spark supports multi-tenant environments in the future.






[jira] [Resolved] (SPARK-16916) serde/storage properties should not have limitations

2016-08-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16916.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14506
[https://github.com/apache/spark/pull/14506]

> serde/storage properties should not have limitations
> 
>
> Key: SPARK-16916
> URL: https://issues.apache.org/jira/browse/SPARK-16916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>







[jira] [Assigned] (SPARK-16757) Set up caller context to HDFS

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16757:


Assignee: (was: Apache Spark)

> Set up caller context to HDFS
> -
>
> Key: SPARK-16757
> URL: https://issues.apache.org/jira/browse/SPARK-16757
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Weiqing Yang
>
> In this JIRA, Spark will invoke the Hadoop caller context API to set its 
> caller context in HDFS.
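
For reference, a minimal sketch of what invoking the Hadoop caller context API can 
look like (the API exists in Hadoop 2.8+); the helper name and the context string 
format below are made up for illustration and are not the format defined by the 
actual PR:

{code}
import org.apache.hadoop.ipc.CallerContext

// Hypothetical helper: tag subsequent HDFS calls from this thread with Spark job info,
// so HDFS audit logs can be correlated back to the Spark app/job/stage/task.
def setSparkCallerContext(appId: String, jobId: Int, stageId: Int, taskId: Long): Unit = {
  val context = new CallerContext.Builder(
    s"SPARK_appId_${appId}_jobId_${jobId}_stageId_${stageId}_taskId_${taskId}").build()
  CallerContext.setCurrent(context)
}
{code}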






[jira] [Commented] (SPARK-16757) Set up caller context to HDFS

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422160#comment-15422160
 ] 

Apache Spark commented on SPARK-16757:
--

User 'Sherry302' has created a pull request for this issue:
https://github.com/apache/spark/pull/14659

> Set up caller context to HDFS
> -
>
> Key: SPARK-16757
> URL: https://issues.apache.org/jira/browse/SPARK-16757
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Weiqing Yang
>
> In this JIRA, Spark will invoke the Hadoop caller context API to set its 
> caller context in HDFS.






[jira] [Assigned] (SPARK-16757) Set up caller context to HDFS

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16757:


Assignee: Apache Spark

> Set up caller context to HDFS
> -
>
> Key: SPARK-16757
> URL: https://issues.apache.org/jira/browse/SPARK-16757
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Weiqing Yang
>Assignee: Apache Spark
>
> In this JIRA, Spark will invoke the Hadoop caller context API to set its 
> caller context in HDFS.






[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422141#comment-15422141
 ] 

Apache Spark commented on SPARK-5928:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/14658

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode; it only happens on 
> remote fetches. I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at 

[jira] [Assigned] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5928:
---

Assignee: (was: Apache Spark)

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode; it only happens on 
> remote fetches. I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> 

[jira] [Assigned] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5928:
---

Assignee: Apache Spark

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>Assignee: Apache Spark
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode; it only happens on 
> remote fetches. I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> 

[jira] [Updated] (SPARK-17070) Zookeeper server refused to accept the client (mesos-master)

2016-08-15 Thread Anh Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anh Nguyen updated SPARK-17070:
---
Description: 
I started the ZooKeeper server: ./bin/zkServer.sh start-foreground conf/cnf_zoo.cfg
and this is the ZooKeeper server log:
ZooKeeper JMX enabled by default
Using config: conf/cnf_zoo.cfg
2016-08-16 02:59:20,767 [myid:] - INFO  [main:QuorumPeerConfig@103] - Reading 
configuration from: conf/cnf_zoo.cfg
2016-08-16 02:59:20,778 [myid:] - INFO  [main:DatadirCleanupManager@78] - 
autopurge.snapRetainCount set to 3
2016-08-16 02:59:20,779 [myid:] - INFO  [main:DatadirCleanupManager@79] - 
autopurge.purgeInterval set to 0
2016-08-16 02:59:20,780 [myid:] - INFO  [main:DatadirCleanupManager@101] - 
Purge task is not scheduled.
2016-08-16 02:59:20,781 [myid:] - WARN  [main:QuorumPeerMain@113] - Either no 
config or no quorum defined in config, running  in standalone mode
2016-08-16 02:59:20,803 [myid:] - INFO  [main:QuorumPeerConfig@103] - Reading 
configuration from: conf/cnf_zoo.cfg
2016-08-16 02:59:20,804 [myid:] - INFO  [main:ZooKeeperServerMain@95] - 
Starting server
2016-08-16 02:59:20,826 [myid:] - INFO  [main:Environment@100] - Server 
environment:zookeeper.version=3.4.8--1, built on 02/06/2016 03:18 GMT
2016-08-16 02:59:20,828 [myid:] - INFO  [main:Environment@100] - Server 
environment:host.name=zookeeper
2016-08-16 02:59:20,829 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.version=1.8.0_91
2016-08-16 02:59:20,830 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.vendor=Oracle Corporation
2016-08-16 02:59:20,831 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.home=/usr/lib/jvm/java-8-oracle/jre
2016-08-16 02:59:20,834 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.class.path=/home/vagrant/mesos/zookeeper-3.4.8/bin/../build/classes:/home/vagrant/mesos/zookeeper-3.4.8/bin/../build/lib/*.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/slf4j-log4j12-1.6.1.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/slf4j-api-1.6.1.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/netty-3.7.0.Final.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/log4j-1.2.16.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/jline-0.9.94.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../zookeeper-3.4.8.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../src/java/lib/*.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../conf:
2016-08-16 02:59:20,835 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2016-08-16 02:59:20,835 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.io.tmpdir=/tmp
2016-08-16 02:59:20,836 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.compiler=
2016-08-16 02:59:20,842 [myid:] - INFO  [main:Environment@100] - Server 
environment:os.name=Linux
2016-08-16 02:59:20,842 [myid:] - INFO  [main:Environment@100] - Server 
environment:os.arch=amd64
2016-08-16 02:59:20,844 [myid:] - INFO  [main:Environment@100] - Server 
environment:os.version=3.13.0-92-generic
2016-08-16 02:59:20,845 [myid:] - INFO  [main:Environment@100] - Server 
environment:user.name=vagrant
2016-08-16 02:59:20,846 [myid:] - INFO  [main:Environment@100] - Server 
environment:user.home=/home/vagrant
2016-08-16 02:59:20,846 [myid:] - INFO  [main:Environment@100] - Server 
environment:user.dir=/home/vagrant/mesos/zookeeper-3.4.8
2016-08-16 02:59:20,861 [myid:] - INFO  [main:ZooKeeperServer@787] - tickTime 
set to 2000
2016-08-16 02:59:20,863 [myid:] - INFO  [main:ZooKeeperServer@796] - 
minSessionTimeout set to -1
2016-08-16 02:59:20,864 [myid:] - INFO  [main:ZooKeeperServer@805] - 
maxSessionTimeout set to -1
2016-08-16 02:59:20,883 [myid:] - INFO  [main:NIOServerCnxnFactory@89] - 
binding to port 0.0.0.0/0.0.0.0:2181


But, when I start mesos-master:  mesos-master 
--zk=zk://192.168.33.10:2128/home/vagrant/mesos/zookeeper-3.4.8 --port=5050 
--ip=192.168.33.20 --quorum=1 --work_dir=/home/vagrant/mesos/master1/work 
--log_dir=/home/vagrant/mesos/master1/logs --advertise_ip=192.168.33.20  
--advertise_port=5050 --cluster=mesut-ozil --logging_level=ERROR

This is the mesos-master log: ZooKeeper refused the connection from the mesos-master.

2016-08-16 03:29:09,052:2107(0x7f93e2de4700):ZOO_INFO@log_env@747: Client 
environment:user.name=vagrant
2016-08-16 03:29:09,055:2107(0x7f93e2de4700):ZOO_INFO@log_env@755: Client 
environment:user.home=/home/vagrant
2016-08-16 03:29:09,055:2107(0x7f93e2de4700):ZOO_INFO@log_env@767: Client 
environment:user.dir=/home/vagrant
2016-08-16 03:29:09,055:2107(0x7f93e2de4700):ZOO_INFO@zookeeper_init@800: 
Initiating client connection, host=192.168.33.10:2128 sessionTimeout=1 
watcher=0x7f93ec335340 sessionId=0 sessionPasswd= context=0x7f93c8000930 
flags=0
2016-08-16 03:29:09,052:2107(0x7f93e3de6700):ZOO_INFO@log_env@747: Client 

[jira] [Created] (SPARK-17070) Zookeeper server refused to accept the client (mesos-master)

2016-08-15 Thread Anh Nguyen (JIRA)
Anh Nguyen created SPARK-17070:
--

 Summary: Zookeeper server refused to accept the client 
(mesos-master)
 Key: SPARK-17070
 URL: https://issues.apache.org/jira/browse/SPARK-17070
 Project: Spark
  Issue Type: Bug
Reporter: Anh Nguyen


I started the ZooKeeper server: ./bin/zkServer.sh start-foreground conf/cnf_zoo.cfg
and this is the ZooKeeper server log:
ZooKeeper JMX enabled by default
Using config: conf/cnf_zoo.cfg
2016-08-16 02:59:20,767 [myid:] - INFO  [main:QuorumPeerConfig@103] - Reading 
configuration from: conf/cnf_zoo.cfg
2016-08-16 02:59:20,778 [myid:] - INFO  [main:DatadirCleanupManager@78] - 
autopurge.snapRetainCount set to 3
2016-08-16 02:59:20,779 [myid:] - INFO  [main:DatadirCleanupManager@79] - 
autopurge.purgeInterval set to 0
2016-08-16 02:59:20,780 [myid:] - INFO  [main:DatadirCleanupManager@101] - 
Purge task is not scheduled.
2016-08-16 02:59:20,781 [myid:] - WARN  [main:QuorumPeerMain@113] - Either no 
config or no quorum defined in config, running  in standalone mode
2016-08-16 02:59:20,803 [myid:] - INFO  [main:QuorumPeerConfig@103] - Reading 
configuration from: conf/cnf_zoo.cfg
2016-08-16 02:59:20,804 [myid:] - INFO  [main:ZooKeeperServerMain@95] - 
Starting server
2016-08-16 02:59:20,826 [myid:] - INFO  [main:Environment@100] - Server 
environment:zookeeper.version=3.4.8--1, built on 02/06/2016 03:18 GMT
2016-08-16 02:59:20,828 [myid:] - INFO  [main:Environment@100] - Server 
environment:host.name=zookeeper
2016-08-16 02:59:20,829 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.version=1.8.0_91
2016-08-16 02:59:20,830 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.vendor=Oracle Corporation
2016-08-16 02:59:20,831 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.home=/usr/lib/jvm/java-8-oracle/jre
2016-08-16 02:59:20,834 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.class.path=/home/vagrant/mesos/zookeeper-3.4.8/bin/../build/classes:/home/vagrant/mesos/zookeeper-3.4.8/bin/../build/lib/*.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/slf4j-log4j12-1.6.1.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/slf4j-api-1.6.1.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/netty-3.7.0.Final.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/log4j-1.2.16.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../lib/jline-0.9.94.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../zookeeper-3.4.8.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../src/java/lib/*.jar:/home/vagrant/mesos/zookeeper-3.4.8/bin/../conf:
2016-08-16 02:59:20,835 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2016-08-16 02:59:20,835 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.io.tmpdir=/tmp
2016-08-16 02:59:20,836 [myid:] - INFO  [main:Environment@100] - Server 
environment:java.compiler=
2016-08-16 02:59:20,842 [myid:] - INFO  [main:Environment@100] - Server 
environment:os.name=Linux
2016-08-16 02:59:20,842 [myid:] - INFO  [main:Environment@100] - Server 
environment:os.arch=amd64
2016-08-16 02:59:20,844 [myid:] - INFO  [main:Environment@100] - Server 
environment:os.version=3.13.0-92-generic
2016-08-16 02:59:20,845 [myid:] - INFO  [main:Environment@100] - Server 
environment:user.name=vagrant
2016-08-16 02:59:20,846 [myid:] - INFO  [main:Environment@100] - Server 
environment:user.home=/home/vagrant
2016-08-16 02:59:20,846 [myid:] - INFO  [main:Environment@100] - Server 
environment:user.dir=/home/vagrant/mesos/zookeeper-3.4.8
2016-08-16 02:59:20,861 [myid:] - INFO  [main:ZooKeeperServer@787] - tickTime 
set to 2000
2016-08-16 02:59:20,863 [myid:] - INFO  [main:ZooKeeperServer@796] - 
minSessionTimeout set to -1
2016-08-16 02:59:20,864 [myid:] - INFO  [main:ZooKeeperServer@805] - 
maxSessionTimeout set to -1
2016-08-16 02:59:20,883 [myid:] - INFO  [main:NIOServerCnxnFactory@89] - 
binding to port 0.0.0.0/0.0.0.0:2181


But, when I start mesos-master:  mesos-master 
--zk=zk://192.168.33.10:2128/home/vagrant/mesos/zookeeper-3.4.8 --port=5050 
--ip=192.168.33.20 --quorum=1 --work_dir=/home/vagrant/mesos/master1/work 
--log_dir=/home/vagrant/mesos/master1/logs --advertise_ip=192.168.33.20  
--advertise_port=5050 --cluster=mesut-ozil --logging_level=ERROR
2016-08-16 03:29:09,053:2107(0x7f93e25e3700):ZOO_INFO@zookeeper_init@800: 
Initiating client connection, host=192.168.33.10:2128 sessionTimeout=1 
watcher=0x7f93ec335340 sessionId=0 sessionPasswd= context=0x7f93d80047c0 
flags=0
2016-08-16 03:29:09,051:2107(0x7f93e35e5700):ZOO_INFO@log_env@747: Client 
environment:user.name=vagrant
2016-08-16 03:29:09,054:2107(0x7f93e35e5700):ZOO_INFO@log_env@755: Client 
environment:user.home=/home/vagrant
2016-08-16 03:29:09,054:2107(0x7f93e35e5700):ZOO_INFO@log_env@767: Client 
environment:user.dir=/home/vagrant
2016-08-16 

[jira] [Commented] (SPARK-16990) Define the data structure to hold the statistics for CBO

2016-08-15 Thread Ron Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422105#comment-15422105
 ] 

Ron Hu commented on SPARK-16990:


We will follow the agreed design spec and add the following information to the 
Statistics class:

-   totalSize: total uncompressed output data size from a LogicalPlan node.
-   rowCount: number of output rows from a LogicalPlan node.
-   fileNumber: number of files (or number of HDFS data blocks); currently 
only useful when the LogicalPlan is a relation.
-   basicStats: basic column statistics, i.e., max, min, number of nulls, 
number of distinct values, max column length, average column length.
-   histograms: histograms of columns, i.e., equi-width histograms (for 
numeric and string types) and equi-height histograms (only for numeric types).
-   histogramMins: min values maintained for numeric columns.
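
A rough Scala sketch of such a structure, purely for illustration (field names and 
types are placeholders, not the agreed design):

{code}
// Placeholder per-column statistics for the sketch.
case class ColumnStat(
    max: Option[Any],
    min: Option[Any],
    numNulls: Long,
    distinctCount: Long,
    maxColLen: Long,
    avgColLen: Double)

// Placeholder histogram type: (bucket lower bound, bucket upper bound, row count).
case class Histogram(buckets: Seq[(Any, Any, Long)])

// Illustrative shape of the extended Statistics node.
case class Statistics(
    totalSize: BigInt,                     // total uncompressed output size of the plan node
    rowCount: Option[BigInt],              // number of output rows, if known
    fileNumber: Option[Long],              // number of files / HDFS blocks (relations only)
    basicStats: Map[String, ColumnStat],   // basic column statistics keyed by column name
    histograms: Map[String, Histogram],    // equi-width / equi-height histograms
    histogramMins: Map[String, Any])       // min values for numeric columns
{code}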



> Define the data structure to hold the statistics for CBO
> 
>
> Key: SPARK-16990
> URL: https://issues.apache.org/jira/browse/SPARK-16990
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Ron Hu
>







[jira] [Commented] (SPARK-17054) SparkR can not run in yarn-cluster mode on mac os

2016-08-15 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422058#comment-15422058
 ] 

Jeff Zhang commented on SPARK-17054:


I have a single-node Hadoop cluster on my laptop, and I run the R script in 
yarn-cluster mode.

> SparkR can not run in yarn-cluster mode on mac os
> -
>
> Key: SPARK-17054
> URL: https://issues.apache.org/jira/browse/SPARK-17054
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> This is because it downloads SparkR to the wrong place.
> {noformat}
> Warning message:
> 'sparkR.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated")
> Spark not found in SPARK_HOME:  .
> To search in the cache directory. Installation will start if not found.
> Mirror site not provided.
> Looking for site suggested from apache website...
> Preferred mirror site found: http://apache.mirror.cdnetworks.com/spark
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from:
> - 
> http://apache.mirror.cdnetworks.com/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
> Fetch failed from http://apache.mirror.cdnetworks.com/spark
>  open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', 
> reason 'No such file or directory'>
> To use backup site...
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from:
> - 
> http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
> Fetch failed from http://www-us.apache.org/dist/spark
>  open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', 
> reason 'No such file or directory'>
> Error in robust_download_tar(mirrorUrl, version, hadoopVersion, packageName,  
> :
>   Unable to download Spark spark-2.0.0 for Hadoop 2.7. Please check network 
> connection, Hadoop version, or provide other mirror sites.
> Calls: sparkRSQL.init ... sparkR.session -> install.spark -> 
> robust_download_tar
> In addition: Warning messages:
> 1: 'sparkRSQL.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated")
> 2: In dir.create(localDir, recursive = TRUE) :
>   cannot create dir '/home//Library', reason 'Operation not supported'
> Execution halted
> {noformat}






[jira] [Commented] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100

2016-08-15 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422051#comment-15422051
 ] 

Liwei Lin commented on SPARK-17061:
---

This can be reproduced against the master branch; let me look into this. Thanks.

> Incorrect results returned following a join of two datasets and a map step 
> where total number of columns >100
> -
>
> Key: SPARK-17061
> URL: https://issues.apache.org/jira/browse/SPARK-17061
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Jamie Hutton
>Priority: Critical
>
> We have hit a consistent bug where we have a dataset with more than 100 
> columns. I am raising this as a blocker because Spark is returning the WRONG 
> results rather than erroring, leading to data integrity issues.
> I have put together the following test case which will show the issue (it 
> will run in spark-shell). In this example I am joining a dataset with lots of 
> fields onto another dataset. 
> The join works fine, and if you show the dataset you will get the expected 
> result. However, if you run a map step over the dataset, you end up with a 
> strange error where the sequence in the right dataset now only 
> contains the last value.
> Whilst this test may seem a rather contrived example, what we are doing here 
> is a very standard analytical pattern. My original code was designed to:
>  - take a dataset of child records
>  - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children])
>  - join the children onto the parent by parentID: giving 
> ((Parent), (ParentID, Seq[Children]))
>  - map over the result to give a tuple of (Parent, Seq[Children])
> Notes:
> - The issue is resolved by having fewer fields: as soon as we go <= 100, the 
> integrity issue goes away. Try removing one of the fields from BigCaseClass 
> below.
> - The issue will arise based on the total number of fields in the resulting 
> dataset. Below I have a small case class and a big case class, but two case 
> classes of 50 variables each would give the same issue.
> - The issue occurs when the case class being joined on (on the right) has a 
> case class type. It doesn't occur if you have a Seq[String].
> - If I go back to an RDD for the map step after the join, I can work around 
> the issue, but I lose all the benefits of Datasets.
> Scala code test case:
>   case class Name(name: String)
>   case class SmallCaseClass (joinkey: Integer, names: Seq[Name])
>   case class BigCaseClass  (field1: Integer,field2: Integer,field3: 
> Integer,field4: Integer,field5: Integer,field6: Integer,field7: 
> Integer,field8: Integer,field9: Integer,field10: Integer,field11: 
> Integer,field12: Integer,field13: Integer,field14: Integer,field15: 
> Integer,field16: Integer,field17: Integer,field18: Integer,field19: 
> Integer,field20: Integer,field21: Integer,field22: Integer,field23: 
> Integer,field24: Integer,field25: Integer,field26: Integer,field27: 
> Integer,field28: Integer,field29: Integer,field30: Integer,field31: 
> Integer,field32: Integer,field33: Integer,field34: Integer,field35: 
> Integer,field36: Integer,field37: Integer,field38: Integer,field39: 
> Integer,field40: Integer,field41: Integer,field42: Integer,field43: 
> Integer,field44: Integer,field45: Integer,field46: Integer,field47: 
> Integer,field48: Integer,field49: Integer,field50: Integer,field51: 
> Integer,field52: Integer,field53: Integer,field54: Integer,field55: 
> Integer,field56: Integer,field57: Integer,field58: Integer,field59: 
> Integer,field60: Integer,field61: Integer,field62: Integer,field63: 
> Integer,field64: Integer,field65: Integer,field66: Integer,field67: 
> Integer,field68: Integer,field69: Integer,field70: Integer,field71: 
> Integer,field72: Integer,field73: Integer,field74: Integer,field75: 
> Integer,field76: Integer,field77: Integer,field78: Integer,field79: 
> Integer,field80: Integer,field81: Integer,field82: Integer,field83: 
> Integer,field84: Integer,field85: Integer,field86: Integer,field87: 
> Integer,field88: Integer,field89: Integer,field90: Integer,field91: 
> Integer,field92: Integer,field93: Integer,field94: Integer,field95: 
> Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer)
>   
> val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
> 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
> 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 
> 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 
> 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
> 91, 92, 93, 94, 95, 96, 97, 98, 99))
> 
> val smallCC=Seq(SmallCaseClass(1,Seq(

[jira] [Commented] (SPARK-17054) SparkR can not run in yarn-cluster mode on mac os

2016-08-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422020#comment-15422020
 ] 

Shivaram Venkataraman commented on SPARK-17054:
---

And is this connecting to a remote YARN master, or is there a way to run a YARN 
cluster locally on a laptop?

> SparkR can not run in yarn-cluster mode on mac os
> -
>
> Key: SPARK-17054
> URL: https://issues.apache.org/jira/browse/SPARK-17054
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> This is because it downloads SparkR to the wrong place.
> {noformat}
> Warning message:
> 'sparkR.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated")
> Spark not found in SPARK_HOME:  .
> To search in the cache directory. Installation will start if not found.
> Mirror site not provided.
> Looking for site suggested from apache website...
> Preferred mirror site found: http://apache.mirror.cdnetworks.com/spark
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from:
> - 
> http://apache.mirror.cdnetworks.com/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
> Fetch failed from http://apache.mirror.cdnetworks.com/spark
>  open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', 
> reason 'No such file or directory'>
> To use backup site...
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from:
> - 
> http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
> Fetch failed from http://www-us.apache.org/dist/spark
>  open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', 
> reason 'No such file or directory'>
> Error in robust_download_tar(mirrorUrl, version, hadoopVersion, packageName,  
> :
>   Unable to download Spark spark-2.0.0 for Hadoop 2.7. Please check network 
> connection, Hadoop version, or provide other mirror sites.
> Calls: sparkRSQL.init ... sparkR.session -> install.spark -> 
> robust_download_tar
> In addition: Warning messages:
> 1: 'sparkRSQL.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated")
> 2: In dir.create(localDir, recursive = TRUE) :
>   cannot create dir '/home//Library', reason 'Operation not supported'
> Execution halted
> {noformat}






[jira] [Assigned] (SPARK-17068) Retain view visibility information through out Analysis

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17068:


Assignee: Herman van Hovell  (was: Apache Spark)

> Retain view visibility information through out Analysis
> ---
>
> Key: SPARK-17068
> URL: https://issues.apache.org/jira/browse/SPARK-17068
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.1.0
>
>
> Views in Spark SQL are replaced by their backing {{LogicalPlan}} during 
> analysis. This can be confusing when dealing with and debugging large 
> {{LogicalPlan}}s. I propose to add an identifier to the subquery alias in 
> order to improve this.
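
A small spark-shell illustration of the confusion being described (assuming a Spark 
2.x session named {{spark}}; the view and table names are made up):

{code}
// Create a temp table and a view over it, then look at the analyzed plan.
spark.range(10).createOrReplaceTempView("t")
spark.sql("CREATE OR REPLACE TEMPORARY VIEW v AS SELECT id FROM t")
spark.sql("SELECT * FROM v").explain(true)
// The analyzed plan shows the view's backing plan; the proposal is to tag the
// subquery alias so it is clear the subtree came from view `v`.
{code}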






[jira] [Commented] (SPARK-17068) Retain view visibility information through out Analysis

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421917#comment-15421917
 ] 

Apache Spark commented on SPARK-17068:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/14657

> Retain view visibility information through out Analysis
> ---
>
> Key: SPARK-17068
> URL: https://issues.apache.org/jira/browse/SPARK-17068
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.1.0
>
>
> Views in Spark SQL are replaced by their backing {{LogicalPlan}} during 
> analysis. This can be confusing when dealing with and debugging large 
> {{LogicalPlan}}s. I propose to add an identifier to the subquery alias in 
> order to improve this.






[jira] [Assigned] (SPARK-17068) Retain view visibility information through out Analysis

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17068:


Assignee: Apache Spark  (was: Herman van Hovell)

> Retain view visibility information through out Analysis
> ---
>
> Key: SPARK-17068
> URL: https://issues.apache.org/jira/browse/SPARK-17068
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
> Fix For: 2.1.0
>
>
> Views in Spark SQL are replaced by their backing {{LogicalPlan}} during 
> analysis. This can be confusing when dealing with and debugging large 
> {{LogicalPlan}}s. I propose to add an identifier to the subquery alias in 
> order to improve this.






[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend

2016-08-15 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421878#comment-15421878
 ] 

Jeff Zhang commented on SPARK-16578:


I think this feature could also be applied to PySpark.

> Configurable hostname for RBackend
> --
>
> Key: SPARK-16578
> URL: https://issues.apache.org/jira/browse/SPARK-16578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> One of the requirements that comes up with SparkR being a standalone package 
> is that users can now install just the R package on the client side and 
> connect to a remote machine that runs the RBackend class.
> We should check whether we can support this mode of execution and what the 
> pros and cons of it are.






[jira] [Assigned] (SPARK-17069) Expose spark.range() as table-valued function in SQL

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17069:


Assignee: (was: Apache Spark)

> Expose spark.range() as table-valued function in SQL
> 
>
> Key: SPARK-17069
> URL: https://issues.apache.org/jira/browse/SPARK-17069
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Eric Liang
>Priority: Minor
>
> The idea here is to create the spark.range( x ) equivalent in SQL, so we can 
> do something like
> {noformat}
> select count(*) from range(1)
> {noformat}
> This would be useful for SQL-only testing and benchmarks.






[jira] [Commented] (SPARK-17069) Expose spark.range() as table-valued function in SQL

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421853#comment-15421853
 ] 

Apache Spark commented on SPARK-17069:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/14656

> Expose spark.range() as table-valued function in SQL
> 
>
> Key: SPARK-17069
> URL: https://issues.apache.org/jira/browse/SPARK-17069
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Eric Liang
>Priority: Minor
>
> The idea here is to create the spark.range( x ) equivalent in SQL, so we can 
> do something like
> {noformat}
> select count(*) from range(1)
> {noformat}
> This would be useful for SQL-only testing and benchmarks.






[jira] [Assigned] (SPARK-17069) Expose spark.range() as table-valued function in SQL

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17069:


Assignee: Apache Spark

> Expose spark.range() as table-valued function in SQL
> 
>
> Key: SPARK-17069
> URL: https://issues.apache.org/jira/browse/SPARK-17069
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Minor
>
> The idea here is to create the spark.range( x ) equivalent in SQL, so we can 
> do something like
> {noformat}
> select count(*) from range(1)
> {noformat}
> This would be useful for SQL-only testing and benchmarks.






[jira] [Commented] (SPARK-17054) SparkR can not run in yarn-cluster mode on mac os

2016-08-15 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421851#comment-15421851
 ] 

Jeff Zhang commented on SPARK-17054:


Here's the command I run.
{code}
bin/spark-submit --master yarn-cluster --conf spark.r.command=/usr/local/bin/Rscript ~/github/spark_examples/spark.R
{code}

> SparkR can not run in yarn-cluster mode on mac os
> -
>
> Key: SPARK-17054
> URL: https://issues.apache.org/jira/browse/SPARK-17054
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> This is because it downloads SparkR to the wrong place.
> {noformat}
> Warning message:
> 'sparkR.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated")
> Spark not found in SPARK_HOME:  .
> To search in the cache directory. Installation will start if not found.
> Mirror site not provided.
> Looking for site suggested from apache website...
> Preferred mirror site found: http://apache.mirror.cdnetworks.com/spark
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from:
> - 
> http://apache.mirror.cdnetworks.com/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
> Fetch failed from http://apache.mirror.cdnetworks.com/spark
>  open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', 
> reason 'No such file or directory'>
> To use backup site...
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from:
> - 
> http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
> Fetch failed from http://www-us.apache.org/dist/spark
>  open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', 
> reason 'No such file or directory'>
> Error in robust_download_tar(mirrorUrl, version, hadoopVersion, packageName,  
> :
>   Unable to download Spark spark-2.0.0 for Hadoop 2.7. Please check network 
> connection, Hadoop version, or provide other mirror sites.
> Calls: sparkRSQL.init ... sparkR.session -> install.spark -> 
> robust_download_tar
> In addition: Warning messages:
> 1: 'sparkRSQL.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated")
> 2: In dir.create(localDir, recursive = TRUE) :
>   cannot create dir '/home//Library', reason 'Operation not supported'
> Execution halted
> {noformat}






[jira] [Commented] (SPARK-17054) SparkR can not run in yarn-cluster mode on mac os

2016-08-15 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421846#comment-15421846
 ] 

Jeff Zhang commented on SPARK-17054:


Do you run it in yarn-cluster mode?

> SparkR can not run in yarn-cluster mode on mac os
> -
>
> Key: SPARK-17054
> URL: https://issues.apache.org/jira/browse/SPARK-17054
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> This is because it downloads SparkR to the wrong place.
> {noformat}
> Warning message:
> 'sparkR.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated")
> Spark not found in SPARK_HOME:  .
> To search in the cache directory. Installation will start if not found.
> Mirror site not provided.
> Looking for site suggested from apache website...
> Preferred mirror site found: http://apache.mirror.cdnetworks.com/spark
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from:
> - 
> http://apache.mirror.cdnetworks.com/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
> Fetch failed from http://apache.mirror.cdnetworks.com/spark
>  open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', 
> reason 'No such file or directory'>
> To use backup site...
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from:
> - 
> http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
> Fetch failed from http://www-us.apache.org/dist/spark
>  open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', 
> reason 'No such file or directory'>
> Error in robust_download_tar(mirrorUrl, version, hadoopVersion, packageName,  
> :
>   Unable to download Spark spark-2.0.0 for Hadoop 2.7. Please check network 
> connection, Hadoop version, or provide other mirror sites.
> Calls: sparkRSQL.init ... sparkR.session -> install.spark -> 
> robust_download_tar
> In addition: Warning messages:
> 1: 'sparkRSQL.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated")
> 2: In dir.create(localDir, recursive = TRUE) :
>   cannot create dir '/home//Library', reason 'Operation not supported'
> Execution halted
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17069) Expose spark.range() as table-valued function in SQL

2016-08-15 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17069:
--

 Summary: Expose spark.range() as table-valued function in SQL
 Key: SPARK-17069
 URL: https://issues.apache.org/jira/browse/SPARK-17069
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Eric Liang
Priority: Minor


The idea here is to create the equivalent of spark.range(x) in SQL, so that we 
can do something like

{noformat}
select count(*) from range(1)
{noformat}

This would be useful for SQL-only testing and benchmarks.
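
For context, a rough PySpark sketch: {{spark.range()}} already exists on the 
DataFrame side, while the SQL table-valued form is what this ticket proposes, so 
it only appears as a comment below.

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("range-tvf-sketch").getOrCreate()

# Existing DataFrame API: a single-column DataFrame of longs from 0 to 999.
print(spark.range(1000).count())  # 1000

# Proposed SQL equivalent (not available until this ticket is implemented):
# spark.sql("select count(*) from range(1000)").show()
{code}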



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17065) Improve the error message when encountering an incompatible DataSourceRegister

2016-08-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17065.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14651
[https://github.com/apache/spark/pull/14651]

> Improve the error message when encountering an incompatible DataSourceRegister
> --
>
> Key: SPARK-17065
> URL: https://issues.apache.org/jira/browse/SPARK-17065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.1, 2.1.0
>
>
> When encountering an incompatible DataSourceRegister, it's better to add 
> instructions to remove or upgrade it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16158) Support pluggable dynamic allocation heuristics

2016-08-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421803#comment-15421803
 ] 

Thomas Graves commented on SPARK-16158:
---

Seems like an OK idea to me, but you have to make sure to do it right. For 
instance, I would expect different policies to possibly have different configs. 
We need to think about how those work, how they are named, how they are 
documented, etc. Make sure things aren't relying on implicit behavior, and make 
sure the communication with the resource managers (Mesos/YARN) is well defined.

I also think the current policy could be improved a lot. Is there a reason not 
to just improve that? Obviously, having the user define the policy for 
different jobs makes things more complex for the user, so it would be nice to 
have at least one generic policy that works OK in many situations, while still 
allowing highly optimized ones.

One thing you mention is executors between stages. The min executor setting 
should help with that, but obviously you could have a widely varying number of 
tasks between different stages, which makes that not ideal. I could see some 
other policy having a different config to handle this.
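
For reference, a minimal PySpark sketch of the existing (non-pluggable) dynamic 
allocation settings being discussed; the values are only examples, and a 
pluggable policy would presumably add its own configs alongside these.

{code}
from pyspark import SparkConf, SparkContext

# Existing dynamic allocation knobs; the min/max bounds and idle timeout are
# the main levers the current heuristic works within.
conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")            # required by dynamic allocation
        .set("spark.dynamicAllocation.minExecutors", "2")         # floor kept between stages
        .set("spark.dynamicAllocation.maxExecutors", "50")
        .set("spark.dynamicAllocation.executorIdleTimeout", "60s"))
sc = SparkContext(conf=conf)
{code}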

> Support pluggable dynamic allocation heuristics
> ---
>
> Key: SPARK-16158
> URL: https://issues.apache.org/jira/browse/SPARK-16158
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Nezih Yigitbasi
>
> It would be nice if Spark supports plugging in custom dynamic allocation 
> heuristics. This feature would be useful for experimenting with new 
> heuristics and also useful for plugging in different heuristics per job etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17038) StreamingSource reports metrics for lastCompletedBatch instead of lastReceivedBatch

2016-08-15 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421789#comment-15421789
 ] 

Xin Ren commented on SPARK-17038:
-

Hi [~ozzieba], if you don't have time, I can just submit a quick patch for this :)

> StreamingSource reports metrics for lastCompletedBatch instead of 
> lastReceivedBatch
> ---
>
> Key: SPARK-17038
> URL: https://issues.apache.org/jira/browse/SPARK-17038
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>  Labels: metrics
>
> StreamingSource's lastReceivedBatch_submissionTime, 
> lastReceivedBatch_processingTimeStart, and 
> lastReceivedBatch_processingTimeEnd all use data from lastCompletedBatch 
> instead of lastReceivedBatch. In particular, this makes it impossible to 
> match lastReceivedBatch_records with a batchID/submission time.
> This is apparent when looking at StreamingSource.scala, lines 89-94.
> I would guess that just replacing Completed->Received in those lines would 
> fix the issue, but I haven't tested it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421783#comment-15421783
 ] 

Apache Spark commented on SPARK-16964:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/14654

> Remove private[sql] and private[spark] from sql.execution package
> -
>
> Key: SPARK-16964
> URL: https://issues.apache.org/jira/browse/SPARK-16964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> The execution package is meant to be internal, and as a result it does not 
> make sense to mark things as private[sql] or private[spark]. It simply makes 
> debugging harder when Spark developers need to inspect the plans at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16669) Partition pruning for metastore relation size estimates for better join selection.

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421782#comment-15421782
 ] 

Apache Spark commented on SPARK-16669:
--

User 'Parth-Brahmbhatt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14655

> Partition pruning for metastore relation size estimates for better join 
> selection.
> --
>
> Key: SPARK-16669
> URL: https://issues.apache.org/jira/browse/SPARK-16669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Parth Brahmbhatt
>
>  Currently the metastore statistics return the size of the entire table, which 
> results in the join selection strategy not using broadcast joins even when only 
> a single partition from a large table is selected. We should optimize the 
> statistics calculation at the table level to apply partition pruning and only 
> get the size of the partitions that are valid for the query.
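
To illustrate the current behavior (not the proposed fix), a rough PySpark 
sketch; the table names {{facts}} and {{dims}}, the partition column {{ds}}, and 
the join column {{key}} are assumptions made for the example.

{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The planner compares the *whole-table* statistics of `facts` against
# spark.sql.autoBroadcastJoinThreshold, so the pruned slice below is not
# broadcast automatically even if the selected partition is tiny.
facts = spark.table("facts")   # hypothetical large, partitioned table
dims = spark.table("dims")     # hypothetical table joined against it

single_day = facts.where("ds = '2016-08-15'")  # partition pruning on `ds`

# Current workaround: hint that the pruned slice is small enough to broadcast.
joined = dims.join(broadcast(single_day), "key")
{code}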



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17068) Retain view visibility information through out Analysis

2016-08-15 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-17068:
-

 Summary: Retain view visibility information through out Analysis
 Key: SPARK-17068
 URL: https://issues.apache.org/jira/browse/SPARK-17068
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Herman van Hovell
Assignee: Herman van Hovell
 Fix For: 2.1.0


Views in Spark SQL are replaced by their backing {{LogicalPlan}} during 
analysis. This can be confusing when dealing with and debugging large 
{{LogicalPlan}}s. I propose to add an identifier to the subquery alias in order 
to improve this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17066) dateFormat should be used when writing dataframes as csv files

2016-08-15 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421758#comment-15421758
 ] 

Barry Becker commented on SPARK-17066:
--

Yes, I think it's fair to say this is a subset of SPARK-16216. This can probably 
be duped to that one.

> dateFormat should be used when writing dataframes as csv files
> --
>
> Key: SPARK-17066
> URL: https://issues.apache.org/jira/browse/SPARK-17066
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I noticed this when running tests after pulling and building @lw-lin 's PR 
> (https://github.com/apache/spark/pull/14118). I don't think it is anything 
> wrong with his PR, just that the fix that was made to spark-csv for this 
> issue was never moved to spark 2.x when databrick's spark-csv was merged into 
> spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 
> was fixed in spark-csv after that merge.
> The problem is that if I try to write a dataframe that contains a date column 
> out to a csv using something like this
> repartitionDf.write.format("csv") //.format(DATABRICKS_CSV)
> .option("delimiter", "\t")
> .option("header", "false")
> .option("nullValue", "?")
> .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
> .option("escape", "\\")   
> .save(tempFileName)
> Then my unit test (which passed under spark 1.6.2) fails using the spark 
> 2.1.0 snapshot build that I made today. The dataframe contained 3 values in a 
> date column.
> Expected "[2012-01-03T09:12:00
> ?
> 2015-02-23T18:00:]00", 
> but got 
> "[132561072000
> ?
> 1424743200]00"
> This means that while the null value is being correctly exported, the 
> specified dateFormat is not being used to format the date. Instead it looks 
> like number of seconds from epoch is being used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421753#comment-15421753
 ] 

Apache Spark commented on SPARK-10931:
--

User 'evanyc15' has created a pull request for this issue:
https://github.com/apache/spark/pull/14653

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421746#comment-15421746
 ] 

Apache Spark commented on SPARK-16964:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/14652

> Remove private[sql] and private[spark] from sql.execution package
> -
>
> Key: SPARK-16964
> URL: https://issues.apache.org/jira/browse/SPARK-16964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> The execution package is meant to be internal, and as a result it does not 
> make sense to mark things as private[sql] or private[spark]. It simply makes 
> debugging harder when Spark developers need to inspect the plans at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421731#comment-15421731
 ] 

Sean Owen commented on SPARK-16320:
---

I think that was the problem being solved there though, right? There's also an 
arguably bigger downside to setting Xms, and only an indirect way to control 
the region size if that's what's desired. I'd favor pointing out 
G1HeapRegionSize as a possible tuning parameter to increase. That is, even to 
solve the problem you identify, Xms isn't the best solution.

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421728#comment-15421728
 ] 

Maciej Bryński edited comment on SPARK-16320 at 8/15/16 9:42 PM:
-

Maybe we can change SPARK-12384 a little bit and set default value to -Xms (as 
in Spark 1.6) ?
(user will still have an option to change it)


was (Author: maver1ck):
Maybe we can change SPARK-12384 a little bit and set default value to -Xms (as 
in Spark 1.6) ?
(user still will have an option to change it)

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421728#comment-15421728
 ] 

Maciej Bryński commented on SPARK-16320:


Maybe we can change SPARK-12384 a little bit and set default value to -Xms (as 
in Spark 1.6) ?
(user still will have an option to change it)

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421719#comment-15421719
 ] 

Sean Owen commented on SPARK-16320:
---

I see. I wonder if this deserves a bit of documentation in the tuning section? 
You'd be welcome to collect some of the wisdom here into a note in the docs.

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17066) dateFormat should be used when writing dataframes as csv files

2016-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421723#comment-15421723
 ] 

Sean Owen commented on SPARK-17066:
---

I think this is a subset of https://issues.apache.org/jira/browse/SPARK-16216 ?

> dateFormat should be used when writing dataframes as csv files
> --
>
> Key: SPARK-17066
> URL: https://issues.apache.org/jira/browse/SPARK-17066
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I noticed this when running tests after pulling and building @lw-lin 's PR 
> (https://github.com/apache/spark/pull/14118). I don't think it is anything 
> wrong with his PR, just that the fix that was made to spark-csv for this 
> issue was never moved to spark 2.x when databrick's spark-csv was merged into 
> spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 
> was fixed in spark-csv after that merge.
> The problem is that if I try to write a dataframe that contains a date column 
> out to a csv using something like this
> repartitionDf.write.format("csv") //.format(DATABRICKS_CSV)
> .option("delimiter", "\t")
> .option("header", "false")
> .option("nullValue", "?")
> .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
> .option("escape", "\\")   
> .save(tempFileName)
> Then my unit test (which passed under spark 1.6.2) fails using the spark 
> 2.1.0 snapshot build that I made today. The dataframe contained 3 values in a 
> date column.
> Expected "[2012-01-03T09:12:00
> ?
> 2015-02-23T18:00:]00", 
> but got 
> "[132561072000
> ?
> 1424743200]00"
> This means that while the null value is being correctly exported, the 
> specified dateFormat is not being used to format the date. Instead it looks 
> like number of seconds from epoch is being used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17066) dateFormat should be used when writing dataframes as csv files

2016-08-15 Thread Barry Becker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barry Becker updated SPARK-17066:
-
Description: 
I noticed this when running tests after pulling and building @lw-lin 's PR 
(https://github.com/apache/spark/pull/14118). I don't think it is anything 
wrong with his PR, just that the fix that was made to spark-csv for this issue 
was never moved to spark 2.x when databrick's spark-csv was merged into spark 2 
back in January. https://github.com/databricks/spark-csv/issues/308 was fixed 
in spark-csv after that merge.

The problem is that if I try to write a dataframe that contains a date column 
out to a csv using something like this

repartitionDf.write.format("csv") //.format(DATABRICKS_CSV)
.option("delimiter", "\t")
.option("header", "false")
.option("nullValue", "?")
.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
.option("escape", "\\")   
.save(tempFileName)

Then my unit test (which passed under spark 1.6.2) fails using the spark 2.1.0 
snapshot build that I made today. The dataframe contained 3 values in a date 
column.

Expected "[2012-01-03T09:12:00
?
2015-02-23T18:00:]00", 
but got 
"[132561072000
?
1424743200]00"

This means that while the null value is being correctly exported, the specified 
dateFormat is not being used to format the date. Instead it looks like number 
of seconds from epoch is being used.

  was:
I noticed this when running tests after pulling and building @lw-lin 's PR 
(https://github.com/apache/spark/pull/14118). I don't think it is anything 
wrong with his PR, just that the fix that was made to spark-csv for this issue 
was never moved to spark 2.x when databrick's spark-csv was merged into spark 2 
back in January. https://github.com/databricks/spark-csv/issues/308 was fixed 
in spark-csv after that merge.

The problem is that if I try to write a dataframe that contains a date column 
out to a csv using something like this

repartitionDf.write.format("csv") //.format(DATABRICKS_CSV)
.option("delimiter", "\t")
.option("header", "false")
.option("nullValue", "?")
.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
.option("escape", "\\")   
.save(tempFileName)

Then my unit test (which passed under spark 1.6.2 fails using the spark 2.1.0 
snapshot build that I made today.

Expected "[2012-01-03T09:12:00
?
2015-02-23T18:00:]00", 
but got 
"[132561072000
?
1424743200]00"

This means that while the null value is being correctly exported, the specified 
dateFormat is not being used to format the date. Instead it looks like number 
of seconds from epoch is being used.


> dateFormat should be used when writing dataframes as csv files
> --
>
> Key: SPARK-17066
> URL: https://issues.apache.org/jira/browse/SPARK-17066
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I noticed this when running tests after pulling and building @lw-lin 's PR 
> (https://github.com/apache/spark/pull/14118). I don't think it is anything 
> wrong with his PR, just that the fix that was made to spark-csv for this 
> issue was never moved to spark 2.x when databrick's spark-csv was merged into 
> spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 
> was fixed in spark-csv after that merge.
> The problem is that if I try to write a dataframe that contains a date column 
> out to a csv using something like this
> repartitionDf.write.format("csv") //.format(DATABRICKS_CSV)
> .option("delimiter", "\t")
> .option("header", "false")
> .option("nullValue", "?")
> .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
> .option("escape", "\\")   
> .save(tempFileName)
> Then my unit test (which passed under spark 1.6.2) fails using the spark 
> 2.1.0 snapshot build that I made today. The dataframe contained 3 values in a 
> date column.
> Expected "[2012-01-03T09:12:00
> ?
> 2015-02-23T18:00:]00", 
> but got 
> "[132561072000
> ?
> 1424743200]00"
> This means that while the null value is being correctly exported, the 
> specified dateFormat is not being used to format the date. Instead it looks 
> like number of seconds from epoch is being used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16158) Support pluggable dynamic allocation heuristics

2016-08-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421707#comment-15421707
 ] 

Reynold Xin commented on SPARK-16158:
-

Actually, this seems like a great idea, if only for the modularity of the software.

> Support pluggable dynamic allocation heuristics
> ---
>
> Key: SPARK-16158
> URL: https://issues.apache.org/jira/browse/SPARK-16158
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Nezih Yigitbasi
>
> It would be nice if Spark supports plugging in custom dynamic allocation 
> heuristics. This feature would be useful for experimenting with new 
> heuristics and also useful for plugging in different heuristics per job etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17067) Revocable resource support

2016-08-15 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421698#comment-15421698
 ] 

Michael Gummelt commented on SPARK-17067:
-

Add revocable resource support: 
http://mesos.apache.org/documentation/latest/oversubscription/

This will allow higher-priority jobs (or other, non-Spark services) to preempt 
lower-priority jobs.

> Revocable resource support
> --
>
> Key: SPARK-17067
> URL: https://issues.apache.org/jira/browse/SPARK-17067
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Michael Gummelt
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421699#comment-15421699
 ] 

Maciej Bryński commented on SPARK-16320:


[~srowen], [~michael],
I found the reason why G1GC with the default configuration is so slow on Spark 2.0.
This is the cause:
https://issues.apache.org/jira/browse/SPARK-12384

G1GC calculates the region size based on the initial heap size, which is 2G.
So the region size is 2G / 2048 = 1M.
When the heap grows up to 30G (my setting), we end up with about 30k regions, 
which is too many.
So we have to configure one of the following options:
1) Set -Xms30g
2) Set -XX:G1HeapRegionSize=16m

Both solve the problem.
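
A minimal sketch of how those two options map onto the standard Spark 
properties (the values follow the comment above; in client mode, driver-side 
JVM flags are normally passed via {{--driver-java-options}} on spark-submit 
rather than through SparkConf):

{code}
from pyspark import SparkConf, SparkContext

# Option 2 from above: keep the 30g heaps but pin the G1 region size at 16m.
conf = (SparkConf()
        .set("spark.executor.memory", "30g")
        .set("spark.driver.memory", "30g")
        .set("spark.executor.extraJavaOptions",
             "-XX:+UseG1GC -XX:G1HeapRegionSize=16m"))
# Option 1 instead would raise the initial heap, e.g.
#   .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -Xms30g")
sc = SparkContext(conf=conf)
{code}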



> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17067) Revocable resource support

2016-08-15 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-17067:
---

 Summary: Revocable resource support
 Key: SPARK-17067
 URL: https://issues.apache.org/jira/browse/SPARK-17067
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Michael Gummelt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17066) dateFormat should be used when writing dataframes as csv files

2016-08-15 Thread Barry Becker (JIRA)
Barry Becker created SPARK-17066:


 Summary: dateFormat should be used when writing dataframes as csv 
files
 Key: SPARK-17066
 URL: https://issues.apache.org/jira/browse/SPARK-17066
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.0.0
Reporter: Barry Becker


I noticed this when running tests after pulling and building @lw-lin 's PR 
(https://github.com/apache/spark/pull/14118). I don't think it is anything 
wrong with his PR, just that the fix that was made to spark-csv for this issue 
was never moved to spark 2.x when databrick's spark-csv was merged into spark 2 
back in January. https://github.com/databricks/spark-csv/issues/308 was fixed 
in spark-csv after that merge.

The problem is that if I try to write a dataframe that contains a date column 
out to a csv using something like this

repartitionDf.write.format("csv") //.format(DATABRICKS_CSV)
.option("delimiter", "\t")
.option("header", "false")
.option("nullValue", "?")
.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
.option("escape", "\\")   
.save(tempFileName)

Then my unit test (which passed under spark 1.6.2) fails using the spark 2.1.0 
snapshot build that I made today.

Expected "[2012-01-03T09:12:00
?
2015-02-23T18:00:]00", 
but got 
"[132561072000
?
1424743200]00"

This means that while the null value is being correctly exported, the specified 
dateFormat is not being used to format the date. Instead it looks like number 
of seconds from epoch is being used.
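
For anyone trying to reproduce this, a rough PySpark sketch of the same write 
path; the DataFrame contents, column name, and output path below are 
assumptions made for illustration.

{code}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DateType
import datetime

spark = SparkSession.builder.getOrCreate()

# Hypothetical single-column DataFrame with two dates and a null.
schema = StructType([StructField("d", DateType(), True)])
df = spark.createDataFrame(
    [(datetime.date(2012, 1, 3),), (None,), (datetime.date(2015, 2, 23),)],
    schema)

(df.write.format("csv")
   .option("delimiter", "\t")
   .option("header", "false")
   .option("nullValue", "?")
   .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
   .option("escape", "\\")
   .mode("overwrite")
   .save("/tmp/spark-17066-repro"))   # output path is an assumption
{code}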



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-15 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421688#comment-15421688
 ] 

Barry Becker commented on SPARK-17039:
--

I did notice that https://github.com/databricks/spark-csv/issues/308 was not 
ported to spark 2.x. IOW, when you specify dateFormat when writing with 
format("csv"), the dates get written as longs instead of dates in the 
specified dateFormat. I will open a separate issue for it.

> cannot read null dates from csv file
> 
>
> Key: SPARK-17039
> URL: https://issues.apache.org/jira/browse/SPARK-17039
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow 
> post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
>   using Spark 2.0.0 (released version).
> In scala, I read a csv using 
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false") 
>   .option("nullValue", "?")
>   .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17065) Improve the error message when encountering an incompatible DataSourceRegister

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17065:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Improve the error message when encountering an incompatible DataSourceRegister
> --
>
> Key: SPARK-17065
> URL: https://issues.apache.org/jira/browse/SPARK-17065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> When encountering an incompatible DataSourceRegister, it's better to add 
> instructions to remove or upgrade it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17065) Improve the error message when encountering an incompatible DataSourceRegister

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17065:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Improve the error message when encountering an incompatible DataSourceRegister
> --
>
> Key: SPARK-17065
> URL: https://issues.apache.org/jira/browse/SPARK-17065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> When encountering an incompatible DataSourceRegister, it's better to add 
> instructions to remove or upgrade it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17065) Improve the error message when encountering an incompatible DataSourceRegister

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421671#comment-15421671
 ] 

Apache Spark commented on SPARK-17065:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/14651

> Improve the error message when encountering an incompatible DataSourceRegister
> --
>
> Key: SPARK-17065
> URL: https://issues.apache.org/jira/browse/SPARK-17065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> When encountering an incompatible DataSourceRegister, it's better to add 
> instructions to remove or upgrade it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17065) Improve the error message when encountering an incompatible DataSourceRegister

2016-08-15 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-17065:


 Summary: Improve the error message when encountering an 
incompatible DataSourceRegister
 Key: SPARK-17065
 URL: https://issues.apache.org/jira/browse/SPARK-17065
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


When encountering an incompatible DataSourceRegister, it's better to add 
instructions to remove or upgrade it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-15 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421611#comment-15421611
 ] 

Barry Becker commented on SPARK-17039:
--

I was able to pull the patch (https://github.com/apache/spark/pull/14118) and 
verify that it fixes my issue with reading null values. I hope that this patch 
will make it in for 2.0.1. Thanks.

> cannot read null dates from csv file
> 
>
> Key: SPARK-17039
> URL: https://issues.apache.org/jira/browse/SPARK-17039
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow 
> post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
>   using Spark 2.0.0 (released version).
> In scala, I read a csv using 
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false") 
>   .option("nullValue", "?")
>   .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17064) Reconsider spark.job.interruptOnCancel

2016-08-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421610#comment-15421610
 ] 

Reynold Xin commented on SPARK-17064:
-

Yeah, so my worry is that even if Hadoop has addressed this (HDFS-1208 actually 
hasn't been resolved), the problem exists outside Hadoop: a lot of underlying 
storage clients do not handle interrupts well, and taking this away would make 
users' lives very difficult when working with other storage systems.
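
For reference, the per-job-group flag discussed here is already exposed through 
the public API; a minimal PySpark sketch (the group name and workload are made 
up for the example):

{code}
from pyspark import SparkContext

sc = SparkContext(appName="interrupt-on-cancel-sketch")

# Opt this job group into thread interruption on cancel (the default is False).
sc.setJobGroup("etl-group", "cancellable ETL jobs", interruptOnCancel=True)

rdd = sc.parallelize(range(10 ** 6)).map(lambda x: x * x)
rdd.count()  # jobs submitted from this thread now belong to "etl-group"

# From another thread, cancelling the group would interrupt running task
# threads because interruptOnCancel was set to True above:
# sc.cancelJobGroup("etl-group")
{code}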

> Reconsider spark.job.interruptOnCancel
> --
>
> Key: SPARK-17064
> URL: https://issues.apache.org/jira/browse/SPARK-17064
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Mark Hamstra
>
> There is a frequent need or desire in Spark to cancel already running Tasks.  
> This has been recognized for a very long time (see, e.g., the ancient TODO 
> comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've 
> never had more than an incomplete solution.  Killing running Tasks at the 
> Executor level has been implemented by interrupting the threads running the 
> Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since 
> https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3
>  addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting 
> Task threads in this way has only been possible if interruptThread is true, 
> and that typically comes from the setting of the interruptOnCancel property 
> in the JobGroup, which in turn typically comes from the setting of 
> spark.job.interruptOnCancel.  Because of concerns over 
> https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes 
> being marked dead when a Task thread is interrupted, the default value of the 
> boolean has been "false" -- i.e. by default we do not interrupt Tasks already 
> running on Executor even when the Task has been canceled in the DAGScheduler, 
> or the Stage has been aborted, or the Job has been killed, etc.
> There are several issues resulting from this current state of affairs, and 
> they each probably need to spawn their own JIRA issue and PR once we decide 
> on an overall strategy here.  Among those issues:
> * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS 
> versions that Spark now supports so that we can set the default value of 
> spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
> * Even if interrupting Task threads is no longer an issue for HDFS, is it 
> still enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still 
> need protection similar to what the current default value of 
> spark.job.interruptOnCancel provides?
> * If interrupting Task threads isn't safe enough, what should we do instead?
> * Once we have a safe mechanism to stop and clean up after already executing 
> Tasks, there is still the question of whether we _should_ end executing 
> Tasks.  While that is likely a good thing to do in cases where individual 
> Tasks are lightweight in terms of resource usage, at least in some cases not 
> all running Tasks should be ended: https://github.com/apache/spark/pull/12436 
>  That means that we probably need to continue to make allowing Task 
> interruption configurable at the Job or JobGroup level (and we need better 
> documentation explaining how and when to allow interruption or not.)
> * There is one place in the current code 
> (TaskSetManager#handleSuccessfulTask) that hard-codes interruptThread to 
> "true".  This should be fixed, and similar misuses of killTask should be denied 
> in pull requests until this issue is adequately resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-16700.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14469
[https://github.com/apache/spark/pull/14469]

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Davies Liu
> Fix For: 2.1.0
>
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python dict type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type <type 'dict'>
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16321) [Spark 2.0] Performance regression when reading parquet and using PPD and non-vectorized reader

2016-08-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421468#comment-15421468
 ] 

Maciej Bryński edited comment on SPARK-16321 at 8/15/16 7:34 PM:
-

[~davies]
I think you can mark this one as resolved.


was (Author: maver1ck):
[~davies]
I think you mark this one as resolved.

> [Spark 2.0] Performance regression when reading parquet and using PPD and 
> non-vectorized reader
> ---
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Liang-Chi Hsieh
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
> Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, 
> spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, 
> spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, 
> visualvm_spark2_G1GC.png
>
>
> *UPDATE*
> Please start with this comment 
> https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785
> I assume that problem results from the performance problem with reading 
> parquet files
> *Original Issue description*
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
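A minimal diagnostic sketch for this report (Scala, assuming the spark-shell {{spark}} SparkSession in 2.0 and a placeholder Parquet path): both configuration keys below are standard Spark SQL options, and timing the same query with them toggled separates the filter-pushdown cost from the non-vectorized reader cost.

{code}
// Diagnostic sketch only: rerun the query with each combination of these two
// settings to see whether pushdown or the non-vectorized reader is the slow path.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

val df = spark.read.parquet("/path/to/parquet")   // placeholder path
df.where("id > 0").count()                        // time this under each setting
{code}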



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16321) [Spark 2.0] Performance regression when reading parquet and using PPD and non-vectorized reader

2016-08-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16321.
---
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.1.0
   2.0.1

Resolved by https://github.com/apache/spark/pull/13701

> [Spark 2.0] Performance regression when reading parquet and using PPD and 
> non-vectorized reader
> ---
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Liang-Chi Hsieh
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
> Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, 
> spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, 
> spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, 
> visualvm_spark2_G1GC.png
>
>
> *UPDATE*
> Please start with this comment 
> https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785
> I assume the problem results from a performance problem with reading 
> parquet files.
> *Original Issue description*
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions), and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0

2016-08-15 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421524#comment-15421524
 ] 

Sital Kedia commented on SPARK-16922:
-

Yes, I have the above-mentioned PR as well. 

> Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
> -
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
> decimal(20,0)) as bigint)))
>+- *Filter isnotnull(field4#156)
>   +- HiveTableScan [field4#156, field1#160], 
> MetastoreRelation foo, table2, b
> {code}
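As a stop-gap while the OOM itself is investigated, a hedged workaround sketch (Scala, assuming the spark-shell {{spark}} session): disabling automatic broadcast makes the planner fall back from BroadcastHashJoin to a sort-merge join, so the hashed relation is never built.

{code}
// Workaround sketch only, not the fix for this ticket.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Tiny check: with the threshold at -1 the printed plan shows SortMergeJoin
// instead of BroadcastHashJoin.
val a = spark.range(100).toDF("id")
val b = spark.range(100).toDF("id")
a.join(b, "id").explain()
{code}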



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17064) Reconsider spark.job.interruptOnCancel

2016-08-15 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-17064:
-
Description: 
There is a frequent need or desire in Spark to cancel already running Tasks.  
This has been recognized for a very long time (see, e.g., the ancient TODO 
comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've 
never had more than an incomplete solution.  Killing running Tasks at the 
Executor level has been implemented by interrupting the threads running the 
Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since 
https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 
addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting Task 
threads in this way has only been possible if interruptThread is true, and that 
typically comes from the setting of the interruptOnCancel property in the 
JobGroup, which in turn typically comes from the setting of 
spark.job.interruptOnCancel.  Because of concerns over 
https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes 
being marked dead when a Task thread is interrupted, the default value of the 
boolean has been "false" -- i.e. by default we do not interrupt Tasks already 
running on Executor even when the Task has been canceled in the DAGScheduler, 
or the Stage has been abort, or the Job has been killed, etc.

There are several issues resulting from this current state of affairs, and they 
each probably need to spawn their own JIRA issue and PR once we decide on an 
overall strategy here.  Among those issues:

* Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS 
versions that Spark now supports so that we can set the default value of 
spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
* Even if interrupting Task threads is no longer an issue for HDFS, is it still 
enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still need 
protection similar to what the current default value of 
spark.job.interruptOnCancel provides?
* If interrupting Task threads isn't safe enough, what should we do instead?
* Once we have a safe mechanism to stop and clean up after already executing 
Tasks, there is still the question of whether we _should_ end executing Tasks.  
While that is likely a good thing to do in cases where individual Tasks are 
lightweight in terms of resource usage, at least in some cases not all running 
Tasks should be ended: https://github.com/apache/spark/pull/12436  That means 
that we probably need to continue to make allowing Task interruption 
configurable at the Job or JobGroup level (and we need better documentation 
explaining how and when to allow interruption or not.)
* There is one place in the current code (TaskSetManager#handleSuccessfulTask) 
that hard codes interruptThread to "true".  This should be fixed, and similar 
misuses of killTask be denied in pull requests until this issue is adequately 
resolved.   

  was:
There is a frequent need or desire in Spark to cancel already running Tasks.  
This has been recognized for a very long time (see, e.g., the ancient TODO 
comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've 
never had more than an incomplete solution.  Killing running Tasks at the 
Executor level has been implemented by interrupting the threads running the 
Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since 
https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 
addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting Task 
threads in this way has only been possible if interruptThread is true, and that 
typically comes from the setting of the interruptOnCancel property in the 
JobGroup, which in turn typically comes from the setting of 
spark.job.interruptOnCancel.  Because of concerns over 
https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes 
being marked dead when a Task thread is interrupted, the default value of the 
boolean has been "false" -- i.e. by default we do not interrupt Tasks already 
running on Executor even when the Task has been canceled in the DAGScheduler, 
or the Stage has been abort, or the Job has been killed, etc.

There are several issues resulting from this current state of affairs, and they 
each probably need to spawn their own JIRA issue and PR once we decide on an 
overall strategy here.  Among those issues:

* Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS 
versions that Spark now supports so that we set the default value of 
spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
* Even if interrupting Task threads is no longer an issue for HDFS, is it still 
enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still need 
protection similar to what the current 

[jira] [Commented] (SPARK-17064) Reconsider spark.job.interruptOnCancel

2016-08-15 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421503#comment-15421503
 ] 

Mark Hamstra commented on SPARK-17064:
--

[~kayousterhout] [~r...@databricks.com] [~imranr]

> Reconsider spark.job.interruptOnCancel
> --
>
> Key: SPARK-17064
> URL: https://issues.apache.org/jira/browse/SPARK-17064
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Mark Hamstra
>
> There is a frequent need or desire in Spark to cancel already running Tasks.  
> This has been recognized for a very long time (see, e.g., the ancient TODO 
> comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've 
> never had more than an incomplete solution.  Killing running Tasks at the 
> Executor level has been implemented by interrupting the threads running the 
> Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since 
> https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3
>  addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting 
> Task threads in this way has only been possible if interruptThread is true, 
> and that typically comes from the setting of the interruptOnCancel property 
> in the JobGroup, which in turn typically comes from the setting of 
> spark.job.interruptOnCancel.  Because of concerns over 
> https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes 
> being marked dead when a Task thread is interrupted, the default value of the 
> boolean has been "false" -- i.e. by default we do not interrupt Tasks already 
> running on Executor even when the Task has been canceled in the DAGScheduler, 
> or the Stage has been abort, or the Job has been killed, etc.
> There are several issues resulting from this current state of affairs, and 
> they each probably need to spawn their own JIRA issue and PR once we decide 
> on an overall strategy here.  Among those issues:
> * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS 
> versions that Spark now supports so that we can set the default value of 
> spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
> * Even if interrupting Task threads is no longer an issue for HDFS, is it 
> still enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still 
> need protection similar to what the current default value of 
> spark.job.interruptOnCancel provides?
> * If interrupting Task threads isn't safe enough, what should we do instead?
> * Once we have a safe mechanism to stop and clean up after already executing 
> Tasks, there is still the question of whether we _should_ end executing 
> Tasks.  While that is likely a good thing to do in cases where individual 
> Tasks are lightweight in terms of resource usage, at least in some cases not 
> all running Tasks should be ended: https://github.com/apache/spark/pull/12436 
>  That means that we probably need to continue to make allowing Task 
> interruption configurable at the Job or JobGroup level (and we need better 
> documentation explaining how and when to allow interruption or not.)
> * There is one place in the current code 
> (TaskSetManager#handleSuccessfulTask) that hard codes interruptThread to 
> "true".  This should be fixed, and similar misuses of killTask be denied in 
> pull requests until this issue is adequately resolved.   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17064) Reconsider spark.job.interruptOnCancel

2016-08-15 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-17064:


 Summary: Reconsider spark.job.interruptOnCancel
 Key: SPARK-17064
 URL: https://issues.apache.org/jira/browse/SPARK-17064
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Spark Core
Reporter: Mark Hamstra


There is a frequent need or desire in Spark to cancel already running Tasks.  
This has been recognized for a very long time (see, e.g., the ancient TODO 
comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've 
never had more than an incomplete solution.  Killing running Tasks at the 
Executor level has been implemented by interrupting the threads running the 
Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since 
https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 
addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting Task 
threads in this way has only been possible if interruptThread is true, and that 
typically comes from the setting of the interruptOnCancel property in the 
JobGroup, which in turn typically comes from the setting of 
spark.job.interruptOnCancel.  Because of concerns over 
https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes 
being marked dead when a Task thread is interrupted, the default value of the 
boolean has been "false" -- i.e. by default we do not interrupt Tasks already 
running on Executor even when the Task has been canceled in the DAGScheduler, 
or the Stage has been abort, or the Job has been killed, etc.

There are several issues resulting from this current state of affairs, and they 
each probably need to spawn their own JIRA issue and PR once we decide on an 
overall strategy here.  Among those issues:

* Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS 
versions that Spark now supports so that we can set the default value of 
spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
* Even if interrupting Task threads is no longer an issue for HDFS, is it still 
enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still need 
protection similar to what the current default value of 
spark.job.interruptOnCancel provides?
* If interrupting Task threads isn't safe enough, what should we do instead?
* Once we have a safe mechanism to stop and clean up after already executing 
Tasks, there is still the question of whether we _should_ end executing Tasks.  
While that is likely a good thing to do in cases where individual Tasks are 
lightweight in terms of resource usage, at least in some cases not all running 
Tasks should be ended: https://github.com/apache/spark/pull/12436  That means 
that we probably need to continue to make allowing Task interruption 
configurable at the Job or JobGroup level (and we need better documentation 
explaining how and when to allow interruption or not.)
* There is one place in the current code (TaskSetManager#handleSuccessfulTask) 
that hard codes interruptThread to "true".  This should be fixed, and similar 
misuses of killTask be denied in pull requests until this issue is adequately 
resolved.   
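A brief illustration of the knob discussed above (Scala, assuming a {{SparkContext}} named {{sc}} as in spark-shell); these are existing SparkContext methods, shown only to make the interruptOnCancel path concrete:

{code}
// With interruptOnCancel = true, cancelling the group lets Task#kill interrupt
// the task threads on the executors; with the default (false) they keep running.
sc.setJobGroup("my-group", "interruptible job group", interruptOnCancel = true)
// ... run actions that belong to this job group ...
sc.cancelJobGroup("my-group")
{code}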



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0

2016-08-15 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421484#comment-15421484
 ] 

Davies Liu commented on SPARK-16922:


Do you also have this one? https://github.com/apache/spark/pull/14373

> Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
> -
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
> decimal(20,0)) as bigint)))
>+- *Filter isnotnull(field4#156)
>   +- HiveTableScan [field4#156, field1#160], 
> MetastoreRelation foo, table2, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11714) Make Spark on Mesos honor port restrictions

2016-08-15 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421478#comment-15421478
 ] 

Charles Allen commented on SPARK-11714:
---

Awesome! Thanks guys!

> Make Spark on Mesos honor port restrictions
> ---
>
> Key: SPARK-11714
> URL: https://issues.apache.org/jira/browse/SPARK-11714
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
>Assignee: Stavros Kontopoulos
> Fix For: 2.1.0
>
>
> Currently the MesosSchedulerBackend does not make any effort to honor "ports" 
> as a resource offer in Mesos. This ask is to have the ports which the 
> executor binds to honor the limits of the "ports" resource of an offer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16321) [Spark 2.0] Performance regression when reading parquet and using PPD and non-vectorized reader

2016-08-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421468#comment-15421468
 ] 

Maciej Bryński commented on SPARK-16321:


[~davies]
I think you can mark this one as resolved.

> [Spark 2.0] Performance regression when reading parquet and using PPD and 
> non-vectorized reader
> ---
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, 
> spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, 
> spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, 
> visualvm_spark2_G1GC.png
>
>
> *UPDATE*
> Please start with this comment 
> https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785
> I assume the problem results from a performance problem with reading 
> parquet files.
> *Original Issue description*
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions), and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16087) Spark Hangs When Using Union With Persisted Hadoop RDD

2016-08-15 Thread Nick Sakovich (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421457#comment-15421457
 ] 

Nick Sakovich commented on SPARK-16087:
---

[~kevinconaway], [~srowen] today I hit the same issue. My app works correctly 
with a locally installed Redis, but when I try to use a remote Redis instance the 
app just gets stuck. It also only happens when I use the "union" feature. I use 
Spark 1.6.2 and just run the app locally without Hadoop. Do you have any update 
on what could be wrong here?

> Spark Hangs When Using Union With Persisted Hadoop RDD
> --
>
> Key: SPARK-16087
> URL: https://issues.apache.org/jira/browse/SPARK-16087
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1, 1.6.1
>Reporter: Kevin Conaway
>Priority: Critical
> Attachments: SPARK-16087.dump.log, SPARK-16087.log, Screen Shot 
> 2016-06-21 at 4.27.26 PM.png, Screen Shot 2016-06-21 at 4.27.35 PM.png, 
> part-0, part-1, spark-16087.tar.gz
>
>
> Spark hangs when materializing a persisted RDD that was built from a Hadoop 
> sequence file and then union-ed with a similar RDD.
> Below is a small file that exhibits the issue:
> {code:java}
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaPairRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.api.java.function.PairFunction;
> import org.apache.spark.serializer.KryoSerializer;
> import org.apache.spark.storage.StorageLevel;
> import scala.Tuple2;
> public class SparkBug {
> public static void main(String [] args) throws Exception {
> JavaSparkContext sc = new JavaSparkContext(
> new SparkConf()
> .set("spark.serializer", KryoSerializer.class.getName())
> .set("spark.master", "local[*]")
> .setAppName(SparkBug.class.getName())
> );
> JavaPairRDD<LongWritable, BytesWritable> rdd1 = sc.sequenceFile(
> "hdfs://localhost:9000/part-0",
> LongWritable.class,
> BytesWritable.class
> ).mapToPair(new PairFunction<Tuple2<LongWritable, BytesWritable>, 
> LongWritable, BytesWritable>() {
> @Override
> public Tuple2<LongWritable, BytesWritable> 
> call(Tuple2<LongWritable, BytesWritable> tuple) throws Exception {
> return new Tuple2<>(
> new LongWritable(tuple._1.get()),
> new BytesWritable(tuple._2.copyBytes())
> );
> }
> }).persist(
> StorageLevel.MEMORY_ONLY()
> );
> System.out.println("Before union: " + rdd1.count());
> JavaPairRDD<LongWritable, BytesWritable> rdd2 = sc.sequenceFile(
> "hdfs://localhost:9000/part-1",
> LongWritable.class,
> BytesWritable.class
> );
> JavaPairRDD<LongWritable, BytesWritable> joined = rdd1.union(rdd2);
> System.out.println("After union: " + joined.count());
> }
> }
> {code}
> You'll need to upload the attached part-0 and part-1 to a local hdfs 
> instance (I'm just using a dummy [Single Node 
> Cluster|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html]
>  locally).
> Some things to note:
> - It does not hang if rdd1 is not persisted
> - It does not hang if rdd1 is not materialized (via calling rdd1.count()) 
> before the union-ed RDD is materialized
> - It does not hang if the mapToPair() transformation is removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16671) Merge variable substitution code in core and SQL

2016-08-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16671.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.1.0

> Merge variable substitution code in core and SQL
> 
>
> Key: SPARK-16671
> URL: https://issues.apache.org/jira/browse/SPARK-16671
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> SPARK-16272 added support for variable substitution in configs in the core 
> Spark configuration. That code has a lot of similarities with SQL's 
> {{VariableSubstitution}}, and we should make both use the same code as much 
> as possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16508) Fix documentation warnings found by R CMD check

2016-08-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421393#comment-15421393
 ] 

Shivaram Venkataraman commented on SPARK-16508:
---

We merged https://github.com/apache/spark/pull/14522 but I'm keeping the JIRA 
open till we merge PR 14558

> Fix documentation warnings found by R CMD check
> ---
>
> Key: SPARK-16508
> URL: https://issues.apache.org/jira/browse/SPARK-16508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> A full list of warnings after the fixes in SPARK-16507 is at 
> https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17054) SparkR can not run in yarn-cluster mode on mac os

2016-08-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421351#comment-15421351
 ] 

Shivaram Venkataraman commented on SPARK-17054:
---

[~zjffdu] We added this new feature as a part of SPARK-15799 -- the main idea 
here is that if users only download the R package from CRAN and don't have Spark 
installed on their machines, then we will auto-download Spark. Note that this 
should only happen when the Spark jars cannot be found using SPARK_HOME -- if 
you launch SparkR using bin/sparkR or bin/spark-submit, it should find the jars 
that are part of Spark.

Could you give us more details on how you launched SparkR and any other 
configuration used?

> SparkR can not run in yarn-cluster mode on mac os
> -
>
> Key: SPARK-17054
> URL: https://issues.apache.org/jira/browse/SPARK-17054
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> This is because it downloads SparkR to the wrong place.
> {noformat}
> Warning message:
> 'sparkR.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated")
> Spark not found in SPARK_HOME:  .
> To search in the cache directory. Installation will start if not found.
> Mirror site not provided.
> Looking for site suggested from apache website...
> Preferred mirror site found: http://apache.mirror.cdnetworks.com/spark
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from:
> - 
> http://apache.mirror.cdnetworks.com/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
> Fetch failed from http://apache.mirror.cdnetworks.com/spark
>  open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', 
> reason 'No such file or directory'>
> To use backup site...
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from:
> - 
> http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
> Fetch failed from http://www-us.apache.org/dist/spark
>  open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', 
> reason 'No such file or directory'>
> Error in robust_download_tar(mirrorUrl, version, hadoopVersion, packageName,  
> :
>   Unable to download Spark spark-2.0.0 for Hadoop 2.7. Please check network 
> connection, Hadoop version, or provide other mirror sites.
> Calls: sparkRSQL.init ... sparkR.session -> install.spark -> 
> robust_download_tar
> In addition: Warning messages:
> 1: 'sparkRSQL.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated")
> 2: In dir.create(localDir, recursive = TRUE) :
>   cannot create dir '/home//Library', reason 'Operation not supported'
> Execution halted
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend

2016-08-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421336#comment-15421336
 ] 

Shivaram Venkataraman commented on SPARK-16578:
---

[~zjffdu] The main goal I had for this ticket was to enable running the R 
frontend on one machine (say a user's laptop) while running the 
SparkContext JVM on a different machine (say on a cluster). It will be 
interesting to see if we can support more of the use cases needed by downstream 
projects like zeppelin / livy etc. 

I think [~junyangq] has been doing some work on this and can give us some more 
details on the changes.

> Configurable hostname for RBackend
> --
>
> Key: SPARK-16578
> URL: https://issues.apache.org/jira/browse/SPARK-16578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> One of the requirements that comes up with SparkR being a standalone package 
> is that users can now install just the R package on the client side and 
> connect to a remote machine which runs the RBackend class.
> We should check if we can support this mode of execution and what are the 
> pros / cons of it



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17063) MSCK REPAIR TABLE is super slow with Hive metastore

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17063:


Assignee: Apache Spark  (was: Davies Liu)

> MSCK REPAIR TABLE is super slow with Hive metastore
> ---
>
> Key: SPARK-17063
> URL: https://issues.apache.org/jira/browse/SPARK-17063
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> When repairing a table with thousands of partitions, it could take hundreds of 
> seconds; the Hive metastore can only add a few partitions per second, because 
> it will list all the files for each partition to gather the fast stats 
> (number of files, total size of files).
> We could improve this by listing the files in Spark in parallel, then sending 
> the fast stats to the Hive metastore to avoid this sequential listing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17063) MSCK REPAIR TABLE is super slow with Hive metastore

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17063:


Assignee: Davies Liu  (was: Apache Spark)

> MSCK REPAIR TABLE is super slow with Hive metastore
> ---
>
> Key: SPARK-17063
> URL: https://issues.apache.org/jira/browse/SPARK-17063
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> When repairing a table with thousands of partitions, it could take hundreds of 
> seconds; the Hive metastore can only add a few partitions per second, because 
> it will list all the files for each partition to gather the fast stats 
> (number of files, total size of files).
> We could improve this by listing the files in Spark in parallel, then sending 
> the fast stats to the Hive metastore to avoid this sequential listing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17063) MSCK REPAIR TABLE is super slow with Hive metastore

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421335#comment-15421335
 ] 

Apache Spark commented on SPARK-17063:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/14607

> MSCK REPAIR TABLE is super slow with Hive metastore
> ---
>
> Key: SPARK-17063
> URL: https://issues.apache.org/jira/browse/SPARK-17063
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> When repairing a table with thousands of partitions, it could take hundreds of 
> seconds; the Hive metastore can only add a few partitions per second, because 
> it will list all the files for each partition to gather the fast stats 
> (number of files, total size of files).
> We could improve this by listing the files in Spark in parallel, then sending 
> the fast stats to the Hive metastore to avoid this sequential listing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17063) MSCK REPAIR TABLE is super slow with Hive metastore

2016-08-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-17063:
--

 Summary: MSCK REPAIR TABLE is super slow with Hive metastore
 Key: SPARK-17063
 URL: https://issues.apache.org/jira/browse/SPARK-17063
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


When repairing a table with thousands of partitions, it could take hundreds of 
seconds; the Hive metastore can only add a few partitions per second, because it 
will list all the files for each partition to gather the fast stats (number of 
files, total size of files).

We could improve this by listing the files in Spark in parallel, then sending 
the fast stats to the Hive metastore to avoid this sequential listing.
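A rough sketch of the parallel-listing idea above (Scala; the partition paths and the {{sc}} SparkContext are illustrative assumptions -- the real change is in the linked pull request):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: compute the "fast stats" (file count, total size) for each
// partition directory on the executors, then hand them to the metastore,
// instead of letting the metastore list files one partition at a time.
val partitionPaths = Seq("/warehouse/t/ds=2016-08-01", "/warehouse/t/ds=2016-08-02") // hypothetical
val fastStats = sc.parallelize(partitionPaths, partitionPaths.size).map { p =>
  val fs = FileSystem.get(new Configuration())
  val files = fs.listStatus(new Path(p)).filter(_.isFile)
  (p, files.length, files.map(_.getLen).sum)  // (path, numFiles, totalSize)
}.collect()
{code}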



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16158) Support pluggable dynamic allocation heuristics

2016-08-15 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421327#comment-15421327
 ] 

Nezih Yigitbasi commented on SPARK-16158:
-

[~andrewor14] [~rxin] how do you guys feel about this?

> Support pluggable dynamic allocation heuristics
> ---
>
> Key: SPARK-16158
> URL: https://issues.apache.org/jira/browse/SPARK-16158
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Nezih Yigitbasi
>
> It would be nice if Spark supports plugging in custom dynamic allocation 
> heuristics. This feature would be useful for experimenting with new 
> heuristics and also useful for plugging in different heuristics per job etc.
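Purely to make the request concrete, a hypothetical sketch (Scala) of what a pluggable heuristic could look like; neither this trait nor the made-up configuration key in the comments exists in Spark today:

{code}
// Hypothetical interface sketch only -- not an existing Spark API.
// The idea: ExecutorAllocationManager would delegate its target-executor
// calculation to an implementation selected by a (made-up) config key such as
// "spark.dynamicAllocation.heuristic.class".
trait AllocationHeuristic {
  /** Desired number of executors given the current load. */
  def targetExecutors(pendingTasks: Int, runningTasks: Int, currentExecutors: Int): Int
}

class BacklogProportionalHeuristic extends AllocationHeuristic {
  override def targetExecutors(pendingTasks: Int, runningTasks: Int, currentExecutors: Int): Int =
    math.max(currentExecutors, (pendingTasks + runningTasks + 1) / 2)  // toy policy
}
{code}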



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17062) Add --conf to mesos dispatcher process

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17062:


Assignee: Apache Spark

> Add --conf to mesos dispatcher process
> --
>
> Key: SPARK-17062
> URL: https://issues.apache.org/jira/browse/SPARK-17062
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Apache Spark
>
> Sometimes we simply need to add a property in Spark Config for the Mesos 
> Dispatcher. The only option right now is to create a property file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17062) Add --conf to mesos dispatcher process

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17062:


Assignee: (was: Apache Spark)

> Add --conf to mesos dispatcher process
> --
>
> Key: SPARK-17062
> URL: https://issues.apache.org/jira/browse/SPARK-17062
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>
> Sometimes we simply need to add a property in Spark Config for the Mesos 
> Dispatcher. The only option right now is to create a property file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17062) Add --conf to mesos dispatcher process

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421315#comment-15421315
 ] 

Apache Spark commented on SPARK-17062:
--

User 'skonto' has created a pull request for this issue:
https://github.com/apache/spark/pull/14650

> Add --conf to mesos dispatcher process
> --
>
> Key: SPARK-17062
> URL: https://issues.apache.org/jira/browse/SPARK-17062
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>
> Sometimes we simply need to add a property in Spark Config for the Mesos 
> Dispatcher. The only option right now is to create a property file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17062) Add --conf to mesos dispatcher process

2016-08-15 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-17062:
---

 Summary: Add --conf to mesos dispatcher process
 Key: SPARK-17062
 URL: https://issues.apache.org/jira/browse/SPARK-17062
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.0.0
Reporter: Stavros Kontopoulos


Sometimes we simply need to add a property in Spark Config for the Mesos 
Dispatcher. The only option right now is to create a property file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result

2016-08-15 Thread Jamie Hutton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421306#comment-15421306
 ] 

Jamie Hutton commented on SPARK-15002:
--

Hi Sean,

Looking at the stack traces on the executors, quite a few of them have the 
following stack trace - which mentions "fillInStackTrace" - so it looks to me 
like Kryo is going wrong.

java.lang.Throwable.fillInStackTrace(Native Method)
java.lang.Throwable.fillInStackTrace(Throwable.java:783)
java.lang.Throwable.<init>(Throwable.java:265)
java.lang.Exception.<init>(Exception.java:66)
java.lang.ReflectiveOperationException.<init>(ReflectiveOperationException.java:56)
java.lang.NoSuchMethodException.<init>(NoSuchMethodException.java:51)
java.lang.Class.getConstructor0(Class.java:3082)
java.lang.Class.getConstructor(Class.java:1825)
com.esotericsoftware.kryo.factories.ReflectionSerializerFactory.makeSerializer(ReflectionSerializerFactory.java:60)
com.esotericsoftware.kryo.factories.ReflectionSerializerFactory.makeSerializer(ReflectionSerializerFactory.java:45)
com.esotericsoftware.kryo.Kryo.getDefaultSerializer(Kryo.java:359)
com.esotericsoftware.kryo.Kryo.register(Kryo.java:394)
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$1.apply(KryoSerializer.scala:97)
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$1.apply(KryoSerializer.scala:96)
scala.collection.immutable.List.foreach(List.scala:381)
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:96)
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:274)
org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:259)
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:175)
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:59)
org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)

(the last 3 lines above repeat a number of times in the trace so I truncated 
them)


The main thread is clearly locked:

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)
org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)

> Calling unpersist can cause spark to hang indefinitely when writing out a 
> result
> 
>
> Key: SPARK-15002
> URL: https://issues.apache.org/jira/browse/SPARK-15002
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
> Environment: AWS and Linux VM, both in spark-shell and spark-submit. 
> tested in 1.5.2 and 1.6. Tested in 2.0
>Reporter: Jamie Hutton
>
> The following code will cause Spark to hang indefinitely. It happens when you 
> have an unpersist which is followed by some further processing of that data 
> (in my case writing it out).
> I first experienced this issue with GraphX, so my example below involves 
> GraphX code; however, I suspect it might be more of a core issue than a GraphX 
> one. I have raised another bug with similar results (indefinite hanging) but 
> in different circumstances, so these may well be linked.
> Code to reproduce (can be run in spark-shell):
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.types._
> import org.apache.spark.sql._
> val r = scala.util.Random
> val list = (0L to 500L).map(i=>(i,r.nextInt(500).asInstanceOf[Long],"LABEL"))
> val distData = sc.parallelize(list)
> val edgesRDD = distData.map(x => Edge(x._1, x._2, x._3))
> val distinctNodes = distData.flatMap{row => Iterable((row._1, ("A")),(row._2, 
> ("B")))}.distinct()
> val nodesRDD: RDD[(VertexId, (String))] = distinctNodes
> val graph = Graph(nodesRDD, edgesRDD)
> graph.persist()
> val ccGraph = graph.connectedComponents()
> ccGraph.cache
>
> val schema = 

[jira] [Commented] (SPARK-17054) SparkR can not run in yarn-cluster mode on mac os

2016-08-15 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421288#comment-15421288
 ] 

Miao Wang commented on SPARK-17054:
---

I use a Mac and build from source; sparkR works fine. How can I reproduce this 
issue? Thanks!

> SparkR can not run in yarn-cluster mode on mac os
> -
>
> Key: SPARK-17054
> URL: https://issues.apache.org/jira/browse/SPARK-17054
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> This is because it downloads SparkR to the wrong place.
> {noformat}
> Warning message:
> 'sparkR.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated")
> Spark not found in SPARK_HOME:  .
> To search in the cache directory. Installation will start if not found.
> Mirror site not provided.
> Looking for site suggested from apache website...
> Preferred mirror site found: http://apache.mirror.cdnetworks.com/spark
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from:
> - 
> http://apache.mirror.cdnetworks.com/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
> Fetch failed from http://apache.mirror.cdnetworks.com/spark
>  open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', 
> reason 'No such file or directory'>
> To use backup site...
> Downloading Spark spark-2.0.0 for Hadoop 2.7 from:
> - 
> http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
> Fetch failed from http://www-us.apache.org/dist/spark
>  open destfile '/home//Library/Caches/spark/spark-2.0.0-bin-hadoop2.7.tgz', 
> reason 'No such file or directory'>
> Error in robust_download_tar(mirrorUrl, version, hadoopVersion, packageName,  
> :
>   Unable to download Spark spark-2.0.0 for Hadoop 2.7. Please check network 
> connection, Hadoop version, or provide other mirror sites.
> Calls: sparkRSQL.init ... sparkR.session -> install.spark -> 
> robust_download_tar
> In addition: Warning messages:
> 1: 'sparkRSQL.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated")
> 2: In dir.create(localDir, recursive = TRUE) :
>   cannot create dir '/home//Library', reason 'Operation not supported'
> Execution halted
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100

2016-08-15 Thread Jamie Hutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jamie Hutton updated SPARK-17061:
-
Affects Version/s: 2.0.1

> Incorrect results returned following a join of two datasets and a map step 
> where total number of columns >100
> -
>
> Key: SPARK-17061
> URL: https://issues.apache.org/jira/browse/SPARK-17061
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Jamie Hutton
>Priority: Critical
>
> We have hit a consistent bug where we have a dataset with more than 100 
> columns. I am raising this as a blocker because Spark is returning the WRONG 
> results rather than erroring, leading to data integrity issues.
> I have put together the following test case which will show the issue (it 
> will run in spark-shell). In this example I am joining a dataset with lots of 
> fields onto another dataset. 
> The join works fine and if you show the dataset you will get the expected 
> result. However if you run a map step over the dataset you end up with a 
> strange error where the sequence that is in the right dataset now only 
> contains the last value.
> Whilst this test may seem a rather contrived example, what we are doing here 
> is a very standard analytical pattern. My original code was designed to:
>  - take a dataset of child records
>  - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children])
>  - join the children onto the parent by parentID: giving 
> ((Parent),(ParentID,Seq[Children])
>  - map over the result to give a tuple of (Parent,Seq[Children])
> Notes:
> - The issue is resolved by having fewer fields - as soon as we go <= 100 the 
> integrity issue goes away. Try removing one of the fields from BigCaseClass 
> below
> - The issue will arise based on the total number of fields in the resulting 
> dataset. Below I have a small case class and a big case class, but two case 
> classes of 50 variables each would give the same issue.
> - the issue occurs where the case class being joined on (on the right) has a 
> case class type. It doesn't occur if you have a Seq[String].
> - If I go back to an RDD for the map step after the join I can work around the 
> issue, but I lose all the benefits of datasets (a short sketch of that workaround follows this list)
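To make the workaround in the last note concrete, a small hedged sketch (Scala; {{bigDS}} and {{smallDS}} are assumed Datasets of the two case classes defined in the test case below): only the map step after the join drops to the RDD API, which sidesteps the bad Dataset re-encoding.

{code}
// Workaround sketch only; bigDS and smallDS are illustrative names.
val joined = bigDS.joinWith(smallDS, bigDS("field1") === smallDS("joinkey"))
// Map over the joined RDD instead of the Dataset to avoid the >100-column encoder issue.
val result = joined.rdd.map { case (big, small) => (big, small.names) }
result.collect().foreach(println)
{code}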
> Scala code test case:
>   case class Name(name: String)
>   case class SmallCaseClass (joinkey: Integer, names: Seq[Name])
>   case class BigCaseClass  (field1: Integer,field2: Integer,field3: 
> Integer,field4: Integer,field5: Integer,field6: Integer,field7: 
> Integer,field8: Integer,field9: Integer,field10: Integer,field11: 
> Integer,field12: Integer,field13: Integer,field14: Integer,field15: 
> Integer,field16: Integer,field17: Integer,field18: Integer,field19: 
> Integer,field20: Integer,field21: Integer,field22: Integer,field23: 
> Integer,field24: Integer,field25: Integer,field26: Integer,field27: 
> Integer,field28: Integer,field29: Integer,field30: Integer,field31: 
> Integer,field32: Integer,field33: Integer,field34: Integer,field35: 
> Integer,field36: Integer,field37: Integer,field38: Integer,field39: 
> Integer,field40: Integer,field41: Integer,field42: Integer,field43: 
> Integer,field44: Integer,field45: Integer,field46: Integer,field47: 
> Integer,field48: Integer,field49: Integer,field50: Integer,field51: 
> Integer,field52: Integer,field53: Integer,field54: Integer,field55: 
> Integer,field56: Integer,field57: Integer,field58: Integer,field59: 
> Integer,field60: Integer,field61: Integer,field62: Integer,field63: 
> Integer,field64: Integer,field65: Integer,field66: Integer,field67: 
> Integer,field68: Integer,field69: Integer,field70: Integer,field71: 
> Integer,field72: Integer,field73: Integer,field74: Integer,field75: 
> Integer,field76: Integer,field77: Integer,field78: Integer,field79: 
> Integer,field80: Integer,field81: Integer,field82: Integer,field83: 
> Integer,field84: Integer,field85: Integer,field86: Integer,field87: 
> Integer,field88: Integer,field89: Integer,field90: Integer,field91: 
> Integer,field92: Integer,field93: Integer,field94: Integer,field95: 
> Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer)
>   
> val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
> 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
> 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 
> 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 
> 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
> 91, 92, 93, 94, 95, 96, 97, 98, 99))
> 
> val smallCC=Seq(SmallCaseClass(1,Seq(
> Name("Jamie"), 
> Name("Ian"),
> Name("Dave"),
> 

[jira] [Reopened] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100

2016-08-15 Thread Jamie Hutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jamie Hutton reopened SPARK-17061:
--

Tested in a 2.0.1 nightly snapshot and still not resolved, so this appears not to 
be a dupe.

> Incorrect results returned following a join of two datasets and a map step 
> where total number of columns >100
> -
>
> Key: SPARK-17061
> URL: https://issues.apache.org/jira/browse/SPARK-17061
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jamie Hutton
>Priority: Critical
>
> We have hit a consistent bug where we have a dataset with more than 100 
> columns. I am raising this as a blocker because Spark is returning the WRONG 
> results rather than erroring, leading to data integrity issues.
> I have put together the following test case, which will show the issue (it 
> will run in spark-shell). In this example I am joining a dataset with lots of 
> fields onto another dataset. 
> The join works fine, and if you show the dataset you will get the expected 
> result. However, if you run a map step over the dataset you end up with a 
> strange error where the sequence that is in the right dataset now only 
> contains the last value.
> Whilst this test may seem a rather contrived example, what we are doing here 
> is a very standard analytical pattern. My original code was designed to:
>  - take a dataset of child records
>  - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children])
>  - join the children onto the parent by parentID: giving 
> ((Parent), (ParentID, Seq[Children]))
>  - map over the result to give a tuple of (Parent, Seq[Children])
> Notes:
> - The issue is resolved by having fewer fields - as soon as we go <= 100, the 
> integrity issue goes away. Try removing one of the fields from BigCaseClass 
> below.
> - The issue arises based on the total number of fields in the resulting 
> dataset. Below I have a small case class and a big case class, but two case 
> classes of 50 variables would give the same issue.
> - The issue occurs where the case class being joined on (on the right) has a 
> case class type. It doesn't occur if you have a Seq[String].
> - If I go back to an RDD for the map step after the join I can work around the 
> issue, but I lose all the benefits of Datasets.
> Scala code test case:
>   case class Name(name: String)
>   case class SmallCaseClass (joinkey: Integer, names: Seq[Name])
>   case class BigCaseClass  (field1: Integer,field2: Integer,field3: 
> Integer,field4: Integer,field5: Integer,field6: Integer,field7: 
> Integer,field8: Integer,field9: Integer,field10: Integer,field11: 
> Integer,field12: Integer,field13: Integer,field14: Integer,field15: 
> Integer,field16: Integer,field17: Integer,field18: Integer,field19: 
> Integer,field20: Integer,field21: Integer,field22: Integer,field23: 
> Integer,field24: Integer,field25: Integer,field26: Integer,field27: 
> Integer,field28: Integer,field29: Integer,field30: Integer,field31: 
> Integer,field32: Integer,field33: Integer,field34: Integer,field35: 
> Integer,field36: Integer,field37: Integer,field38: Integer,field39: 
> Integer,field40: Integer,field41: Integer,field42: Integer,field43: 
> Integer,field44: Integer,field45: Integer,field46: Integer,field47: 
> Integer,field48: Integer,field49: Integer,field50: Integer,field51: 
> Integer,field52: Integer,field53: Integer,field54: Integer,field55: 
> Integer,field56: Integer,field57: Integer,field58: Integer,field59: 
> Integer,field60: Integer,field61: Integer,field62: Integer,field63: 
> Integer,field64: Integer,field65: Integer,field66: Integer,field67: 
> Integer,field68: Integer,field69: Integer,field70: Integer,field71: 
> Integer,field72: Integer,field73: Integer,field74: Integer,field75: 
> Integer,field76: Integer,field77: Integer,field78: Integer,field79: 
> Integer,field80: Integer,field81: Integer,field82: Integer,field83: 
> Integer,field84: Integer,field85: Integer,field86: Integer,field87: 
> Integer,field88: Integer,field89: Integer,field90: Integer,field91: 
> Integer,field92: Integer,field93: Integer,field94: Integer,field95: 
> Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer)
>   
> val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
> 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
> 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 
> 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 
> 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
> 91, 92, 93, 94, 95, 96, 97, 98, 99))
> 
> val smallCC=Seq(SmallCaseClass(1,Seq(
> Name("Jamie"), 
>
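
The quoted test case above is cut off by the list archive. For readers following the 
thread, here is a minimal, illustrative sketch of the groupByKey / join / map pattern 
the report describes. The Parent/Child classes below are hypothetical and have only a 
few fields, so this shows the shape of the code rather than reproducing the 
>100-column bug itself; it is intended to run in spark-shell.

{code}
// Illustrative sketch only: hypothetical Parent/Child classes, not the reporter's code.
case class Child(parentId: Int, name: String)
case class Parent(parentId: Int, label: String)

import spark.implicits._

val children = Seq(Child(1, "a"), Child(1, "b"), Child(2, "c")).toDS()
val parents  = Seq(Parent(1, "p1"), Parent(2, "p2")).toDS()

// group the children up to their parent id: Dataset[(Int, Seq[Child])]
val grouped = children.groupByKey(_.parentId).mapGroups((id, rows) => (id, rows.toSeq))

// join the grouped children onto the parents, then map to (Parent, Seq[Child])
val joined = parents.joinWith(grouped, parents("parentId") === grouped("_1"))
val result = joined.map { case (p, (_, kids)) => (p, kids) }
result.show(false)
{code}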

[jira] [Commented] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100

2016-08-15 Thread Jamie Hutton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421277#comment-15421277
 ] 

Jamie Hutton commented on SPARK-17061:
--

I have just downloaded the 2.0.1 nightly build from here:
http://people.apache.org/~pwendell/spark-nightly/spark-branch-2.0-bin/spark-2.0.1-SNAPSHOT-2016_08_15_00_23-e02d0d0-bin/

Unfortunately the issue is still present. I am going to downgrade the issue to 
critical as you suggested and re-open it, if that's OK. Please can you take a 
look at it?

Thanks

> Incorrect results returned following a join of two datasets and a map step 
> where total number of columns >100
> -
>
> Key: SPARK-17061
> URL: https://issues.apache.org/jira/browse/SPARK-17061
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jamie Hutton
>Priority: Critical
>
> We have hit a consistent bug where we have a dataset with more than 100 
> columns. I am raising this as a blocker because Spark is returning the WRONG 
> results rather than erroring, leading to data integrity issues.
> I have put together the following test case, which will show the issue (it 
> will run in spark-shell). In this example I am joining a dataset with lots of 
> fields onto another dataset. 
> The join works fine, and if you show the dataset you will get the expected 
> result. However, if you run a map step over the dataset you end up with a 
> strange error where the sequence that is in the right dataset now only 
> contains the last value.
> Whilst this test may seem a rather contrived example, what we are doing here 
> is a very standard analytical pattern. My original code was designed to:
>  - take a dataset of child records
>  - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children])
>  - join the children onto the parent by parentID: giving 
> ((Parent), (ParentID, Seq[Children]))
>  - map over the result to give a tuple of (Parent, Seq[Children])
> Notes:
> - The issue is resolved by having fewer fields - as soon as we go <= 100, the 
> integrity issue goes away. Try removing one of the fields from BigCaseClass 
> below.
> - The issue arises based on the total number of fields in the resulting 
> dataset. Below I have a small case class and a big case class, but two case 
> classes of 50 variables would give the same issue.
> - The issue occurs where the case class being joined on (on the right) has a 
> case class type. It doesn't occur if you have a Seq[String].
> - If I go back to an RDD for the map step after the join I can work around the 
> issue, but I lose all the benefits of Datasets.
> Scala code test case:
>   case class Name(name: String)
>   case class SmallCaseClass (joinkey: Integer, names: Seq[Name])
>   case class BigCaseClass  (field1: Integer,field2: Integer,field3: 
> Integer,field4: Integer,field5: Integer,field6: Integer,field7: 
> Integer,field8: Integer,field9: Integer,field10: Integer,field11: 
> Integer,field12: Integer,field13: Integer,field14: Integer,field15: 
> Integer,field16: Integer,field17: Integer,field18: Integer,field19: 
> Integer,field20: Integer,field21: Integer,field22: Integer,field23: 
> Integer,field24: Integer,field25: Integer,field26: Integer,field27: 
> Integer,field28: Integer,field29: Integer,field30: Integer,field31: 
> Integer,field32: Integer,field33: Integer,field34: Integer,field35: 
> Integer,field36: Integer,field37: Integer,field38: Integer,field39: 
> Integer,field40: Integer,field41: Integer,field42: Integer,field43: 
> Integer,field44: Integer,field45: Integer,field46: Integer,field47: 
> Integer,field48: Integer,field49: Integer,field50: Integer,field51: 
> Integer,field52: Integer,field53: Integer,field54: Integer,field55: 
> Integer,field56: Integer,field57: Integer,field58: Integer,field59: 
> Integer,field60: Integer,field61: Integer,field62: Integer,field63: 
> Integer,field64: Integer,field65: Integer,field66: Integer,field67: 
> Integer,field68: Integer,field69: Integer,field70: Integer,field71: 
> Integer,field72: Integer,field73: Integer,field74: Integer,field75: 
> Integer,field76: Integer,field77: Integer,field78: Integer,field79: 
> Integer,field80: Integer,field81: Integer,field82: Integer,field83: 
> Integer,field84: Integer,field85: Integer,field86: Integer,field87: 
> Integer,field88: Integer,field89: Integer,field90: Integer,field91: 
> Integer,field92: Integer,field93: Integer,field94: Integer,field95: 
> Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer)
>   
> val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
> 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
> 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 

[jira] [Updated] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100

2016-08-15 Thread Jamie Hutton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jamie Hutton updated SPARK-17061:
-
Priority: Critical  (was: Blocker)

> Incorrect results returned following a join of two datasets and a map step 
> where total number of columns >100
> -
>
> Key: SPARK-17061
> URL: https://issues.apache.org/jira/browse/SPARK-17061
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jamie Hutton
>Priority: Critical
>
> We have hit a consistent bug where we have a dataset with more than 100 
> columns. I am raising this as a blocker because Spark is returning the WRONG 
> results rather than erroring, leading to data integrity issues.
> I have put together the following test case, which will show the issue (it 
> will run in spark-shell). In this example I am joining a dataset with lots of 
> fields onto another dataset. 
> The join works fine, and if you show the dataset you will get the expected 
> result. However, if you run a map step over the dataset you end up with a 
> strange error where the sequence that is in the right dataset now only 
> contains the last value.
> Whilst this test may seem a rather contrived example, what we are doing here 
> is a very standard analytical pattern. My original code was designed to:
>  - take a dataset of child records
>  - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children])
>  - join the children onto the parent by parentID: giving 
> ((Parent), (ParentID, Seq[Children]))
>  - map over the result to give a tuple of (Parent, Seq[Children])
> Notes:
> - The issue is resolved by having fewer fields - as soon as we go <= 100, the 
> integrity issue goes away. Try removing one of the fields from BigCaseClass 
> below.
> - The issue arises based on the total number of fields in the resulting 
> dataset. Below I have a small case class and a big case class, but two case 
> classes of 50 variables would give the same issue.
> - The issue occurs where the case class being joined on (on the right) has a 
> case class type. It doesn't occur if you have a Seq[String].
> - If I go back to an RDD for the map step after the join I can work around the 
> issue, but I lose all the benefits of Datasets.
> Scala code test case:
>   case class Name(name: String)
>   case class SmallCaseClass (joinkey: Integer, names: Seq[Name])
>   case class BigCaseClass  (field1: Integer,field2: Integer,field3: 
> Integer,field4: Integer,field5: Integer,field6: Integer,field7: 
> Integer,field8: Integer,field9: Integer,field10: Integer,field11: 
> Integer,field12: Integer,field13: Integer,field14: Integer,field15: 
> Integer,field16: Integer,field17: Integer,field18: Integer,field19: 
> Integer,field20: Integer,field21: Integer,field22: Integer,field23: 
> Integer,field24: Integer,field25: Integer,field26: Integer,field27: 
> Integer,field28: Integer,field29: Integer,field30: Integer,field31: 
> Integer,field32: Integer,field33: Integer,field34: Integer,field35: 
> Integer,field36: Integer,field37: Integer,field38: Integer,field39: 
> Integer,field40: Integer,field41: Integer,field42: Integer,field43: 
> Integer,field44: Integer,field45: Integer,field46: Integer,field47: 
> Integer,field48: Integer,field49: Integer,field50: Integer,field51: 
> Integer,field52: Integer,field53: Integer,field54: Integer,field55: 
> Integer,field56: Integer,field57: Integer,field58: Integer,field59: 
> Integer,field60: Integer,field61: Integer,field62: Integer,field63: 
> Integer,field64: Integer,field65: Integer,field66: Integer,field67: 
> Integer,field68: Integer,field69: Integer,field70: Integer,field71: 
> Integer,field72: Integer,field73: Integer,field74: Integer,field75: 
> Integer,field76: Integer,field77: Integer,field78: Integer,field79: 
> Integer,field80: Integer,field81: Integer,field82: Integer,field83: 
> Integer,field84: Integer,field85: Integer,field86: Integer,field87: 
> Integer,field88: Integer,field89: Integer,field90: Integer,field91: 
> Integer,field92: Integer,field93: Integer,field94: Integer,field95: 
> Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer)
>   
> val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
> 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
> 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 
> 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 
> 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
> 91, 92, 93, 94, 95, 96, 97, 98, 99))
> 
> val smallCC=Seq(SmallCaseClass(1,Seq(
> Name("Jamie"), 
> Name("Ian"),
> Name("Dave"),
> 

[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result

2016-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421234#comment-15421234
 ] 

Sean Owen commented on SPARK-15002:
---

Yeah, I just mean it ought to be fine and there's no obvious reason something 
would fail. If nothing is RUNNING ... can you search for "deadlock"? The JVM 
can identify some deadlocks by itself. It's sometimes worth browsing any 
"waiting on lock" messages in the trace to see if something is suspicious. I'm 
wondering, however, how you can have 100% CPU but no running threads... are we 
talking about the same thing in both cases? An executor?

> Calling unpersist can cause spark to hang indefinitely when writing out a 
> result
> 
>
> Key: SPARK-15002
> URL: https://issues.apache.org/jira/browse/SPARK-15002
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
> Environment: AWS and Linux VM, both in spark-shell and spark-submit. 
> tested in 1.5.2 and 1.6. Tested in 2.0
>Reporter: Jamie Hutton
>
> The following code will cause Spark to hang indefinitely. It happens when you 
> have an unpersist which is followed by some further processing of that data 
> (in my case writing it out).
> I first experienced this issue with GraphX, so my example below involves 
> GraphX code; however, I suspect it might be more of a core issue than a GraphX 
> one. I have raised another bug with similar results (indefinite hanging) but 
> in different circumstances, so these may well be linked.
> Code to reproduce (can be run in spark-shell):
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.types._
> import org.apache.spark.sql._
> val r = scala.util.Random
> val list = (0L to 500L).map(i=>(i,r.nextInt(500).asInstanceOf[Long],"LABEL"))
> val distData = sc.parallelize(list)
> val edgesRDD = distData.map(x => Edge(x._1, x._2, x._3))
> val distinctNodes = distData.flatMap{row => Iterable((row._1, ("A")),(row._2, 
> ("B")))}.distinct()
> val nodesRDD: RDD[(VertexId, (String))] = distinctNodes
> val graph = Graph(nodesRDD, edgesRDD)
> graph.persist()
> val ccGraph = graph.connectedComponents()
> ccGraph.cache
>
> val schema = StructType(Seq(StructField("id", LongType, false), 
> StructField("netid", LongType, false)))
> val 
> rdd=ccGraph.vertices.map(row=>(Row(row._1.asInstanceOf[Long],row._2.asInstanceOf[Long])))
> val builtEntityDF = sqlContext.createDataFrame(rdd, schema)
>  
> /*this unpersist step causes the issue*/
> ccGraph.unpersist()
>  
> /*write step hangs for ever*/
> builtEntityDF.write.format("parquet").mode("overwrite").save("/user/root/writetest.parquet")
>  
> If you take out the ccGraph.unpersist() step the write step completes 
> successfully



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
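
The suggestion above about letting the JVM detect deadlocks can also be checked 
programmatically from inside the JVM being inspected. The snippet below is only an 
illustrative sketch using the standard {{java.lang.management}} API; for a stuck 
executor process, looking at {{jstack <pid>}} output (which prints a "Found ... 
deadlock" section when one exists) is usually the more practical route.

{code}
// Illustrative sketch: ask the current JVM for threads it considers deadlocked.
// Must run inside the JVM being inspected (e.g. pasted into a spark-shell driver).
import java.lang.management.ManagementFactory

val tmx = ManagementFactory.getThreadMXBean
// findDeadlockedThreads returns null when no deadlock is detected
val deadlocked = Option(tmx.findDeadlockedThreads()).getOrElse(Array.empty[Long])
if (deadlocked.isEmpty) {
  println("No monitor/ownable-synchronizer deadlocks detected")
} else {
  tmx.getThreadInfo(deadlocked).foreach(info => println(info))
}
{code}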



[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result

2016-08-15 Thread Jamie Hutton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421192#comment-15421192
 ] 

Jamie Hutton commented on SPARK-15002:
--

I took a look at the executors and there is nothing in a RUNNING state. 

I do agree that there is no major reason to call unpersist in the above code, 
but I put that example together as the simplest possible test case I could find 
that would replicate it. In the actual code we have written we do wish to use 
unpersist, and we ended up in this hanging state.

> Calling unpersist can cause spark to hang indefinitely when writing out a 
> result
> 
>
> Key: SPARK-15002
> URL: https://issues.apache.org/jira/browse/SPARK-15002
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
> Environment: AWS and Linux VM, both in spark-shell and spark-submit. 
> tested in 1.5.2 and 1.6. Tested in 2.0
>Reporter: Jamie Hutton
>
> The following code will cause Spark to hang indefinitely. It happens when you 
> have an unpersist which is followed by some further processing of that data 
> (in my case writing it out).
> I first experienced this issue with GraphX, so my example below involves 
> GraphX code; however, I suspect it might be more of a core issue than a GraphX 
> one. I have raised another bug with similar results (indefinite hanging) but 
> in different circumstances, so these may well be linked.
> Code to reproduce (can be run in spark-shell):
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.types._
> import org.apache.spark.sql._
> val r = scala.util.Random
> val list = (0L to 500L).map(i=>(i,r.nextInt(500).asInstanceOf[Long],"LABEL"))
> val distData = sc.parallelize(list)
> val edgesRDD = distData.map(x => Edge(x._1, x._2, x._3))
> val distinctNodes = distData.flatMap{row => Iterable((row._1, ("A")),(row._2, 
> ("B")))}.distinct()
> val nodesRDD: RDD[(VertexId, (String))] = distinctNodes
> val graph = Graph(nodesRDD, edgesRDD)
> graph.persist()
> val ccGraph = graph.connectedComponents()
> ccGraph.cache
>
> val schema = StructType(Seq(StructField("id", LongType, false), 
> StructField("netid", LongType, false)))
> val 
> rdd=ccGraph.vertices.map(row=>(Row(row._1.asInstanceOf[Long],row._2.asInstanceOf[Long])))
> val builtEntityDF = sqlContext.createDataFrame(rdd, schema)
>  
> /*this unpersist step causes the issue*/
> ccGraph.unpersist()
>  
> /*write step hangs for ever*/
> builtEntityDF.write.format("parquet").mode("overwrite").save("/user/root/writetest.parquet")
>  
> If you take out the ccGraph.unpersist() step the write step completes 
> successfully



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result

2016-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421186#comment-15421186
 ] 

Sean Owen commented on SPARK-15002:
---

It would be the workers. Oh, and I meant RUNNING rather than RUNNABLE. The idea 
is: where are the running tasks working when you snapshot them? A couple of looks 
at a couple of executors should reveal if something is stuck or just busy in a 
particular piece of code. That's the easiest first step to figure out if it's 
just that something is busy. It may not be. There's no particularly good reason 
that removing unpersist() there should help. All it does is make it like you 
never persisted it to begin with, and it's only touched once anyway.

> Calling unpersist can cause spark to hang indefinitely when writing out a 
> result
> 
>
> Key: SPARK-15002
> URL: https://issues.apache.org/jira/browse/SPARK-15002
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
> Environment: AWS and Linux VM, both in spark-shell and spark-submit. 
> tested in 1.5.2 and 1.6. Tested in 2.0
>Reporter: Jamie Hutton
>
> The following code will cause Spark to hang indefinitely. It happens when you 
> have an unpersist which is followed by some further processing of that data 
> (in my case writing it out).
> I first experienced this issue with GraphX, so my example below involves 
> GraphX code; however, I suspect it might be more of a core issue than a GraphX 
> one. I have raised another bug with similar results (indefinite hanging) but 
> in different circumstances, so these may well be linked.
> Code to reproduce (can be run in spark-shell):
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.types._
> import org.apache.spark.sql._
> val r = scala.util.Random
> val list = (0L to 500L).map(i=>(i,r.nextInt(500).asInstanceOf[Long],"LABEL"))
> val distData = sc.parallelize(list)
> val edgesRDD = distData.map(x => Edge(x._1, x._2, x._3))
> val distinctNodes = distData.flatMap{row => Iterable((row._1, ("A")),(row._2, 
> ("B")))}.distinct()
> val nodesRDD: RDD[(VertexId, (String))] = distinctNodes
> val graph = Graph(nodesRDD, edgesRDD)
> graph.persist()
> val ccGraph = graph.connectedComponents()
> ccGraph.cache
>
> val schema = StructType(Seq(StructField("id", LongType, false), 
> StructField("netid", LongType, false)))
> val 
> rdd=ccGraph.vertices.map(row=>(Row(row._1.asInstanceOf[Long],row._2.asInstanceOf[Long])))
> val builtEntityDF = sqlContext.createDataFrame(rdd, schema)
>  
> /*this unpersist step causes the issue*/
> ccGraph.unpersist()
>  
> /*write step hangs for ever*/
> builtEntityDF.write.format("parquet").mode("overwrite").save("/user/root/writetest.parquet")
>  
> If you take out the ccGraph.unpersist() step the write step completes 
> successfully



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100

2016-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421178#comment-15421178
 ] 

Sean Owen commented on SPARK-17061:
---

It is likely to be a duplicate -- see also SPARK-17043. At least, I'd try a version with 
this fix before reopening this one.

> Incorrect results returned following a join of two datasets and a map step 
> where total number of columns >100
> -
>
> Key: SPARK-17061
> URL: https://issues.apache.org/jira/browse/SPARK-17061
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jamie Hutton
>Priority: Blocker
>
> We have hit a consistent bug where we have a dataset with more than 100 
> columns. I am raising this as a blocker because Spark is returning the WRONG 
> results rather than erroring, leading to data integrity issues.
> I have put together the following test case, which will show the issue (it 
> will run in spark-shell). In this example I am joining a dataset with lots of 
> fields onto another dataset. 
> The join works fine, and if you show the dataset you will get the expected 
> result. However, if you run a map step over the dataset you end up with a 
> strange error where the sequence that is in the right dataset now only 
> contains the last value.
> Whilst this test may seem a rather contrived example, what we are doing here 
> is a very standard analytical pattern. My original code was designed to:
>  - take a dataset of child records
>  - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children])
>  - join the children onto the parent by parentID: giving 
> ((Parent), (ParentID, Seq[Children]))
>  - map over the result to give a tuple of (Parent, Seq[Children])
> Notes:
> - The issue is resolved by having fewer fields - as soon as we go <= 100, the 
> integrity issue goes away. Try removing one of the fields from BigCaseClass 
> below.
> - The issue arises based on the total number of fields in the resulting 
> dataset. Below I have a small case class and a big case class, but two case 
> classes of 50 variables would give the same issue.
> - The issue occurs where the case class being joined on (on the right) has a 
> case class type. It doesn't occur if you have a Seq[String].
> - If I go back to an RDD for the map step after the join I can work around the 
> issue, but I lose all the benefits of Datasets.
> Scala code test case:
>   case class Name(name: String)
>   case class SmallCaseClass (joinkey: Integer, names: Seq[Name])
>   case class BigCaseClass  (field1: Integer,field2: Integer,field3: 
> Integer,field4: Integer,field5: Integer,field6: Integer,field7: 
> Integer,field8: Integer,field9: Integer,field10: Integer,field11: 
> Integer,field12: Integer,field13: Integer,field14: Integer,field15: 
> Integer,field16: Integer,field17: Integer,field18: Integer,field19: 
> Integer,field20: Integer,field21: Integer,field22: Integer,field23: 
> Integer,field24: Integer,field25: Integer,field26: Integer,field27: 
> Integer,field28: Integer,field29: Integer,field30: Integer,field31: 
> Integer,field32: Integer,field33: Integer,field34: Integer,field35: 
> Integer,field36: Integer,field37: Integer,field38: Integer,field39: 
> Integer,field40: Integer,field41: Integer,field42: Integer,field43: 
> Integer,field44: Integer,field45: Integer,field46: Integer,field47: 
> Integer,field48: Integer,field49: Integer,field50: Integer,field51: 
> Integer,field52: Integer,field53: Integer,field54: Integer,field55: 
> Integer,field56: Integer,field57: Integer,field58: Integer,field59: 
> Integer,field60: Integer,field61: Integer,field62: Integer,field63: 
> Integer,field64: Integer,field65: Integer,field66: Integer,field67: 
> Integer,field68: Integer,field69: Integer,field70: Integer,field71: 
> Integer,field72: Integer,field73: Integer,field74: Integer,field75: 
> Integer,field76: Integer,field77: Integer,field78: Integer,field79: 
> Integer,field80: Integer,field81: Integer,field82: Integer,field83: 
> Integer,field84: Integer,field85: Integer,field86: Integer,field87: 
> Integer,field88: Integer,field89: Integer,field90: Integer,field91: 
> Integer,field92: Integer,field93: Integer,field94: Integer,field95: 
> Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer)
>   
> val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
> 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
> 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 
> 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 
> 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
> 91, 92, 93, 94, 95, 96, 97, 98, 99))
> 
> val 

[jira] [Assigned] (SPARK-17059) Allow FileFormat to specify partition pruning strategy

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17059:


Assignee: Apache Spark

> Allow FileFormat to specify partition pruning strategy
> --
>
> Key: SPARK-17059
> URL: https://issues.apache.org/jira/browse/SPARK-17059
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
>Assignee: Apache Spark
>
> Allow Spark to have pluggable pruning of input files for FileSourceScanExec 
> by allowing FileFormat implementations to specify a format-specific filterPartitions method.
> This is especially useful for Parquet as Spark does not currently make use of 
> the summary metadata, instead reading the footer of all part files for a 
> Parquet data source. This can lead to massive speedups when reading a 
> filtered chunk of a dataset, especially when using remote storage (S3).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
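
The exact API is whatever the linked pull request ends up with; purely as an 
illustration of the idea described above, a format-specific pruning hook might look 
roughly like the sketch below. The trait, case class, and method names are 
hypothetical, not Spark's actual interface.

{code}
// Purely illustrative sketch of the idea behind SPARK-17059: let a file format
// prune input files itself before the scan is planned.
// PrunableFileFormat and PartitionFiles are hypothetical names, not Spark API.
import org.apache.spark.sql.sources.Filter

case class PartitionFiles(partitionValues: Seq[Any], files: Seq[String])

trait PrunableFileFormat {
  // Default behaviour: keep everything (no format-specific pruning).
  def filterPartitions(filters: Seq[Filter],
                       partitions: Seq[PartitionFiles]): Seq[PartitionFiles] = partitions
}

// A Parquet-like format could override filterPartitions to consult summary
// metadata (e.g. row-group statistics) and drop files that cannot match the
// pushed-down filters, avoiding footer reads over remote storage such as S3.
{code}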



[jira] [Commented] (SPARK-17059) Allow FileFormat to specify partition pruning strategy

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421183#comment-15421183
 ] 

Apache Spark commented on SPARK-17059:
--

User 'andreweduffy' has created a pull request for this issue:
https://github.com/apache/spark/pull/14649

> Allow FileFormat to specify partition pruning strategy
> --
>
> Key: SPARK-17059
> URL: https://issues.apache.org/jira/browse/SPARK-17059
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
>
> Allow Spark to have pluggable pruning of input files for FileSourceScanExec 
> by allowing FileFormat implementations to specify a format-specific filterPartitions method.
> This is especially useful for Parquet as Spark does not currently make use of 
> the summary metadata, instead reading the footer of all part files for a 
> Parquet data source. This can lead to massive speedups when reading a 
> filtered chunk of a dataset, especially when using remote storage (S3).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17059) Allow FileFormat to specify partition pruning strategy

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17059:


Assignee: (was: Apache Spark)

> Allow FileFormat to specify partition pruning strategy
> --
>
> Key: SPARK-17059
> URL: https://issues.apache.org/jira/browse/SPARK-17059
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
>
> Allow Spark to have pluggable pruning of input files for FileSourceScanExec 
> by allowing FileFormat implementations to specify a format-specific filterPartitions method.
> This is especially useful for Parquet as Spark does not currently make use of 
> the summary metadata, instead reading the footer of all part files for a 
> Parquet data source. This can lead to massive speedups when reading a 
> filtered chunk of a dataset, especially when using remote storage (S3).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result

2016-08-15 Thread Jamie Hutton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421182#comment-15421182
 ] 

Jamie Hutton commented on SPARK-15002:
--

GC time is 1.5 seconds and does not increase whilst in the hanging state.
The runnable tasks are:
Signal Dispatcher
shuffle-server-0 (x2)
executor task launch worker (x8)
ResponseProcessor for block 
BP-267552868-172.16.137.143-1457691099567:blk_1073937571_196881

Does any of the above help?

> Calling unpersist can cause spark to hang indefinitely when writing out a 
> result
> 
>
> Key: SPARK-15002
> URL: https://issues.apache.org/jira/browse/SPARK-15002
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
> Environment: AWS and Linux VM, both in spark-shell and spark-submit. 
> tested in 1.5.2 and 1.6. Tested in 2.0
>Reporter: Jamie Hutton
>
> The following code will cause Spark to hang indefinitely. It happens when you 
> have an unpersist which is followed by some further processing of that data 
> (in my case writing it out).
> I first experienced this issue with GraphX, so my example below involves 
> GraphX code; however, I suspect it might be more of a core issue than a GraphX 
> one. I have raised another bug with similar results (indefinite hanging) but 
> in different circumstances, so these may well be linked.
> Code to reproduce (can be run in spark-shell):
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.types._
> import org.apache.spark.sql._
> val r = scala.util.Random
> val list = (0L to 500L).map(i=>(i,r.nextInt(500).asInstanceOf[Long],"LABEL"))
> val distData = sc.parallelize(list)
> val edgesRDD = distData.map(x => Edge(x._1, x._2, x._3))
> val distinctNodes = distData.flatMap{row => Iterable((row._1, ("A")),(row._2, 
> ("B")))}.distinct()
> val nodesRDD: RDD[(VertexId, (String))] = distinctNodes
> val graph = Graph(nodesRDD, edgesRDD)
> graph.persist()
> val ccGraph = graph.connectedComponents()
> ccGraph.cache
>
> val schema = StructType(Seq(StructField("id", LongType, false), 
> StructField("netid", LongType, false)))
> val 
> rdd=ccGraph.vertices.map(row=>(Row(row._1.asInstanceOf[Long],row._2.asInstanceOf[Long])))
> val builtEntityDF = sqlContext.createDataFrame(rdd, schema)
>  
> /*this unpersist step causes the issue*/
> ccGraph.unpersist()
>  
> /*write step hangs for ever*/
> builtEntityDF.write.format("parquet").mode("overwrite").save("/user/root/writetest.parquet")
>  
> If you take out the ccGraph.unpersist() step the write step completes 
> successfully



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100

2016-08-15 Thread Jamie Hutton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421169#comment-15421169
 ] 

Jamie Hutton commented on SPARK-17061:
--

Apologies for setting blocker. I won't use that again.

Is the above definitely the same issue as the one you marked as a duplicate? 
That does seem to be slightly different as it relates to persist and 200 
columns. Did you manage to run the above test case on a 2.0.1 codebase and did 
it work?

> Incorrect results returned following a join of two datasets and a map step 
> where total number of columns >100
> -
>
> Key: SPARK-17061
> URL: https://issues.apache.org/jira/browse/SPARK-17061
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jamie Hutton
>Priority: Blocker
>
> We have hit a consistent bug where we have a dataset with more than 100 
> columns. I am raising this as a blocker because Spark is returning the WRONG 
> results rather than erroring, leading to data integrity issues.
> I have put together the following test case, which will show the issue (it 
> will run in spark-shell). In this example I am joining a dataset with lots of 
> fields onto another dataset. 
> The join works fine, and if you show the dataset you will get the expected 
> result. However, if you run a map step over the dataset you end up with a 
> strange error where the sequence that is in the right dataset now only 
> contains the last value.
> Whilst this test may seem a rather contrived example, what we are doing here 
> is a very standard analytical pattern. My original code was designed to:
>  - take a dataset of child records
>  - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children])
>  - join the children onto the parent by parentID: giving 
> ((Parent), (ParentID, Seq[Children]))
>  - map over the result to give a tuple of (Parent, Seq[Children])
> Notes:
> - The issue is resolved by having fewer fields - as soon as we go <= 100, the 
> integrity issue goes away. Try removing one of the fields from BigCaseClass 
> below.
> - The issue arises based on the total number of fields in the resulting 
> dataset. Below I have a small case class and a big case class, but two case 
> classes of 50 variables would give the same issue.
> - The issue occurs where the case class being joined on (on the right) has a 
> case class type. It doesn't occur if you have a Seq[String].
> - If I go back to an RDD for the map step after the join I can work around the 
> issue, but I lose all the benefits of Datasets.
> Scala code test case:
>   case class Name(name: String)
>   case class SmallCaseClass (joinkey: Integer, names: Seq[Name])
>   case class BigCaseClass  (field1: Integer,field2: Integer,field3: 
> Integer,field4: Integer,field5: Integer,field6: Integer,field7: 
> Integer,field8: Integer,field9: Integer,field10: Integer,field11: 
> Integer,field12: Integer,field13: Integer,field14: Integer,field15: 
> Integer,field16: Integer,field17: Integer,field18: Integer,field19: 
> Integer,field20: Integer,field21: Integer,field22: Integer,field23: 
> Integer,field24: Integer,field25: Integer,field26: Integer,field27: 
> Integer,field28: Integer,field29: Integer,field30: Integer,field31: 
> Integer,field32: Integer,field33: Integer,field34: Integer,field35: 
> Integer,field36: Integer,field37: Integer,field38: Integer,field39: 
> Integer,field40: Integer,field41: Integer,field42: Integer,field43: 
> Integer,field44: Integer,field45: Integer,field46: Integer,field47: 
> Integer,field48: Integer,field49: Integer,field50: Integer,field51: 
> Integer,field52: Integer,field53: Integer,field54: Integer,field55: 
> Integer,field56: Integer,field57: Integer,field58: Integer,field59: 
> Integer,field60: Integer,field61: Integer,field62: Integer,field63: 
> Integer,field64: Integer,field65: Integer,field66: Integer,field67: 
> Integer,field68: Integer,field69: Integer,field70: Integer,field71: 
> Integer,field72: Integer,field73: Integer,field74: Integer,field75: 
> Integer,field76: Integer,field77: Integer,field78: Integer,field79: 
> Integer,field80: Integer,field81: Integer,field82: Integer,field83: 
> Integer,field84: Integer,field85: Integer,field86: Integer,field87: 
> Integer,field88: Integer,field89: Integer,field90: Integer,field91: 
> Integer,field92: Integer,field93: Integer,field94: Integer,field95: 
> Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer)
>   
> val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
> 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
> 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 
> 53, 54, 55, 56, 57, 58, 59, 60, 

[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result

2016-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421164#comment-15421164
 ] 

Sean Owen commented on SPARK-15002:
---

In the UI, go look at a thread dump of the pegged executor. It should show you 
RUNNABLE threads, and it ought to be sort of clear where they're busy. Also 
check GC time. If it's a non-trivial fraction of execution time, at least it's 
clear that somehow it's memory pressure.

> Calling unpersist can cause spark to hang indefinitely when writing out a 
> result
> 
>
> Key: SPARK-15002
> URL: https://issues.apache.org/jira/browse/SPARK-15002
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
> Environment: AWS and Linux VM, both in spark-shell and spark-submit. 
> tested in 1.5.2 and 1.6. Tested in 2.0
>Reporter: Jamie Hutton
>
> The following code will cause Spark to hang indefinitely. It happens when you 
> have an unpersist which is followed by some further processing of that data 
> (in my case writing it out).
> I first experienced this issue with GraphX, so my example below involves 
> GraphX code; however, I suspect it might be more of a core issue than a GraphX 
> one. I have raised another bug with similar results (indefinite hanging) but 
> in different circumstances, so these may well be linked.
> Code to reproduce (can be run in spark-shell):
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.types._
> import org.apache.spark.sql._
> val r = scala.util.Random
> val list = (0L to 500L).map(i=>(i,r.nextInt(500).asInstanceOf[Long],"LABEL"))
> val distData = sc.parallelize(list)
> val edgesRDD = distData.map(x => Edge(x._1, x._2, x._3))
> val distinctNodes = distData.flatMap{row => Iterable((row._1, ("A")),(row._2, 
> ("B")))}.distinct()
> val nodesRDD: RDD[(VertexId, (String))] = distinctNodes
> val graph = Graph(nodesRDD, edgesRDD)
> graph.persist()
> val ccGraph = graph.connectedComponents()
> ccGraph.cache
>
> val schema = StructType(Seq(StructField("id", LongType, false), 
> StructField("netid", LongType, false)))
> val 
> rdd=ccGraph.vertices.map(row=>(Row(row._1.asInstanceOf[Long],row._2.asInstanceOf[Long])))
> val builtEntityDF = sqlContext.createDataFrame(rdd, schema)
>  
> /*this unpersist step causes the issue*/
> ccGraph.unpersist()
>  
> /*write step hangs for ever*/
> builtEntityDF.write.format("parquet").mode("overwrite").save("/user/root/writetest.parquet")
>  
> If you take out the ccGraph.unpersist() step the write step completes 
> successfully



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15002) Calling unpersist can cause spark to hang indefinitely when writing out a result

2016-08-15 Thread Jamie Hutton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421160#comment-15421160
 ] 

Jamie Hutton commented on SPARK-15002:
--

Hi Sean. There are no errors. When I run the code above, the progress bar 
appears but never moves. The CPU is pegged at 100%. I have done a thread dump, 
but I am not really sure how to interpret it - if you run the code above it's very 
simple to reproduce. 

If you remove the "unpersist" step in the code the entire script runs in a 
couple of seconds, so it is not an issue with a complex task. 

> Calling unpersist can cause spark to hang indefinitely when writing out a 
> result
> 
>
> Key: SPARK-15002
> URL: https://issues.apache.org/jira/browse/SPARK-15002
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
> Environment: AWS and Linux VM, both in spark-shell and spark-submit. 
> tested in 1.5.2 and 1.6. Tested in 2.0
>Reporter: Jamie Hutton
>
> The following code will cause Spark to hang indefinitely. It happens when you 
> have an unpersist which is followed by some further processing of that data 
> (in my case writing it out).
> I first experienced this issue with GraphX, so my example below involves 
> GraphX code; however, I suspect it might be more of a core issue than a GraphX 
> one. I have raised another bug with similar results (indefinite hanging) but 
> in different circumstances, so these may well be linked.
> Code to reproduce (can be run in spark-shell):
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.types._
> import org.apache.spark.sql._
> val r = scala.util.Random
> val list = (0L to 500L).map(i=>(i,r.nextInt(500).asInstanceOf[Long],"LABEL"))
> val distData = sc.parallelize(list)
> val edgesRDD = distData.map(x => Edge(x._1, x._2, x._3))
> val distinctNodes = distData.flatMap{row => Iterable((row._1, ("A")),(row._2, 
> ("B")))}.distinct()
> val nodesRDD: RDD[(VertexId, (String))] = distinctNodes
> val graph = Graph(nodesRDD, edgesRDD)
> graph.persist()
> val ccGraph = graph.connectedComponents()
> ccGraph.cache
>
> val schema = StructType(Seq(StructField("id", LongType, false), 
> StructField("netid", LongType, false)))
> val 
> rdd=ccGraph.vertices.map(row=>(Row(row._1.asInstanceOf[Long],row._2.asInstanceOf[Long])))
> val builtEntityDF = sqlContext.createDataFrame(rdd, schema)
>  
> /*this unpersist step causes the issue*/
> ccGraph.unpersist()
>  
> /*write step hangs for ever*/
> builtEntityDF.write.format("parquet").mode("overwrite").save("/user/root/writetest.parquet")
>  
> If you take out the ccGraph.unpersist() step the write step completes 
> successfully



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100

2016-08-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17061.
---
Resolution: Duplicate

Search JIRA please, and don't set blocker

> Incorrect results returned following a join of two datasets and a map step 
> where total number of columns >100
> -
>
> Key: SPARK-17061
> URL: https://issues.apache.org/jira/browse/SPARK-17061
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jamie Hutton
>Priority: Blocker
>
> We have hit a consistent bug where we have a dataset with more than 100 
> columns. I am raising this as a blocker because Spark is returning the WRONG 
> results rather than erroring, leading to data integrity issues.
> I have put together the following test case, which will show the issue (it 
> will run in spark-shell). In this example I am joining a dataset with lots of 
> fields onto another dataset. 
> The join works fine, and if you show the dataset you will get the expected 
> result. However, if you run a map step over the dataset you end up with a 
> strange error where the sequence that is in the right dataset now only 
> contains the last value.
> Whilst this test may seem a rather contrived example, what we are doing here 
> is a very standard analytical pattern. My original code was designed to:
>  - take a dataset of child records
>  - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children])
>  - join the children onto the parent by parentID: giving 
> ((Parent), (ParentID, Seq[Children]))
>  - map over the result to give a tuple of (Parent, Seq[Children])
> Notes:
> - The issue is resolved by having fewer fields - as soon as we go <= 100, the 
> integrity issue goes away. Try removing one of the fields from BigCaseClass 
> below.
> - The issue arises based on the total number of fields in the resulting 
> dataset. Below I have a small case class and a big case class, but two case 
> classes of 50 variables would give the same issue.
> - The issue occurs where the case class being joined on (on the right) has a 
> case class type. It doesn't occur if you have a Seq[String].
> - If I go back to an RDD for the map step after the join I can work around the 
> issue, but I lose all the benefits of Datasets.
> Scala code test case:
>   case class Name(name: String)
>   case class SmallCaseClass (joinkey: Integer, names: Seq[Name])
>   case class BigCaseClass  (field1: Integer,field2: Integer,field3: 
> Integer,field4: Integer,field5: Integer,field6: Integer,field7: 
> Integer,field8: Integer,field9: Integer,field10: Integer,field11: 
> Integer,field12: Integer,field13: Integer,field14: Integer,field15: 
> Integer,field16: Integer,field17: Integer,field18: Integer,field19: 
> Integer,field20: Integer,field21: Integer,field22: Integer,field23: 
> Integer,field24: Integer,field25: Integer,field26: Integer,field27: 
> Integer,field28: Integer,field29: Integer,field30: Integer,field31: 
> Integer,field32: Integer,field33: Integer,field34: Integer,field35: 
> Integer,field36: Integer,field37: Integer,field38: Integer,field39: 
> Integer,field40: Integer,field41: Integer,field42: Integer,field43: 
> Integer,field44: Integer,field45: Integer,field46: Integer,field47: 
> Integer,field48: Integer,field49: Integer,field50: Integer,field51: 
> Integer,field52: Integer,field53: Integer,field54: Integer,field55: 
> Integer,field56: Integer,field57: Integer,field58: Integer,field59: 
> Integer,field60: Integer,field61: Integer,field62: Integer,field63: 
> Integer,field64: Integer,field65: Integer,field66: Integer,field67: 
> Integer,field68: Integer,field69: Integer,field70: Integer,field71: 
> Integer,field72: Integer,field73: Integer,field74: Integer,field75: 
> Integer,field76: Integer,field77: Integer,field78: Integer,field79: 
> Integer,field80: Integer,field81: Integer,field82: Integer,field83: 
> Integer,field84: Integer,field85: Integer,field86: Integer,field87: 
> Integer,field88: Integer,field89: Integer,field90: Integer,field91: 
> Integer,field92: Integer,field93: Integer,field94: Integer,field95: 
> Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer)
>   
> val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
> 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
> 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 
> 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 
> 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
> 91, 92, 93, 94, 95, 96, 97, 98, 99))
> 
> val smallCC=Seq(SmallCaseClass(1,Seq(
> Name("Jamie"), 
> Name("Ian"),
> 

[jira] [Assigned] (SPARK-16995) TreeNodeException when flat mapping RelationalGroupedDataset created from DataFrame containing a column created with lit/expr

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16995:


Assignee: (was: Apache Spark)

> TreeNodeException when flat mapping RelationalGroupedDataset created from 
> DataFrame containing a column created with lit/expr
> -
>
> Key: SPARK-16995
> URL: https://issues.apache.org/jira/browse/SPARK-16995
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cédric Perriard
>
> A TreeNodeException is thrown when executing the following minimal example in 
> Spark 2.0. The crucial point is that the column q is generated with lit/expr. 
> {code}
> import spark.implicits._
> case class test (x: Int, q: Int)
> val d = Seq(1).toDF("x")
> d.withColumn("q", lit(0)).as[test].groupByKey(_.x).flatMapGroups{case (x, 
> iter) => List()}.show
> d.withColumn("q", expr("0")).as[test].groupByKey(_.x).flatMapGroups{case (x, 
> iter) => List()}.show
> // this works fine
> d.withColumn("q", lit(0)).as[test].groupByKey(_.x).count()
> {code}
> The exception is: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: x#5
> A possible workaround is to write the dataframe to disk before grouping and 
> mapping.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
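
A minimal sketch of the "write to disk first" workaround mentioned in the description 
above, runnable in spark-shell; the output path is an arbitrary example, and the 
flatMapGroups body has been given a concrete return value for illustration.

{code}
// Sketch of the workaround: materialise the DataFrame to disk and read it back
// before grouping and flat-mapping. The path below is an arbitrary example.
import spark.implicits._
import org.apache.spark.sql.functions.lit

case class test(x: Int, q: Int)

val d = Seq(1).toDF("x").withColumn("q", lit(0))
d.write.mode("overwrite").parquet("/tmp/spark16995_workaround")

val reread = spark.read.parquet("/tmp/spark16995_workaround").as[test]
reread.groupByKey(_.x).flatMapGroups { case (x, iter) => iter.map(_.q) }.show
{code}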



[jira] [Assigned] (SPARK-16995) TreeNodeException when flat mapping RelationalGroupedDataset created from DataFrame containing a column created with lit/expr

2016-08-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16995:


Assignee: Apache Spark

> TreeNodeException when flat mapping RelationalGroupedDataset created from 
> DataFrame containing a column created with lit/expr
> -
>
> Key: SPARK-16995
> URL: https://issues.apache.org/jira/browse/SPARK-16995
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cédric Perriard
>Assignee: Apache Spark
>
> A TreeNodeException is thrown when executing the following minimal example in 
> Spark 2.0. The crucial point is that the column q is generated with lit/expr. 
> {code}
> import spark.implicits._
> case class test (x: Int, q: Int)
> val d = Seq(1).toDF("x")
> d.withColumn("q", lit(0)).as[test].groupByKey(_.x).flatMapGroups{case (x, 
> iter) => List()}.show
> d.withColumn("q", expr("0")).as[test].groupByKey(_.x).flatMapGroups{case (x, 
> iter) => List()}.show
> // this works fine
> d.withColumn("q", lit(0)).as[test].groupByKey(_.x).count()
> {code}
> The exception is: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: x#5
> A possible workaround is to write the dataframe to disk before grouping and 
> mapping.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16995) TreeNodeException when flat mapping RelationalGroupedDataset created from DataFrame containing a column created with lit/expr

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421146#comment-15421146
 ] 

Apache Spark commented on SPARK-16995:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/14648

> TreeNodeException when flat mapping RelationalGroupedDataset created from 
> DataFrame containing a column created with lit/expr
> -
>
> Key: SPARK-16995
> URL: https://issues.apache.org/jira/browse/SPARK-16995
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cédric Perriard
>
> A TreeNodeException is thrown when executing the following minimal example in 
> Spark 2.0. The crucial point is that the column q is generated with lit/expr. 
> {code}
> import spark.implicits._
> case class test (x: Int, q: Int)
> val d = Seq(1).toDF("x")
> d.withColumn("q", lit(0)).as[test].groupByKey(_.x).flatMapGroups{case (x, 
> iter) => List()}.show
> d.withColumn("q", expr("0")).as[test].groupByKey(_.x).flatMapGroups{case (x, 
> iter) => List()}.show
> // this works fine
> d.withColumn("q", lit(0)).as[test].groupByKey(_.x).count()
> {code}
> The exception is: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: x#5
> A possible workaround is to write the dataframe to disk before grouping and 
> mapping.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


