[jira] [Commented] (SPARK-12107) Update spark-ec2 versions

2015-12-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038561#comment-15038561
 ] 

Michael Armbrust commented on SPARK-12107:
--

Yeah, I was planning to do a bulk update if/when that happens.

> Update spark-ec2 versions
> -
>
> Key: SPARK-12107
> URL: https://issues.apache.org/jira/browse/SPARK-12107
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> spark-ec2's version strings are out-of-date. The latest versions of Spark 
> need to be reflected in its internal version maps.






[jira] [Commented] (SPARK-12083) java.lang.IllegalArgumentException: requirement failed: Overflowed precision (q98)

2015-12-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038829#comment-15038829
 ] 

Michael Armbrust commented on SPARK-12083:
--

I mean the first release candidate (RC1): 
http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/

> java.lang.IllegalArgumentException: requirement failed: Overflowed precision 
> (q98)
> --
>
> Key: SPARK-12083
> URL: https://issues.apache.org/jira/browse/SPARK-12083
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CentOS release 6.6 
>Reporter: Dileep Kumar
>  Labels: perfomance
>
> While running with 10 users, we found that one of the executors randomly hangs 
> during q98 execution. The behavior is random in that it happens at 
> different times, but always for the same query. Tried to get a stack trace but was not 
> successful in generating one.
> Here is the last exception that I saw before the hang:
> java.lang.IllegalArgumentException: requirement failed: Overflowed precision
>   at scala.Predef$.require(Predef.scala:233)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:111)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:335)
>   at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:388)
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getDecimal(JoinedRow.scala:95)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
> ===
> One of the other executor before this had the following exception:
> FetchFailed(BlockManagerId(10, d2412.halxg.cloudera.com, 45956), shuffleId=0, 
> mapId=212, reduceId=492, message=
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> d2412.halxg.cloudera.com/10.20.122.112:45956
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:321)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:306)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:173)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-02 Thread Michael Armbrust
You might also coalesce to 1 (or some small number) before writing, to avoid
creating a lot of files in that partition, if you know that there is not a
ton of data.
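
A minimal Scala sketch of that suggestion (a hedged illustration, not tested
against this exact workload; the DataFrame `df`, its "date" column, the
"/test/table" path, and the helper name come from or are made up for the
quoted example below):

import org.apache.spark.sql.{DataFrame, SaveMode}

// Collapse to a few partitions before the partitioned write so each
// partition directory ends up with only a handful of files.
def writeCoalesced(df: DataFrame): Unit = {
  df.coalesce(1)  // or some small number if the data is not tiny
    .write
    .mode(SaveMode.Append)
    .partitionBy("date")
    .save("/test/table")
}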

On Wed, Dec 2, 2015 at 12:59 AM, Rishi Mishra  wrote:

> As long as all your data is being inserted by Spark, hence using the same
> hash partitioner, what Fengdong mentioned should work.
>
> On Wed, Dec 2, 2015 at 9:32 AM, Fengdong Yu 
> wrote:
>
>> Hi
>> you can try:
>>
>> if your table is under location "/test/table/" on HDFS
>> and has partitions:
>>
>>  "/test/table/dt=2012"
>>  "/test/table/dt=2013"
>>
>> df.write.mode(SaveMode.Append).partitionBy("date").save("/test/table")
>>
>>
>>
>> On Dec 2, 2015, at 10:50 AM, Isabelle Phan  wrote:
>>
>> df.write.partitionBy("date").insertInto("my_table")
>>
>>
>>
>
>
> --
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>


[VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-02 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
1.6.0!

The vote is open until Saturday, December 5, 2015 at 21:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v1.6.0-rc1
(bf525845cef159d2d4c9f4d64e158f037179b5c4)

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1165/

The test repository (versioned as v1.6.0-rc1) for this release can be found
at:
https://repository.apache.org/content/repositories/orgapachespark-1164/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc1-docs/


===
== How can I help test this release? ==
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.


== What justifies a -1 vote for this release? ==

This vote is happening towards the end of the 1.6 QA period, so -1 votes
should only occur for significant regressions from 1.5. Bugs already
present in 1.5, minor regressions, or bugs related to new features will not
block this release.

===
== What should happen to JIRA tickets still targeting 1.6.0? ==
===
1. It is OK for documentation patches to target 1.6.0 and still go into
branch-1.6, since documentation will be published separately from the
release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
version.


==
== Major changes to help you focus your testing ==
==

Spark SQL

   - SPARK-10810 
   Session Management - The ability to create multiple isolated SQL
   Contexts that have their own configuration and default database.  This is
   turned on by default in the thrift server.
   - SPARK-   Dataset
   API - A type-safe API (similar to RDDs) that performs many operations on
   serialized binary data and code generation (i.e. Project Tungsten).
   - SPARK-1  Unified
   Memory Management - Shared memory for execution and caching instead of
   exclusive division of the regions.
   - SPARK-11197  SQL
   Queries on Files - Concise syntax for running SQL queries over files of
   any supported format without registering a table.
   - SPARK-11745  Reading
   non-standard JSON files - Added options to read non-standard JSON files
   (e.g. single-quotes, unquoted attributes)
   - SPARK-10412 
Per-operator
   Metrics for SQL Execution - Display statistics on a per-operator basis
   for memory usage and spilled data size.
   - SPARK-11329  Star
   (*) expansion for StructTypes - Makes it easier to nest and unnest
   arbitrary numbers of columns
   - SPARK-10917 ,
   SPARK-11149  In-memory
   Columnar Cache Performance - Significant (up to 14x) speed up when
   caching data that contains complex types in DataFrames or SQL.
   - SPARK-1  Fast
   null-safe joins - Joins using null-safe equality (<=>) will now execute
   using SortMergeJoin instead of computing a cartesian product.
   - SPARK-11389  SQL
   Execution Using Off-Heap Memory - Support for configuring query
   execution to occur using off-heap memory to avoid GC overhead
   - SPARK-10978  Datasource
   API Avoid Double Filter - When implementing a datasource with filter
   pushdown, developers can now tell Spark SQL to avoid double evaluating a
   pushed-down filter.
   - SPARK-4849   Advanced
   Layout of 

[jira] [Updated] (SPARK-12000) `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation

2015-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12000:
-
Target Version/s: 1.7.0  (was: 1.6.0)

> `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation
> -
>
> Key: SPARK-12000
> URL: https://issues.apache.org/jira/browse/SPARK-12000
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Josh Rosen
>Priority: Blocker
>
> Reported by [~josephkb]. Not sure what the root cause is, but this is the 
> error message when I ran "sbt publishLocal":
> {code}
> [error] (launcher/compile:doc) javadoc returned nonzero exit code
> [error] (mllib/compile:doc) scala.reflect.internal.FatalError:
> [error]  while compiling: 
> /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/util/modelSaveLoad.scala
> [error] during phase: global=terminal, atPhase=parser
> [error]  library version: version 2.10.5
> [error] compiler version: version 2.10.5
> [error]   reconstructed args: -Yno-self-type-checks -groups -classpath 
> /Users/meng/src/spark/core/target/scala-2.10/classes:/Users/meng/src/spark/launcher/target/scala-2.10/classes:/Users/meng/src/spark/network/common/target/scala-2.10/classes:/Users/meng/src/spark/network/shuffle/target/scala-2.10/classes:/Users/meng/src/spark/unsafe/target/scala-2.10/classes:/Users/meng/src/spark/streaming/target/scala-2.10/classes:/Users/meng/src/spark/sql/core/target/scala-2.10/classes:/Users/meng/src/spark/sql/catalyst/target/scala-2.10/classes:/Users/meng/src/spark/graphx/target/scala-2.10/classes:/Users/meng/.ivy2/cache/org.spark-project.spark/unused/jars/unused-1.0.0.jar:/Users/meng/.ivy2/cache/com.google.guava/guava/bundles/guava-14.0.1.jar:/Users/meng/.ivy2/cache/io.netty/netty-all/jars/netty-all-4.0.29.Final.jar:/Users/meng/.ivy2/cache/org.fusesource.leveldbjni/leveldbjni-all/bundles/leveldbjni-all-1.8.jar:/Users/meng/.ivy2/cache/com.fasterxml.jackson.core/jackson-databind/bundles/jackson-databind-2.4.4.jar:/Users/meng/.ivy2/cache/com.fasterxml.jackson.core/jackson-annotations/bundles/jackson-annotations-2.4.4.jar:/Users/meng/.ivy2/cache/com.fasterxml.jackson.core/jackson-core/bundles/jackson-core-2.4.4.jar:/Users/meng/.ivy2/cache/com.twitter/chill_2.10/jars/chill_2.10-0.5.0.jar:/Users/meng/.ivy2/cache/com.twitter/chill-java/jars/chill-java-0.5.0.jar:/Users/meng/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:/Users/meng/.ivy2/cache/com.esotericsoftware.reflectasm/reflectasm/jars/reflectasm-1.07-shaded.jar:/Users/meng/.ivy2/cache/com.esotericsoftware.minlog/minlog/jars/minlog-1.2.jar:/Users/meng/.ivy2/cache/org.objenesis/objenesis/jars/objenesis-1.2.jar:/Users/meng/.ivy2/cache/org.apache.avro/avro-mapred/jars/avro-mapred-1.7.7-hadoop2.jar:/Users/meng/.ivy2/cache/org.apache.avro/avro-ipc/jars/avro-ipc-1.7.7-tests.jar:/Users/meng/.ivy2/cache/org.apache.avro/avro-ipc/jars/avro-ipc-1.7.7.jar:/Users/meng/.ivy2/cache/org.apache.avro/avro/jars/avro-1.7.7.jar:/Users/meng/.ivy2/cache/org.codehaus.jackson/jackson-core-asl/jars/jackson-core-asl-1.9.13.jar:/Users/meng/.ivy2/cache/org.codehaus.jackson/jackson-mapper-asl/jars/jackson-mapper-asl-1.9.13.jar:/Users/meng/.ivy2/cache/org.apache.commons/commons-compress/jars/commons-compress-1.4.1.jar:/Users/meng/.ivy2/cache/org.tukaani/xz/jars/xz-1.0.jar:/Users/meng/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.7.10.jar:/Users/meng/.ivy2/cache/org.apache.xbean/xbean-asm5-shaded/bundles/xbean-asm5-shaded-4.4.jar:/Users/meng/.ivy2/cache/org.apache.hadoop/hadoop-client/jars/hadoop-client-2.2.0.jar:/Users/meng/.ivy2/cache/org.apache.hadoop/hadoop-common/jars/hadoop-common-2.2.0.jar:/Users/meng/.ivy2/cache/org.apache.hadoop/hadoop-annotations/jars/hadoop-annotations-2.2.0.jar:/Users/meng/.ivy2/cache/commons-cli/commons-cli/jars/commons-cli-1.2.jar:/Users/meng/.ivy2/cache/org.apache.commons/commons-math/jars/commons-math-2.1.jar:/Users/meng/.ivy2/cache/xmlenc/xmlenc/jars/xmlenc-0.52.jar:/Users/meng/.ivy2/cache/commons-httpclient/commons-httpclient/jars/commons-httpclient-3.1.jar:/Users/meng/.ivy2/cache/commons-net/commons-net/jars/commons-net-3.1.jar:/Users/meng/.ivy2/cache/log4j/log4j/bundles/log4j-1.2.17.jar:/Users/meng/.ivy2/cache/commons-lang/commons-lang/jars/commons-lang-2.5.jar:/Users/meng/.ivy2/cache/commons-configuration/commons-configuration/jars/commons-configuration-1.6.jar:/Users/meng/.ivy2/cache/commons-collections/commons-collections/jars/commons-collections-3.2.1.jar:/Users/meng/.ivy2/cache/commons-digester/commons-digester/jars/commons-digester-1.8.jar:/Users/meng/.ivy2/cache/commons

Re: When to cut RCs

2015-12-02 Thread Michael Armbrust
>
> Sorry for a second email so soon. I meant to also ask, what keeps the cost
> of making an RC high? Can we bring it down with better tooling?
>

There is a lot of tooling:
https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/

Still you have to check JIRA, sync with people who have been working on known
issues, run the Jenkins jobs (which take 1+ hours), and then write that
email, which has a bunch of links in it.  Short of automating the creation
of the email (PRs welcome!) I'm not sure what else you would automate.
That said, this is all I have done since I came into work today.


[jira] [Updated] (SPARK-12107) Update spark-ec2 versions

2015-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12107:
-
Target Version/s: 1.6.0

> Update spark-ec2 versions
> -
>
> Key: SPARK-12107
> URL: https://issues.apache.org/jira/browse/SPARK-12107
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> spark-ec2's version strings are out-of-date. The latest versions of Spark 
> need to be reflected in its internal version maps.






[jira] [Commented] (SPARK-12066) spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.* with join

2015-12-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036683#comment-15036683
 ] 

Michael Armbrust commented on SPARK-12066:
--

Can you reproduce this on 1.6-rc1?

> spark sql  throw java.lang.ArrayIndexOutOfBoundsException when use table.* 
> with join 
> -
>
> Key: SPARK-12066
> URL: https://issues.apache.org/jira/browse/SPARK-12066
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.2
> Environment: linux 
>Reporter: Ricky Yang
>Priority: Blocker
>
> java.lang.ArrayIndexOutOfBoundsException is thrown when I use the following Spark 
> SQL on Spark standalone or YARN.
> The SQL:
> select ta.* 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ; 
> But the result is correct when not using *, as follows:
> select ta.sale_dt 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ; 
> The standalone version is 1.4.0 and the Spark-on-YARN version is 1.5.2.
> Error log:
> 15/11/30 14:19:59 ERROR SparkSQLDriver: Failed in [select ta.* 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ] 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, namenode2-sit.cnsuning.com): java.lang.ArrayIndexOutOfBoundsException 
> Driver stacktrace: 
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>  
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270) 
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>  
> at scala.Option.foreach(Option.scala:236) 
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>  
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850) 
> at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215) 
> at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207) 
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:587)
>  
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
>  
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:308)
>  
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) 
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311) 
> at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:409) 
> at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:425) 
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166)
>  
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>  
> at sun.reflect.NativeMethodAcc

[jira] [Updated] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12089:
-
Target Version/s: 1.6.0

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>Priority: Critical
>
> When running a large Spark SQL query including multiple joins, I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the Spark code it looks like this is due to an integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> than 2 GB worth of data, and likely even less, since any length > 
> 1073741824 will cause an integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 
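
As a standalone illustration of the overflow described above (a sketch, not
Spark's actual code path; the object name is made up): doubling an Int length
above 1073741824 wraps negative, and allocating an array with that size fails
just like the trace in this report.

{code}
object BufferGrowthOverflowDemo {
  def main(args: Array[String]): Unit = {
    val length = 1200000000            // > 2^30, so doubling exceeds Int.MaxValue
    val doubled = length * 2           // Int overflow: prints -1894967296
    println(doubled)
    try {
      new Array[Byte](doubled)         // negative size
    } catch {
      case e: NegativeArraySizeException => println(s"caught: $e")
    }
  }
}
{code}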






[jira] [Updated] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12089:
-
Priority: Blocker  (was: Critical)

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>Priority: Blocker
>
> When running a large Spark SQL query including multiple joins, I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the Spark code it looks like this is due to an integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> than 2 GB worth of data, and likely even less, since any length > 
> 1073741824 will cause an integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 






[jira] [Updated] (SPARK-12108) Event logs are much bigger in 1.6 than in 1.5

2015-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12108:
-
Target Version/s: 1.6.0  (was: 1.6.1)

> Event logs are much bigger in 1.6 than in 1.5
> -
>
> Key: SPARK-12108
> URL: https://issues.apache.org/jira/browse/SPARK-12108
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> From running page rank, the event log in 1.5 is 1.3GB uncompressed, but in 
> 1.6 it's 6GB!
> From a preliminary bisect, this commit is suspect:
> https://github.com/apache/spark/commit/42d933fbba0584b39bd8218eafc44fb03aeb157d






[jira] [Updated] (SPARK-7264) SparkR API for parallel functions

2015-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7264:

Target Version/s:   (was: 1.6.0)

> SparkR API for parallel functions
> -
>
> Key: SPARK-7264
> URL: https://issues.apache.org/jira/browse/SPARK-7264
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is a JIRA to discuss design proposals for enabling parallel R 
> computation in SparkR without exposing the entire RDD API. 
> The rationale for this is that the RDD API has a number of low level 
> functions and we would like to expose a more light-weight API that is both 
> friendly to R users and easy to maintain.
> http://goo.gl/GLHKZI has a first cut design doc.






[jira] [Updated] (SPARK-9697) Project Tungsten (Spark 1.6)

2015-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9697:

Target Version/s:   (was: 1.6.0)

> Project Tungsten (Spark 1.6)
> 
>
> Key: SPARK-9697
> URL: https://issues.apache.org/jira/browse/SPARK-9697
> Project: Spark
>  Issue Type: Epic
>  Components: Block Manager, Shuffle, Spark Core, SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This epic tracks the 2nd phase of Project Tungsten, slotted for Spark 1.6 
> release.
> This epic tracks work items for Spark 1.6. More tickets can be found in:
> SPARK-7075: Tungsten-related work in Spark 1.5
> SPARK-9697: Tungsten-related work in Spark 1.6






Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-02 Thread Michael Armbrust
I'm going to kick the voting off with a +1 (binding).  We ran TPC-DS and
most queries are faster than 1.5.  We've also ported several production
pipelines to 1.6.


[jira] [Commented] (SPARK-12083) java.lang.IllegalArgumentException: requirement failed: Overflowed precision (q98)

2015-12-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036954#comment-15036954
 ] 

Michael Armbrust commented on SPARK-12083:
--

Can you test with 1.6-RC1?

> java.lang.IllegalArgumentException: requirement failed: Overflowed precision 
> (q98)
> --
>
> Key: SPARK-12083
> URL: https://issues.apache.org/jira/browse/SPARK-12083
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CentOS release 6.6 
>Reporter: Dileep Kumar
>  Labels: perfomance
>
> While running with 10 users, we found that one of the executors randomly hangs 
> during q98 execution. The behavior is random in that it happens at 
> different times, but always for the same query. Tried to get a stack trace but was not 
> successful in generating one.
> Here is the last exception that I saw before the hang:
> java.lang.IllegalArgumentException: requirement failed: Overflowed precision
>   at scala.Predef$.require(Predef.scala:233)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:111)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:335)
>   at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:388)
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getDecimal(JoinedRow.scala:95)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
> ===
> One of the other executor before this had the following exception:
> FetchFailed(BlockManagerId(10, d2412.halxg.cloudera.com, 45956), shuffleId=0, 
> mapId=212, reduceId=492, message=
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> d2412.halxg.cloudera.com/10.20.122.112:45956
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:321)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:306)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:173)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd

[jira] [Updated] (SPARK-12063) Group by Column Number identifier is not successfully parsed

2015-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12063:
-
Issue Type: New Feature  (was: Bug)

> Group by Column Number identifier is not successfully parsed
> 
>
> Key: SPARK-12063
> URL: https://issues.apache.org/jira/browse/SPARK-12063
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Derek Sabry
>Priority: Minor
>
> When performing a query of the form:
> select A
> from B
> group by 1
> 1 refers to the first column 'A', but this is not parsed correctly.
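
A small sketch of the two forms (a hypothetical sqlContext and a registered
table B with column A are assumed; per this report, the ordinal form is the
one that is not handled correctly):

{code}
// Ordinal form from the report: "1" is meant to refer to the first selected
// column, A, but is not parsed that way.
sqlContext.sql("select A from B group by 1")

// Explicit-column equivalent, which groups as expected.
sqlContext.sql("select A from B group by A")
{code}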






[jira] [Updated] (SPARK-12088) check connection.isClose before connection.getAutoCommit in JDBCRDD.close

2015-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12088:
-
Target Version/s: 1.6.0

> check connection.isClose before connection.getAutoCommit in JDBCRDD.close
> -
>
> Key: SPARK-12088
> URL: https://issues.apache.org/jira/browse/SPARK-12088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> In JDBCRDD, it has:
> if (!conn.getAutoCommit && !conn.isClosed) {
> try {
>   conn.commit()
> } 
> . . . . . .
> In my test, the connection is already closed, so conn.getAutoCommit throws an 
> Exception. We should check !conn.isClosed before checking !conn.getAutoCommit 
> to avoid the Exception. 
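
A minimal sketch of the check ordering proposed above (illustrative only; the
helper name is made up and this is not the actual JDBCRDD.close code):

{code}
import java.sql.Connection

// Test isClosed first so getAutoCommit is never called on a closed connection.
def commitIfNeeded(conn: Connection): Unit = {
  if (!conn.isClosed && !conn.getAutoCommit) {
    conn.commit()
  }
}
{code}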






[jira] [Updated] (SPARK-11868) wrong results returned from dataframe create from Rows without consistent schma on pyspark

2015-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11868:
-
Target Version/s:   (was: 1.6.0)

> wrong results returned from dataframe create from Rows without consistent 
> schma on pyspark
> --
>
> Key: SPARK-11868
> URL: https://issues.apache.org/jira/browse/SPARK-11868
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
> Environment: pyspark
>Reporter: Yuval Tanny
>
> When the schema is inconsistent (but is the same for the first 10 rows), it's 
> possible to create a dataframe from dictionaries, and if a key is missing, its 
> value is None. But when trying to create a dataframe from the corresponding rows, 
> we get inconsistent behavior (wrong values for keys) without an exception. See 
> the example below.
> The problems seem to be:
> 1. Not verifying all rows against the inferred schema.
> 2. In pyspark.sql.types._create_converter, None is set when converting a 
> dictionary and a field does not exist:
> {code}
> return tuple([conv(d.get(name)) for name, conv in zip(names, converters)])
> {code}
> But for Rows, it is just assumed that the number of fields in the tuple is equal 
> to the number of fields in the inferred schema, so wrong values are placed for the 
> wrong keys otherwise:
> {code}
> return tuple(conv(v) for v, conv in zip(obj, converters))
> {code}
> Thanks. 
> example:
> {code}
> dicts = [{'1':1,'2':2,'3':3}]*10+[{'1':1,'3':3}]
> rows = [pyspark.sql.Row(**r) for r in dicts]
> rows_rdd = sc.parallelize(rows)
> dicts_rdd = sc.parallelize(dicts)
> rows_df = sqlContext.createDataFrame(rows_rdd)
> dicts_df = sqlContext.createDataFrame(dicts_rdd)
> print(rows_df.select(['2']).collect()[10])
> print(dicts_df.select(['2']).collect()[10])
> {code}
> output:
> {code}
> Row(2=3)
> Row(2=None)
> {code}






Re: When to cut RCs

2015-12-02 Thread Michael Armbrust
Thanks for bringing this up Sean. I think we are all happy to adopt
concrete suggestions to make the release process more transparent,
including pinging the list before kicking off the release build.

Technically there's still a Blocker bug:
> https://issues.apache.org/jira/browse/SPARK-12000


Sorry, I misprioritized this particular issue when I thought that it was
going to block the release by causing the doc build to fail. When I
realized the failure was non-deterministic and isolated to OSX (i.e. the
official release build on Jenkins is not affected) I failed to update the
issue.  It doesn't show up on the dashboard that I've been using to track
the release, since it's labeled documentation.


> The 'race' doesn't matter that much, but release planning remains the
> real bug-bear here. There are still, for instance, 52 issues targeted
> at 1.6.0, 42 of which were raised and targeted by committers. For
> example, I count 6 'umbrella' JIRAs in ML alone that are still open.
>

This can be debated, but I explicitly ignored test and documentation
issues.  Since the docs are published separately and easy to update, I
don't think it's worth further disturbing the release cadence for these
JIRAs.


> The release is theoretically several weeks behind plan on what's
> intended to be a fixed release cycle too. This is why I'm not sure why
> today it's suddenly potentially ready for release.
>

Up until today various committers have told me that there were known issues
with branch-1.6 that would cause them to -1 the release.  Whenever this
happened, I asked them to ensure there was a properly targeted blocker JIRA
open so people could publicly track the status of the release.  As long as
such issues were open, I only published a preview since making an RC is
pretty high cost.

I'm sorry that it felt sudden to you, but as of last night all such known
issues were resolved and thus I cut a release as soon as this was the case.

I'm just curious, am I the only one that thinks this isn't roughly
> normal, or do other people manage releases this way? I know the real
> world is messy and this is better than in the past, but I still get
> surprised by how each 1.x release actually comes about.
>

I actually did spend quite a bit of time asking people to close various
umbrella issues, and I was pretty strict about watching JIRA throughout the
process.  Perhaps as an additional step, future preview releases or branch
cuts can include a link to an authoritative dashboard that we will use to
decide when we are ready to make an RC.  I'm also open to other suggestions.

Michael


[jira] [Commented] (SPARK-11873) Regression for TPC-DS query 63 when used with decimal datatype and windows function

2015-12-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034908#comment-15034908
 ] 

Michael Armbrust commented on SPARK-11873:
--

We did a lot of performance work in Spark 1.6 (e.g., [SPARK-11787]).

> Regression for TPC-DS query 63 when used with decimal datatype and windows 
> function
> ---
>
> Key: SPARK-11873
> URL: https://issues.apache.org/jira/browse/SPARK-11873
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Dileep Kumar
>  Labels: perfomance
> Attachments: 63.1.1, 63.1.5, 63.decimal_schema, 
> 63.decimal_schema_windows_function, 63.double_schema, 98.1.1, 98.1.5, 
> decimal_schema.sql, double_schema.sql
>
>
> When running the TPC-DS based queries for benchmarking Spark, we found that query 
> 63 (after making it similar to the original query) shows different behavior 
> compared to other queries, e.g. q98, which has a similar function.
> Here are the performance numbers (execution time in seconds):
>        1.1 Baseline   1.5   1.5 + Decimal
> q63    27             26    38
> q98    18             26    24
> As you can see, q63 is showing a regression compared to a similar query. I am 
> attaching both versions of the queries and the affected schemas. When adding the 
> window function back, this is the only query that seems to be slower than 1.1 in 
> 1.5.
> I have attached both versions of the schema and queries.






[jira] [Updated] (SPARK-12061) Persist for Map/filter with Lambda Functions don't always read from Cache

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12061:
-
Target Version/s: 1.7.0

> Persist for Map/filter with Lambda Functions don't always read from Cache
> -
>
> Key: SPARK-12061
> URL: https://issues.apache.org/jira/browse/SPARK-12061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, the existing caching mechanisms do not work on dataset operations 
> when using map/filter with lambda functions. For example, 
> {code}
>   test("persist and then map/filter with lambda functions") {
> val f = (i: Int) => i + 1
> val ds = Seq(1, 2, 3).toDS()
> val mapped = ds.map(f)
> mapped.cache()
> val mapped2 = ds.map(f)
> assertCached(mapped2)
>   }
> {code}






[jira] [Updated] (SPARK-12061) [SQL] Dataset API: Adding Persist for Map/filter with Lambda Functions

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12061:
-
Issue Type: Bug  (was: Improvement)

> [SQL] Dataset API: Adding Persist for Map/filter with Lambda Functions
> --
>
> Key: SPARK-12061
> URL: https://issues.apache.org/jira/browse/SPARK-12061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, the existing caching mechanisms do not work on dataset operations 
> when using map/filter with lambda functions. For example, 
> {code}
>   test("persist and then map/filter with lambda functions") {
> val f = (i: Int) => i + 1
> val ds = Seq(1, 2, 3).toDS()
> val mapped = ds.map(f)
> mapped.cache()
> val mapped2 = ds.map(f)
> assertCached(mapped2)
>   }
> {code}






[jira] [Updated] (SPARK-12000) `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12000:
-
Priority: Blocker  (was: Major)

> `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation
> -
>
> Key: SPARK-12000
> URL: https://issues.apache.org/jira/browse/SPARK-12000
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Josh Rosen
>Priority: Blocker
>
> Reported by [~josephkb]. Not sure what the root cause is, but this is the 
> error message when I ran "sbt publishLocal":
> {code}
> [error] (launcher/compile:doc) javadoc returned nonzero exit code
> [error] (mllib/compile:doc) scala.reflect.internal.FatalError:
> [error]  while compiling: 
> /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/util/modelSaveLoad.scala
> [error] during phase: global=terminal, atPhase=parser
> [error]  library version: version 2.10.5
> [error] compiler version: version 2.10.5
> [error]   reconstructed args: -Yno-self-type-checks -groups -classpath 
> /Users/meng/src/spark/core/target/scala-2.10/classes:/Users/meng/src/spark/launcher/target/scala-2.10/classes:/Users/meng/src/spark/network/common/target/scala-2.10/classes:/Users/meng/src/spark/network/shuffle/target/scala-2.10/classes:/Users/meng/src/spark/unsafe/target/scala-2.10/classes:/Users/meng/src/spark/streaming/target/scala-2.10/classes:/Users/meng/src/spark/sql/core/target/scala-2.10/classes:/Users/meng/src/spark/sql/catalyst/target/scala-2.10/classes:/Users/meng/src/spark/graphx/target/scala-2.10/classes:/Users/meng/.ivy2/cache/org.spark-project.spark/unused/jars/unused-1.0.0.jar:/Users/meng/.ivy2/cache/com.google.guava/guava/bundles/guava-14.0.1.jar:/Users/meng/.ivy2/cache/io.netty/netty-all/jars/netty-all-4.0.29.Final.jar:/Users/meng/.ivy2/cache/org.fusesource.leveldbjni/leveldbjni-all/bundles/leveldbjni-all-1.8.jar:/Users/meng/.ivy2/cache/com.fasterxml.jackson.core/jackson-databind/bundles/jackson-databind-2.4.4.jar:/Users/meng/.ivy2/cache/com.fasterxml.jackson.core/jackson-annotations/bundles/jackson-annotations-2.4.4.jar:/Users/meng/.ivy2/cache/com.fasterxml.jackson.core/jackson-core/bundles/jackson-core-2.4.4.jar:/Users/meng/.ivy2/cache/com.twitter/chill_2.10/jars/chill_2.10-0.5.0.jar:/Users/meng/.ivy2/cache/com.twitter/chill-java/jars/chill-java-0.5.0.jar:/Users/meng/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:/Users/meng/.ivy2/cache/com.esotericsoftware.reflectasm/reflectasm/jars/reflectasm-1.07-shaded.jar:/Users/meng/.ivy2/cache/com.esotericsoftware.minlog/minlog/jars/minlog-1.2.jar:/Users/meng/.ivy2/cache/org.objenesis/objenesis/jars/objenesis-1.2.jar:/Users/meng/.ivy2/cache/org.apache.avro/avro-mapred/jars/avro-mapred-1.7.7-hadoop2.jar:/Users/meng/.ivy2/cache/org.apache.avro/avro-ipc/jars/avro-ipc-1.7.7-tests.jar:/Users/meng/.ivy2/cache/org.apache.avro/avro-ipc/jars/avro-ipc-1.7.7.jar:/Users/meng/.ivy2/cache/org.apache.avro/avro/jars/avro-1.7.7.jar:/Users/meng/.ivy2/cache/org.codehaus.jackson/jackson-core-asl/jars/jackson-core-asl-1.9.13.jar:/Users/meng/.ivy2/cache/org.codehaus.jackson/jackson-mapper-asl/jars/jackson-mapper-asl-1.9.13.jar:/Users/meng/.ivy2/cache/org.apache.commons/commons-compress/jars/commons-compress-1.4.1.jar:/Users/meng/.ivy2/cache/org.tukaani/xz/jars/xz-1.0.jar:/Users/meng/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.7.10.jar:/Users/meng/.ivy2/cache/org.apache.xbean/xbean-asm5-shaded/bundles/xbean-asm5-shaded-4.4.jar:/Users/meng/.ivy2/cache/org.apache.hadoop/hadoop-client/jars/hadoop-client-2.2.0.jar:/Users/meng/.ivy2/cache/org.apache.hadoop/hadoop-common/jars/hadoop-common-2.2.0.jar:/Users/meng/.ivy2/cache/org.apache.hadoop/hadoop-annotations/jars/hadoop-annotations-2.2.0.jar:/Users/meng/.ivy2/cache/commons-cli/commons-cli/jars/commons-cli-1.2.jar:/Users/meng/.ivy2/cache/org.apache.commons/commons-math/jars/commons-math-2.1.jar:/Users/meng/.ivy2/cache/xmlenc/xmlenc/jars/xmlenc-0.52.jar:/Users/meng/.ivy2/cache/commons-httpclient/commons-httpclient/jars/commons-httpclient-3.1.jar:/Users/meng/.ivy2/cache/commons-net/commons-net/jars/commons-net-3.1.jar:/Users/meng/.ivy2/cache/log4j/log4j/bundles/log4j-1.2.17.jar:/Users/meng/.ivy2/cache/commons-lang/commons-lang/jars/commons-lang-2.5.jar:/Users/meng/.ivy2/cache/commons-configuration/commons-configuration/jars/commons-configuration-1.6.jar:/Users/meng/.ivy2/cache/commons-collections/commons-collections/jars/commons-collections-3.2.1.jar:/Users/meng/.ivy2/cache/commons-digester/commons-digester/jars/commons-digester-1.8.jar:/Users/meng/.ivy2/cache/commons-beanuti

[jira] [Updated] (SPARK-11932) trackStateByKey throws java.lang.IllegalArgumentException: requirement failed on restarting from checkpoint

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11932:
-
Priority: Critical  (was: Blocker)

> trackStateByKey throws java.lang.IllegalArgumentException: requirement failed 
> on restarting from checkpoint
> ---
>
> Key: SPARK-11932
> URL: https://issues.apache.org/jira/browse/SPARK-11932
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> The problem is that when recovering a streaming application using 
> trackStateByKey from DStream checkpoints, there is the following exception.
> Code 
> {code}
>   StreamingContext.getOrCreate(".", () => createContext(args))
>   ...
>   def createContext(args: Array[String]) : StreamingContext = {
> val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
> // Create the context with a 1 second batch size
> val ssc = new StreamingContext(sparkConf, Seconds(1))
> 
> ssc.checkpoint(".")
> // Initial RDD input to trackStateByKey
> val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), 
> ("world", 1)))
> // Create a ReceiverInputDStream on target ip:port and count the
> // words in input stream of \n delimited text (eg. generated by 'nc')
> val lines = ssc.socketTextStream(args(0), args(1).toInt)
> val words = lines.flatMap(_.split(" "))
> val wordDstream = words.map(x => (x, 1))
> // Update the cumulative count using updateStateByKey
> // This will give a DStream made of state (which is the cumulative count 
> of the words)
> val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], 
> state: State[Int]) => {
>   val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
>   val output = (word, sum)
>   state.update(sum)
>   Some(output)
> }
> val stateDstream = wordDstream.trackStateByKey(
>   StateSpec.function(trackStateFunc).initialState(initialRDD))
> stateDstream.print()
> 
> ssc
>   
>   }
> {code}
> Error 
> {code}
> 15/11/23 10:55:07 ERROR StreamingContext: Error starting the context, marking 
> it as stopped
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.streaming.rdd.TrackStateRDD.<init>(TrackStateRDD.scala:133)
> at 
> org.apache.spark.streaming.dstream.InternalTrackStateDStream$$anonfun$compute$2.apply(TrackStateDStream.scala:148)
> at 
> org.apache.spark.streaming.dstream.InternalTrackStateDStream$$anonfun$compute$2.apply(TrackStateDStream.scala:143)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.streaming.dstream.InternalTrackStateDStream.compute(TrackStateDStream.scala:143)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:424)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
> at scala.Option.orElse(Option.scala:257)
> at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
> at 
> org.apache.spark.streaming.dstream.TrackStateDStreamImpl.compute(TrackStateDStream.scala:66)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.

[jira] [Resolved] (SPARK-11503) SQL API audit for Spark 1.6

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11503.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> SQL API audit for Spark 1.6
> ---
>
> Key: SPARK-11503
> URL: https://issues.apache.org/jira/browse/SPARK-11503
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 1.6.0
>
>
> Umbrella ticket to walk through all newly introduced APIs to make sure they 
> are consistent.






[jira] [Updated] (SPARK-11780) Provide type aliases in org.apache.spark.sql.types for backwards compatibility

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11780:
-
Assignee: Santiago M. Mola

> Provide type aliases in org.apache.spark.sql.types for backwards compatibility
> --
>
> Key: SPARK-11780
> URL: https://issues.apache.org/jira/browse/SPARK-11780
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Santiago M. Mola
>Assignee: Santiago M. Mola
>
> With SPARK-11273, ArrayData, MapData and others were moved from  
> org.apache.spark.sql.types to org.apache.spark.sql.catalyst.util.
> Since this is a backward incompatible change, it would be good to provide 
> type aliases from the old package (deprecated) to the new one.
> For example:
> {code}
> package object types {
>@deprecated
>type ArrayData = org.apache.spark.sql.catalyst.util.ArrayData
> }
> {code}
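> A quick way to exercise such an alias once it exists (sketch only; the alias 
> above is the proposal, not a current API):
> {code}
> // Old import path keeps compiling, just with a deprecation warning.
> import org.apache.spark.sql.types.ArrayData
> def describe(data: ArrayData): String = s"array with ${data.numElements()} elements"
> {code}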



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11596:
-
Assignee: Yin Huai

> SQL execution very slow for nested query plans because of 
> DataFrame.withNewExecutionId
> --
>
> Key: SPARK-11596
> URL: https://issues.apache.org/jira/browse/SPARK-11596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cristian
>Assignee: Yin Huai
> Attachments: screenshot-1.png
>
>
> For nested query plans like a recursive unionAll, withExecutionId is 
> extremely slow, likely because of repeated string concatenation in 
> QueryPlan.simpleString
> Test case:
> {code}
> (1 to 100).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
> println(s"PROCESSING >>>>>>>>>>> $idx")
> val df = sqlContext.sparkContext.parallelize((0 to 
> 10).zipWithIndex).toDF("A", "B")
> val union = curr.map(_.unionAll(df)).getOrElse(df)
> union.cache()
> println(">>" + union.count)
> //union.show()
> Some(union)
>   }
> {code}
> Stack trace:
> {quote}
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> scala.collection.AbstractIterator.addString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
> org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
> org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
> org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12061) Persist for Map/filter with Lambda Functions don't always read from Cache

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12061:
-
Summary: Persist for Map/filter with Lambda Functions don't always read 
from Cache  (was: Persist for Map/filter with Lambda Functions)

> Persist for Map/filter with Lambda Functions don't always read from Cache
> -
>
> Key: SPARK-12061
> URL: https://issues.apache.org/jira/browse/SPARK-12061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, the existing caching mechanisms do not work on dataset operations 
> when using map/filter with lambda functions. For example, 
> {code}
>   test("persist and then map/filter with lambda functions") {
> val f = (i: Int) => i + 1
> val ds = Seq(1, 2, 3).toDS()
> val mapped = ds.map(f)
> mapped.cache()
> val mapped2 = ds.map(f)
> assertCached(mapped2)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11352) codegen.GeneratePredicate fails due to unquoted comment

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11352:
-
Assignee: Yin Huai

> codegen.GeneratePredicate fails due to unquoted comment
> ---
>
> Key: SPARK-11352
> URL: https://issues.apache.org/jira/browse/SPARK-11352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Rares Mirica
>Assignee: Yin Huai
>
> Somehow the code being generated ends up having comments with 
> comment terminators left unquoted, e.g.:
> /* ((input[35, StringType] <= 
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8) && 
> (text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 <= input[36, 
> StringType])) */
> with emphasis on ... =0.9,*/...
> This leads to an org.codehaus.commons.compiler.CompileException.
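> A minimal illustration of the kind of escaping that would avoid this (purely a 
> sketch, not the actual codegen fix; toCommentSafe is a made-up name):
> {code}
> // Break up any "*/" in the expression text so it can never terminate the
> // generated block comment early.
> def toCommentSafe(expr: String): String =
>   "/* " + expr.replace("*/", "*\\/") + " */"
> {code}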



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12061) Persist for Map/filter with Lambda Functions

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12061:
-
Summary: Persist for Map/filter with Lambda Functions  (was: [SQL] Dataset 
API: Adding Persist for Map/filter with Lambda Functions)

> Persist for Map/filter with Lambda Functions
> 
>
> Key: SPARK-12061
> URL: https://issues.apache.org/jira/browse/SPARK-12061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, the existing caching mechanisms do not work on dataset operations 
> when using map/filter with lambda functions. For example, 
> {code}
>   test("persist and then map/filter with lambda functions") {
> val f = (i: Int) => i + 1
> val ds = Seq(1, 2, 3).toDS()
> val mapped = ds.map(f)
> mapped.cache()
> val mapped2 = ds.map(f)
> assertCached(mapped2)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11596.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10079
[https://github.com/apache/spark/pull/10079]

> SQL execution very slow for nested query plans because of 
> DataFrame.withNewExecutionId
> --
>
> Key: SPARK-11596
> URL: https://issues.apache.org/jira/browse/SPARK-11596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cristian
>Assignee: Yin Huai
> Fix For: 1.6.0
>
> Attachments: screenshot-1.png
>
>
> For nested query plans like a recursive unionAll, withExecutionId is 
> extremely slow, likely because of repeated string concatenation in 
> QueryPlan.simpleString
> Test case:
> {code}
> (1 to 100).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
> println(s"PROCESSING >>>>>>>>>>> $idx")
> val df = sqlContext.sparkContext.parallelize((0 to 
> 10).zipWithIndex).toDF("A", "B")
> val union = curr.map(_.unionAll(df)).getOrElse(df)
> union.cache()
> println(">>" + union.count)
> //union.show()
> Some(union)
>   }
> {code}
> Stack trace:
> {quote}
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> scala.collection.AbstractIterator.addString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
> org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
> org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
> org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12046) Visibility and format issues in ScalaDoc/JavaDoc for branch-1.6

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12046.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10063
[https://github.com/apache/spark/pull/10063]

> Visibility and format issues in ScalaDoc/JavaDoc for branch-1.6
> ---
>
> Key: SPARK-12046
> URL: https://issues.apache.org/jira/browse/SPARK-12046
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12068) use a single column in Dataset.groupBy and count will fail

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12068.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10059
[https://github.com/apache/spark/pull/10059]

> use a single column in Dataset.groupBy and count will fail
> --
>
> Key: SPARK-12068
> URL: https://issues.apache.org/jira/browse/SPARK-12068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>
> {code}
> val ds = Seq("a" -> 1, "b" -> 1, "a" -> 2).toDS()
> val count = ds.groupBy($"_1").count()
> count.collect() // will fail
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11856) add type cast if the real type is different but compatible with encoder schema

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11856:
-
Assignee: Wenchen Fan

> add type cast if the real type is different but compatible with encoder schema
> --
>
> Key: SPARK-11856
> URL: https://issues.apache.org/jira/browse/SPARK-11856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11856) add type cast if the real type is different but compatible with encoder schema

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11856.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9840
[https://github.com/apache/spark/pull/9840]

> add type cast if the real type is different but compatible with encoder schema
> --
>
> Key: SPARK-11856
> URL: https://issues.apache.org/jira/browse/SPARK-11856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11780) Provide type aliases in org.apache.spark.sql.types for backwards compatibility

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11780:
-
Target Version/s: 1.6.0

> Provide type aliases in org.apache.spark.sql.types for backwards compatibility
> --
>
> Key: SPARK-11780
> URL: https://issues.apache.org/jira/browse/SPARK-11780
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Santiago M. Mola
>
> With SPARK-11273, ArrayData, MapData and others were moved from  
> org.apache.spark.sql.types to org.apache.spark.sql.catalyst.util.
> Since this is a backward incompatible change, it would be good to provide 
> type aliases from the old package (deprecated) to the new one.
> For example:
> {code}
> package object types {
>@deprecated
>type ArrayData = org.apache.spark.sql.catalyst.util.ArrayData
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11954) Encoder for JavaBeans / POJOs

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11954.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9937
[https://github.com/apache/spark/pull/9937]

> Encoder for JavaBeans / POJOs
> -
>
> Key: SPARK-11954
> URL: https://issues.apache.org/jira/browse/SPARK-11954
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11905) [SQL] Support Persist/Cache and Unpersist in Dataset APIs

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11905.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9889
[https://github.com/apache/spark/pull/9889]

> [SQL] Support Persist/Cache and Unpersist in Dataset APIs
> -
>
> Key: SPARK-11905
> URL: https://issues.apache.org/jira/browse/SPARK-11905
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
> Fix For: 1.6.0
>
>
> Introducing Persist/Cache and Unpersist into Dataset APIs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Getting all files of a table

2015-12-01 Thread Michael Armbrust
sqlContext.table("...").inputFiles

(this is best effort, but should work for hive tables).
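
For the protobuf use case quoted below, a rough sketch of combining the two
(my_table is a placeholder, and this assumes ProtoParquetRDD accepts a
comma-separated list of paths; if it only takes a single path, build one RDD
per file and union them):

  val files = sqlContext.table("my_table").inputFiles   // Array[String] of data files
  val protobufsRdd = new ProtoParquetRDD(sc, files.mkString(","), classOf[MyProto])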

Michael

On Tue, Dec 1, 2015 at 10:55 AM, Krzysztof Zarzycki 
wrote:

> Hi there,
> Do you know how I can easily get a list of all the files of a Hive table?
>
> What I want to achieve is to get all the files that are underneath a parquet
> table and, using the sparksql-protobuf [1] library (really handy library!) and
> its helper class ProtoParquetRDD:
>
> val protobufsRdd = new ProtoParquetRDD(sc, "files", classOf[MyProto])
>
> access the underlying parquet files as normal protocol buffers. But I don't
> know how to get them. When I pointed the call above at one file by hand, it
> worked well.
> The parquet table was created with the same library and its implicit
> hiveContext extension createDataFrame, which creates a DataFrame based on a
> protocol buffer class.
>
> (The reverse read operation is needed to support legacy code: after converting
> protocol buffers to parquet, I still want some code to access the parquet as
> normal protocol buffers.)
>
> Maybe someone has another way to get an RDD of protocol buffers from a
> Parquet-stored table.
>
> [1] https://github.com/saurfang/sparksql-protobuf
>
> Thanks!
> Krzysztof
>
>
>
>


[jira] [Updated] (SPARK-8414) Ensure ContextCleaner actually triggers clean ups

2015-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8414:

Target Version/s:   (was: 1.6.0)

> Ensure ContextCleaner actually triggers clean ups
> -
>
> Key: SPARK-8414
> URL: https://issues.apache.org/jira/browse/SPARK-8414
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> Right now it cleans up old references only through natural GCs, which may not 
> occur if the driver has infinite RAM. We should do a periodic GC to make sure 
> that we actually do clean things up. Something like once per 30 minutes seems 
> relatively inexpensive.
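> A minimal sketch of the periodic-GC idea (illustrative only, not the actual 
> ContextCleaner code):
> {code}
> import java.util.concurrent.{Executors, TimeUnit}
>
> val periodicGCIntervalMinutes = 30L  // "once per 30 minutes", as suggested above
> val gcScheduler = Executors.newSingleThreadScheduledExecutor()
> gcScheduler.scheduleAtFixedRate(new Runnable {
>   // Force a GC so that weak references tracked by ContextCleaner get enqueued
>   // even when the driver heap is large enough that GC never happens on its own.
>   override def run(): Unit = System.gc()
> }, periodicGCIntervalMinutes, periodicGCIntervalMinutes, TimeUnit.MINUTES)
> {code}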



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032403#comment-15032403
 ] 

Michael Armbrust commented on SPARK-12032:
--

The standard algorithm for join reordering should handle this case, but adding 
it in is probably going to be a bit of work.

> Filter can't be pushed down to correct Join because of bad order of Join
> 
>
> Key: SPARK-12032
> URL: https://issues.apache.org/jira/browse/SPARK-12032
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Priority: Critical
>
> For this query:
> {code}
>   select d.d_year, count(*) cnt
>FROM store_sales, date_dim d, customer c
>WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = 
> d.d_date_sk
>group by d.d_year
> {code}
> Current optimized plan is
> {code}
> == Optimized Logical Plan ==
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && 
> (c_first_shipto_date_sk#106 = d_date_sk#141)))
>Project [d_date_sk#141,d_year#147,ss_customer_sk#283]
> Join Inner, None
>  Project [ss_customer_sk#283]
>   Relation[] ParquetRelation[store_sales]
>  Project [d_date_sk#141,d_year#147]
>   Relation[] ParquetRelation[date_dim]
>Project [c_customer_sk#101,c_first_shipto_date_sk#106]
> Relation[] ParquetRelation[customer]
> {code}
> It will join store_sales and date_dim together without any condition; the 
> condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed down to that 
> join because of the bad order of the joins.
> The optimizer should re-order the joins (join date_dim after customer); then 
> it can push down the condition correctly.
> The plan should be 
> {code}
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141))
>Project [c_first_shipto_date_sk#106]
> Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101))
>  Project [ss_customer_sk#283]
>   Relation[store_sales]
>  Project [c_first_shipto_date_sk#106,c_customer_sk#101]
>   Relation[customer]
>Project [d_year#147,d_date_sk#141]
> Relation[date_dim]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11553) row.getInt(i) if row[i]=null returns 0

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11553:
-
Labels: releasenotes  (was: )

> row.getInt(i) if row[i]=null returns 0
> --
>
> Key: SPARK-11553
> URL: https://issues.apache.org/jira/browse/SPARK-11553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tofigh
>Assignee: Bartlomiej Alberski
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> row.getInt|Float|Double on a Spark SQL Row returns 0 if row[index] is null 
> (even though, according to the documentation, they should throw an exception).
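> A small illustration of the reported behavior and the usual workaround (sketch 
> against the affected versions):
> {code}
> import org.apache.spark.sql.Row
>
> val row = Row.fromSeq(Seq(null))
> row.getInt(0)                                                  // yields 0 instead of failing
> val safe = if (row.isNullAt(0)) None else Some(row.getInt(0))  // explicit null check
> {code}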



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11966) Spark API for UDTFs

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032877#comment-15032877
 ] 

Michael Armbrust commented on SPARK-11966:
--

Have you seen 
[explode|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1146]?
Does this do what you want, or is something missing?

> Spark API for UDTFs
> ---
>
> Key: SPARK-11966
> URL: https://issues.apache.org/jira/browse/SPARK-11966
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Jaka Jancar
>Priority: Minor
>
> Defining UDFs is easy using sqlContext.udf.register, but not table-generating 
> functions. For those you still have to use these horrendous Hive interfaces:
> https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11941) JSON representation of nested StructTypes could be more uniform

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11941:
-
Issue Type: Improvement  (was: Bug)

> JSON representation of nested StructTypes could be more uniform
> ---
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings, wanted to recurse down this 
> representation, which would be nicer if it was uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes could be more uniform

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032899#comment-15032899
 ] 

Michael Armbrust commented on SPARK-11941:
--

/cc [~lian cheng]


> JSON representation of nested StructTypes could be more uniform
> ---
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings, wanted to recurse down this 
> representation, which would be nicer if it was uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12032:
-
Issue Type: Improvement  (was: Bug)

> Filter can't be pushed down to correct Join because of bad order of Join
> 
>
> Key: SPARK-12032
> URL: https://issues.apache.org/jira/browse/SPARK-12032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Priority: Critical
>
> For this query:
> {code}
>   select d.d_year, count(*) cnt
>FROM store_sales, date_dim d, customer c
>WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = 
> d.d_date_sk
>group by d.d_year
> {code}
> Current optimized plan is
> {code}
> == Optimized Logical Plan ==
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && 
> (c_first_shipto_date_sk#106 = d_date_sk#141)))
>Project [d_date_sk#141,d_year#147,ss_customer_sk#283]
> Join Inner, None
>  Project [ss_customer_sk#283]
>   Relation[] ParquetRelation[store_sales]
>  Project [d_date_sk#141,d_year#147]
>   Relation[] ParquetRelation[date_dim]
>Project [c_customer_sk#101,c_first_shipto_date_sk#106]
> Relation[] ParquetRelation[customer]
> {code}
> It will join store_sales and date_dim together without any condition; the 
> condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed down to that 
> join because of the bad order of the joins.
> The optimizer should re-order the joins (join date_dim after customer); then 
> it can push down the condition correctly.
> The plan should be 
> {code}
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141))
>Project [c_first_shipto_date_sk#106]
> Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101))
>  Project [ss_customer_sk#283]
>   Relation[store_sales]
>  Project [c_first_shipto_date_sk#106,c_customer_sk#101]
>   Relation[customer]
>Project [d_year#147,d_date_sk#141]
> Relation[date_dim]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes is incorrect

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032889#comment-15032889
 ] 

Michael Armbrust commented on SPARK-11941:
--

While I can appreciate that this might be nicer if it was flat, I don't think 
that changing it at this point is worth the cost.  This is a stable 
representation that we persist with data.  As such, if we change it we are 
going to have to support parsing both representations forever.

> JSON representation of nested StructTypes is incorrect
> --
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings, wanted to recurse down this 
> representation, which would be nicer if it was uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11941) JSON representation of nested StructTypes could be more uniform

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11941:
-
Summary: JSON representation of nested StructTypes could be more uniform  
(was: JSON representation of nested StructTypes is incorrect)

> JSON representation of nested StructTypes could be more uniform
> ---
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings, wanted to recurse down this 
> representation, which would be nicer if it was uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11966) Spark API for UDTFs

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032925#comment-15032925
 ] 

Michael Armbrust commented on SPARK-11966:
--

Ah, I was proposing the DataFrame function explode as it gives you something 
very close to UDTFs.  However, if you want to be able to use the functions in 
pure SQL then that's not going to be sufficient.
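
For reference, a rough sketch of that explode variant (assuming a SQLContext 
named sqlContext is in scope; the column names are just for illustration):

{code}
import sqlContext.implicits._

val df = Seq(("a b c", 1), ("d e", 2)).toDF("text", "id")
// Each input row can produce zero or more output rows, which is the UDTF-style behavior.
val exploded = df.explode("text", "word") { text: String => text.split(" ").toSeq }
exploded.show()
{code}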

> Spark API for UDTFs
> ---
>
> Key: SPARK-11966
> URL: https://issues.apache.org/jira/browse/SPARK-11966
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Jaka Jancar
>Priority: Minor
>
> Defining UDFs is easy using sqlContext.udf.register, but not table-generating 
> functions. For those you still have to use these horrendous Hive interfaces:
> https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes could be more uniform

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032946#comment-15032946
 ] 

Michael Armbrust commented on SPARK-11941:
--

Sorry, maybe I'm misunderstanding.  Can you construct a case where we
serialize the case class representation to and from json and we lose
information?

If you can, then I agree this is a bug and we should fix it.  Otherwise, it
seems like an inconvenience.



> JSON representation of nested StructTypes could be more uniform
> ---
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings, wanted to recurse down this 
> representation, which would be nicer if it was uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11873) Regression for TPC-DS query 63 when used with decimal datatype and windows function

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032921#comment-15032921
 ] 

Michael Armbrust commented on SPARK-11873:
--

What about with Spark 1.6?

> Regression for TPC-DS query 63 when used with decimal datatype and windows 
> function
> ---
>
> Key: SPARK-11873
> URL: https://issues.apache.org/jira/browse/SPARK-11873
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Dileep Kumar
>  Labels: perfomance
> Attachments: 63.1.1, 63.1.5, 63.decimal_schema, 
> 63.decimal_schema_windows_function, 63.double_schema, 98.1.1, 98.1.5, 
> decimal_schema.sql, double_schema.sql
>
>
> When running the TPC-DS based queries to benchmark Spark, we found that query 
> 63 (after making it similar to the original query) shows different behavior 
> compared to other queries, e.g. q98, which uses a similar function.
> Here are the performance numbers (execution time in seconds):
>        1.1 Baseline   1.5   1.5 + Decimal
> q63    27             26    38
> q98    18             26    24
> As you can see, q63 shows a regression compared to a similar query. I am 
> attaching both versions of the queries and the affected schemas. When the 
> window function is added back, this is the only query that seems to be slower 
> in 1.5 than in 1.1.
> I have attached both versions of the schema and the queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12030:
-
Target Version/s: 1.6.0
Priority: Blocker  (was: Critical)

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Blocker
> Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
>
>
> I have the following issue.
> I created 2 DataFrames from JDBC (MySQL) and joined them (t1 has fk1 to t2):
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so the results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins - I counted distinct id1 from the joined table:
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> The results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but this query returns *1*:
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12018) Refactor common subexpression elimination code

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12018.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10009
[https://github.com/apache/spark/pull/10009]

> Refactor common subexpression elimination code
> --
>
> Key: SPARK-12018
> URL: https://issues.apache.org/jira/browse/SPARK-12018
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 1.6.0
>
>
> The code of common subexpression elimination can be factored and simplified. 
> Some unnecessary variables can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11315) Add YARN extension service to publish Spark events to YARN timeline service

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11315:
-
Target Version/s:   (was: 1.6.0)

> Add YARN extension service to publish Spark events to YARN timeline service
> ---
>
> Key: SPARK-11315
> URL: https://issues.apache.org/jira/browse/SPARK-11315
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Hadoop 2.6+
>Reporter: Steve Loughran
>
> Add an extension service (using SPARK-11314) to subscribe to Spark lifecycle 
> events, batch them and forward them to the YARN Application Timeline Service. 
> This data can then be retrieved by a new back end for the Spark History 
> Service, and by other analytics tools.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11796) Docker JDBC integration tests fail in Maven build due to dependency issue

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11796:
-
Component/s: Tests

> Docker JDBC integration tests fail in Maven build due to dependency issue
> -
>
> Key: SPARK-11796
> URL: https://issues.apache.org/jira/browse/SPARK-11796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>
> Our new Docker integration tests for JDBC dialects are failing in the Maven 
> builds. For now, I've disabled this for Maven by adding the 
> {{-Dtest.exclude.tags=org.apache.spark.tags.DockerTest}} flag to our Jenkins 
> builds, but we should fix this soon. The test failures seem to be related to 
> dependency or classpath issues:
> {code}
> *** RUN ABORTED ***
>   java.lang.NoSuchMethodError: 
> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.(ApacheConnector.java:240)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at 
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at 
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at 
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at 
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> {code}
> To reproduce locally: {{build/mvn -pl docker-integration-tests package}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11601) ML 1.6 QA: API: Binary incompatible changes

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11601:
-
Component/s: Documentation

> ML 1.6 QA: API: Binary incompatible changes
> ---
>
> Key: SPARK-11601
> URL: https://issues.apache.org/jira/browse/SPARK-11601
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Timothy Hunter
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, ping [~mengxr] for advice since he did it for 
> 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11954) Encoder for JavaBeans / POJOs

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11954:
-
Assignee: Wenchen Fan

> Encoder for JavaBeans / POJOs
> -
>
> Key: SPARK-11954
> URL: https://issues.apache.org/jira/browse/SPARK-11954
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12031) Integer overflow when do sampling.

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12031:
-
Description: 
In my case, some partitions contain too many items. When doing range 
partitioning, an exception is thrown:

{code}
java.lang.IllegalArgumentException: n must be positive
at java.util.Random.nextInt(Random.java:300)
at 
org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58)
at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

  was:
In my case, some partitions contain too much items. When do range partition, 
exception thrown as:


java.lang.IllegalArgumentException: n must be positive
at java.util.Random.nextInt(Random.java:300)
at 
org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58)
at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


> Integer overflow when do sampling.
> --
>
> Key: SPARK-12031
> URL: https://issues.apache.org/jira/browse/SPARK-12031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1, 1.5.2
>Reporter: uncleGen
>
> In my case, some partitions contain too many items. When doing range 
> partitioning, an exception is thrown:
> {code}
> java.lang.IllegalArgumentException: n must be positive
> at java.util.Random.nextInt(Random.java:300)
> at 
> org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58)
> at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
> at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8966) Design a mechanism to ensure that temporary files created in tasks are cleaned up after failures

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8966:

Target Version/s:   (was: 1.6.0)

> Design a mechanism to ensure that temporary files created in tasks are 
> cleaned up after failures
> 
>
> Key: SPARK-8966
> URL: https://issues.apache.org/jira/browse/SPARK-8966
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>
> It's important to avoid leaking temporary files, such as spill files created 
> by the external sorter.  Individual operators should still make an effort to 
> clean up their own files / perform their own error handling, but I think that 
> we should add a safety-net mechanism to track file creation on a per-task 
> basis and automatically clean up leaked files.
> During tests, this mechanism should throw an exception when a leak is 
> detected. In production deployments, it should log a warning and clean up the 
> leak itself.  This is similar to the TaskMemoryManager's leak detection and 
> cleanup code.
> We may be able to implement this via a convenience method that registers task 
> completion handlers with TaskContext.
> We might also explore techniques that will cause files to be cleaned up 
> automatically when their file descriptors are closed (e.g. by calling unlink 
> on an open file). These techniques should not be our last line of defense 
> against file resource leaks, though, since they might be platform-specific 
> and may clean up resources later than we'd like.
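
A minimal sketch of the safety-net idea, using a hypothetical helper object (nothing here beyond TaskContext itself is existing Spark code): files created through the helper are registered with the running task and removed via a task-completion callback, whether the task succeeded or failed.

{code}
// Hypothetical helper, not actual Spark code: track temp files per task and
// clean them up via a TaskContext completion callback.
import java.io.File
import org.apache.spark.TaskContext

object TempFileTracker {
  def createTrackedTempFile(prefix: String): File = {
    val file = File.createTempFile(prefix, ".tmp")
    val ctx = TaskContext.get()
    if (ctx != null) {
      ctx.addTaskCompletionListener { _ =>
        // Runs on both success and failure; a test build could throw here instead.
        if (file.exists() && !file.delete()) {
          System.err.println(s"Leaked temporary file: ${file.getAbsolutePath}")
        }
      }
    }
    file
  }
}
{code}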



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11600) Spark MLlib 1.6 QA umbrella

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11600:
-
Component/s: Documentation

> Spark MLlib 1.6 QA umbrella
> ---
>
> Key: SPARK-11600
> URL: https://issues.apache.org/jira/browse/SPARK-11600
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next MLlib release's QA period.
> h2. API
> * Check binary API compatibility (SPARK-11601)
> * Audit new public APIs (from the generated html doc)
> ** Scala (SPARK-11602)
> ** Java compatibility (SPARK-11605)
> ** Python coverage (SPARK-11604)
> * Check Experimental, DeveloperApi tags (SPARK-11603)
> h2. Algorithms and performance
> *Performance*
> * _List any other missing performance tests from spark-perf here_
> * ALS.recommendAll (SPARK-7457)
> * perf-tests in Python (SPARK-7539)
> * perf-tests for transformers (SPARK-2838)
> * MultilayerPerceptron (SPARK-11911)
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide (SPARK-11606)
> * For major components, create JIRAs for example code (SPARK-9670)
> * Update Programming Guide for 1.6 (towards end of QA) (SPARK-11608)
> * Update website (SPARK-11607)
> * Merge duplicate content under examples/ (SPARK-11685)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8414) Ensure ContextCleaner actually triggers clean ups

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033103#comment-15033103
 ] 

Michael Armbrust commented on SPARK-8414:
-

Still planning to do this for 1.6?

> Ensure ContextCleaner actually triggers clean ups
> -
>
> Key: SPARK-8414
> URL: https://issues.apache.org/jira/browse/SPARK-8414
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> Right now it cleans up old references only through natural GCs, which may not 
> occur if the driver has infinite RAM. We should do a periodic GC to make sure 
> that we actually do clean things up. Something like once per 30 minutes seems 
> relatively inexpensive.
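
A rough sketch of the proposal, with assumed names (this is not the actual ContextCleaner code): schedule a periodic System.gc() so that the weak references held by the cleaner are actually enqueued even when the driver never comes under memory pressure.

{code}
import java.util.concurrent.{Executors, TimeUnit}

// Trigger a JVM GC on a fixed interval; 30 minutes as suggested above.
val periodicGCIntervalMin = 30L
val periodicGCService = Executors.newSingleThreadScheduledExecutor()
periodicGCService.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = System.gc()
}, periodicGCIntervalMin, periodicGCIntervalMin, TimeUnit.MINUTES)
{code}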



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12069) Documentation update for Datasets

2015-11-30 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-12069:


 Summary: Documentation update for Datasets
 Key: SPARK-12069
 URL: https://issues.apache.org/jira/browse/SPARK-12069
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11966) Spark API for UDTFs

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11966:
-
Target Version/s: 1.7.0

> Spark API for UDTFs
> ---
>
> Key: SPARK-11966
> URL: https://issues.apache.org/jira/browse/SPARK-11966
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Jaka Jancar
>Priority: Minor
>
> Defining UDFs is easy using sqlContext.udf.register, but not table-generating 
> functions. For those you still have to use these horrendous Hive interfaces:
> https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7348) DAG visualization: add links to RDD page

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7348:

Target Version/s:   (was: 1.6.0)

> DAG visualization: add links to RDD page
> 
>
> Key: SPARK-7348
> URL: https://issues.apache.org/jira/browse/SPARK-7348
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> It currently has links from the job page to the stage page. It would be nice 
> if it has links to the corresponding RDD page as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12060) Avoid memory copy in JavaSerializerInstance.serialize

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12060:
-
Priority: Critical  (was: Major)

> Avoid memory copy in JavaSerializerInstance.serialize
> -
>
> Key: SPARK-12060
> URL: https://issues.apache.org/jira/browse/SPARK-12060
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
>
> JavaSerializerInstance.serialize uses ByteArrayOutputStream.toByteArray to 
> get the serialized data. ByteArrayOutputStream.toByteArray needs to copy the 
> content in the internal array to a new array. However, since the array will 
> be converted to ByteBuffer at once, we can avoid the memory copy.
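
One way to express the idea, sketched with assumed names rather than the actual Spark classes: subclass ByteArrayOutputStream so its internal buffer can be wrapped in a ByteBuffer directly, skipping the defensive copy that toByteArray() performs.

{code}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

// `buf` and `count` are protected fields of ByteArrayOutputStream, so a
// subclass can hand them to ByteBuffer.wrap without copying.
class ExposedByteArrayOutputStream(initialSize: Int)
    extends ByteArrayOutputStream(initialSize) {
  def toByteBuffer: ByteBuffer = ByteBuffer.wrap(buf, 0, count)
}
{code}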



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11985) Update Spark Streaming - Kinesis Library Documentation regarding data de-aggregation and message handler

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11985:
-
Component/s: Documentation

> Update Spark Streaming - Kinesis Library Documentation regarding data 
> de-aggregation and message handler
> 
>
> Key: SPARK-11985
> URL: https://issues.apache.org/jira/browse/SPARK-11985
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Streaming
>Reporter: Burak Yavuz
>
> Update documentation and provide how-to example in guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6518) Add example code and user guide for bisecting k-means

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6518:

Component/s: Documentation

> Add example code and user guide for bisecting k-means
> -
>
> Key: SPARK-6518
> URL: https://issues.apache.org/jira/browse/SPARK-6518
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11603) ML 1.6 QA: API: Experimental, DeveloperApi, final, sealed audit

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11603:
-
Component/s: Documentation

> ML 1.6 QA: API: Experimental, DeveloperApi, final, sealed audit
> ---
>
> Key: SPARK-11603
> URL: https://issues.apache.org/jira/browse/SPARK-11603
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: DB Tsai
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.  This will 
> probably not include the Pipeline APIs yet since some parts (e.g., feature 
> attributes) are still under flux.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11607) Update MLlib website for 1.6

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11607:
-
Component/s: Documentation

> Update MLlib website for 1.6
> 
>
> Key: SPARK-11607
> URL: https://issues.apache.org/jira/browse/SPARK-11607
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update MLlib's website to include features in 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6280) Remove Akka systemName from Spark

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6280:

Target Version/s:   (was: 1.6.0)

> Remove Akka systemName from Spark
> -
>
> Key: SPARK-6280
> URL: https://issues.apache.org/jira/browse/SPARK-6280
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> `systemName` is an Akka concept. An RPC implementation does not need to support 
> it. 
> We can hard-code the system name in Spark and hide it in the internal Akka 
> RPC implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12031) Integer overflow when do sampling.

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12031:
-
Priority: Critical  (was: Major)

> Integer overflow when do sampling.
> --
>
> Key: SPARK-12031
> URL: https://issues.apache.org/jira/browse/SPARK-12031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1, 1.5.2
>Reporter: uncleGen
>Priority: Critical
>
> In my case, some partitions contain too many items. When doing range partitioning, 
> an exception is thrown:
> {code}
> java.lang.IllegalArgumentException: n must be positive
> at java.util.Random.nextInt(Random.java:300)
> at 
> org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58)
> at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
> at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
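
A rough illustration of the failure mode, not the actual SamplingUtils code: if the number of items seen is tracked in an Int, it can wrap past Int.MaxValue, and Random.nextInt is then called with a non-positive bound. One plausible fix, sketched below, is to count with a Long and derive the replacement index from the full Long range.

{code}
val itemsSeen: Int = Int.MaxValue + 1          // wraps to a negative value
// new java.util.Random().nextInt(itemsSeen)   // IllegalArgumentException: n must be positive

// Possible fix: keep the count as a Long and derive the index from nextDouble.
def randomIndex(rng: java.util.Random, count: Long): Long =
  (rng.nextDouble() * count).toLong
{code}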



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12017) Java Doc Publishing Broken

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033119#comment-15033119
 ] 

Michael Armbrust commented on SPARK-12017:
--

Fixed in https://github.com/apache/spark/pull/10049

> Java Doc Publishing Broken
> --
>
> Key: SPARK-12017
> URL: https://issues.apache.org/jira/browse/SPARK-12017
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>    Reporter: Michael Armbrust
>Priority: Blocker
>
> The java docs are missing from the 1.6 preview.  I think that 
> [this|https://github.com/apache/spark/commit/529a1d3380c4c23fed068ad05a6376162c4b76d6#commitcomment-14392230]
>  is the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12017) Java Doc Publishing Broken

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12017.
--
   Resolution: Fixed
 Assignee: Josh Rosen
Fix Version/s: 1.6.0

> Java Doc Publishing Broken
> --
>
> Key: SPARK-12017
> URL: https://issues.apache.org/jira/browse/SPARK-12017
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>    Reporter: Michael Armbrust
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.6.0
>
>
> The java docs are missing from the 1.6 preview.  I think that 
> [this|https://github.com/apache/spark/commit/529a1d3380c4c23fed068ad05a6376162c4b76d6#commitcomment-14392230]
>  is the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12010) Spark JDBC requires support for column-name-free INSERT syntax

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12010:
-
Target Version/s:   (was: 1.6.0)

> Spark JDBC requires support for column-name-free INSERT syntax
> --
>
> Key: SPARK-12010
> URL: https://issues.apache.org/jira/browse/SPARK-12010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Christian Kurz
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Spark JDBC write only works with technologies which support the following 
> INSERT statement syntax (JdbcUtils.scala: insertStatement()):
> INSERT INTO $table VALUES ( ?, ?, ..., ? )
> Some technologies require a list of column names:
> INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
> Therefore technologies like Progress JDBC Driver for Cassandra do not work 
> with Spark JDBC write.
> Idea for fix:
> Move JdbcUtils.scala:insertStatement() into SqlDialect and add a SqlDialect 
> for Progress JDBC Driver for Cassandra
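
A minimal sketch of the column-name form, with an assumed helper signature (this is not the existing JdbcUtils.insertStatement): build the INSERT from the DataFrame schema so drivers that require an explicit column list still work.

{code}
import org.apache.spark.sql.types.StructType

def insertStatementWithColumns(table: String, schema: StructType): String = {
  val columns = schema.fields.map(_.name).mkString(", ")
  val placeholders = schema.fields.map(_ => "?").mkString(", ")
  s"INSERT INTO $table ($columns) VALUES ($placeholders)"
}
{code}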



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12010) Spark JDBC requires support for column-name-free INSERT syntax

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033124#comment-15033124
 ] 

Michael Armbrust commented on SPARK-12010:
--

Thanks for working on this, but we've already hit code freeze for 1.6.0, so I'm 
going to retarget.  Typically we [let project 
committers|https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-JIRA]
 set the "target version".

> Spark JDBC requires support for column-name-free INSERT syntax
> --
>
> Key: SPARK-12010
> URL: https://issues.apache.org/jira/browse/SPARK-12010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Christian Kurz
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Spark JDBC write only works with technologies which support the following 
> INSERT statement syntax (JdbcUtils.scala: insertStatement()):
> INSERT INTO $table VALUES ( ?, ?, ..., ? )
> Some technologies require a list of column names:
> INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
> Therefore technologies like Progress JDBC Driver for Cassandra do not work 
> with Spark JDBC write.
> Idea for fix:
> Move JdbcUtils.scala:insertStatement() into SqlDialect and add a SqlDialect 
> for Progress JDBC Driver for Cassandra



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11990) DataFrame recompute UDF in some situation.

2015-11-26 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11990.
--
   Resolution: Duplicate
Fix Version/s: 1.6.0

This is already fixed in Spark 1.6 by [SPARK-10371].

> DataFrame recompute UDF in some situation.
> --
>
> Key: SPARK-11990
> URL: https://issues.apache.org/jira/browse/SPARK-11990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Yi Tian
> Fix For: 1.6.0
>
>
> Here is codes for reproducing this problem:
> {code}
>   val mkArrayUDF = org.apache.spark.sql.functions.udf[Array[String],String] 
> ((s: String) => {
> println("udf called")
> Array[String](s+"_part1", s+"_part2")
>   })
>   
>   val df = sc.parallelize(Seq(("a"))).toDF("a")
>   val df2 = df.withColumn("arr",mkArrayUDF(df("a")))
>   val df3 = df2.withColumn("e0", df2("arr")(0)).withColumn("e1", 
> df2("arr")(1))
>   df3.collect().foreach(println)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12017) Java Doc Publishing Broken

2015-11-26 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-12017:


 Summary: Java Doc Publishing Broken
 Key: SPARK-12017
 URL: https://issues.apache.org/jira/browse/SPARK-12017
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Michael Armbrust
Priority: Blocker


The java docs are missing from the 1.6 preview.  I think that 
[this|https://github.com/apache/spark/commit/529a1d3380c4c23fed068ad05a6376162c4b76d6#commitcomment-14392230]
 is the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12017) Java Doc Publishing Broken

2015-11-26 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029153#comment-15029153
 ] 

Michael Armbrust commented on SPARK-12017:
--

/cc [~joshrosen]

> Java Doc Publishing Broken
> --
>
> Key: SPARK-12017
> URL: https://issues.apache.org/jira/browse/SPARK-12017
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>    Reporter: Michael Armbrust
>Priority: Blocker
>
> The java docs are missing from the 1.6 preview.  I think that 
> [this|https://github.com/apache/spark/commit/529a1d3380c4c23fed068ad05a6376162c4b76d6#commitcomment-14392230]
>  is the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11863) Unable to resolve order by if it contains mixture of aliases and real columns.

2015-11-26 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11863.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9961
[https://github.com/apache/spark/pull/9961]

> Unable to resolve order by if it contains mixture of aliases and real columns.
> --
>
> Key: SPARK-11863
> URL: https://issues.apache.org/jira/browse/SPARK-11863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dilip Biswal
> Fix For: 1.6.0
>
>
> Analyzer is unable to resolve order by if the columns in the order by clause 
> contains a mixture of alias and real column names.
> Example :
> var var3 = sqlContext.sql("select c1 as a, c2 as b from inttab group by c1, 
> c2 order by  b, c1")
> This used to work in 1.4 and is failing starting 1.5 and is affecting some 
> tpcds queries (19, 55,71)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11942) fix encoder life cycle for CoGroup

2015-11-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11942:
-
Assignee: Wenchen Fan

> fix encoder life cycle for CoGroup
> --
>
> Key: SPARK-11942
> URL: https://issues.apache.org/jira/browse/SPARK-11942
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11942) fix encoder life cycle for CoGroup

2015-11-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11942.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9928
[https://github.com/apache/spark/pull/9928]

> fix encoder life cycle for CoGroup
> --
>
> Key: SPARK-11942
> URL: https://issues.apache.org/jira/browse/SPARK-11942
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9141) DataFrame recomputed instead of using cached parent.

2015-11-24 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024927#comment-15024927
 ] 

Michael Armbrust commented on SPARK-9141:
-

[~tianyi] please provide a reproduction of the issue you are hitting.  The 
example from the description works for me.  In particular, please include the 
explain output for both the cached and the failing DataFrame.

> DataFrame recomputed instead of using cached parent.
> 
>
> Key: SPARK-9141
> URL: https://issues.apache.org/jira/browse/SPARK-9141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Nick Pritchard
>    Assignee: Michael Armbrust
>Priority: Blocker
>  Labels: cache, dataframe
> Fix For: 1.5.0
>
>
> As I understand, DataFrame.cache() is supposed to work the same as 
> RDD.cache(), so that repeated operations on it will use the cached results 
> and not recompute the entire lineage. However, it seems that some DataFrame 
> operations (e.g. withColumn) change the underlying RDD lineage so that cache 
> doesn't work as expected.
> Below is a Scala example that demonstrates this. First, I define two UDF's 
> that  use println so that it is easy to see when they are being called. Next, 
> I create a simple data frame with one row and two columns. Next, I add a 
> column, cache it, and call count() to force the computation. Lastly, I add 
> another column, cache it, and call count().
> I would have expected the last statement to only compute the last column, 
> since everything else was cached. However, because withColumn() changes the 
> lineage, the whole data frame is recomputed.
> {code}
> // Examples udf's that println when called 
> val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 } 
> val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 } 
> // Initial dataset 
> val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value") 
> // Add column by applying twice udf 
> val df2 = df1.withColumn("twice", twice($"value")) 
> df2.cache() 
> df2.count() //prints Computed: twice(1) 
> // Add column by applying triple udf 
> val df3 = df2.withColumn("triple", triple($"value")) 
> df3.cache() 
> df3.count() //prints Computed: twice(1)\nComputed: triple(1) 
> {code}
> I found a workaround, which helped me understand what was going on behind the 
> scenes, but doesn't seem like an ideal solution. Basically, I convert to RDD 
> then back DataFrame, which seems to freeze the lineage. The code below shows 
> the workaround for creating the second data frame so cache will work as 
> expected.
> {code}
> val df2 = {
>   val tmp = df1.withColumn("twice", twice($"value"))
>   sqlContext.createDataFrame(tmp.rdd, tmp.schema)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9328) Netty IO layer should implement read timeouts

2015-11-24 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025107#comment-15025107
 ] 

Michael Armbrust commented on SPARK-9328:
-

[~joshrosen] is this actually a 1.6 blocker?

> Netty IO layer should implement read timeouts
> -
>
> Key: SPARK-9328
> URL: https://issues.apache.org/jira/browse/SPARK-9328
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.1, 1.3.1, 1.4.1, 1.5.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Spark's network layer does not implement read timeouts which may lead to 
> stalls during shuffle: if a remote shuffle server stalls while responding to 
> a shuffle block fetch request but does not close the socket then the job may 
> block until an OS-level socket timeout occurs.
> I think that we can fix this using Netty's ReadTimeoutHandler 
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
>   The tricky part of working on this will be figuring out the right place to 
> add the handler and ensuring that we don't introduce performance issues by 
> not re-using sockets.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a 
> request - it only cares whether data has been read from the socket. If your 
> connection is persistent, and you only want read timeouts to fire when a 
> request has been sent, you'll need to build a request / response aware 
> timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles then we may 
> have to do something like this.
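
A sketch only, with an assumed place in the channel setup (this is not Spark's actual transport code): Netty's ReadTimeoutHandler raises a ReadTimeoutException when no data is read within the configured interval, turning a silent stall into a failure that can be retried.

{code}
import io.netty.channel.socket.SocketChannel
import io.netty.handler.timeout.ReadTimeoutHandler

def addReadTimeout(ch: SocketChannel, timeoutSecs: Int): Unit = {
  // Fires ReadTimeoutException if nothing is read for timeoutSecs seconds.
  ch.pipeline().addFirst("readTimeout", new ReadTimeoutHandler(timeoutSecs))
  // A request/response-aware variant, as discussed above, would arm the timer
  // only while a fetch request is outstanding.
}
{code}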



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11926) unify GetStructField and GetInternalRowField

2015-11-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11926.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9909
[https://github.com/apache/spark/pull/9909]

> unify GetStructField and GetInternalRowField
> 
>
> Key: SPARK-11926
> URL: https://issues.apache.org/jira/browse/SPARK-11926
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11913) support typed aggregate for complex buffer schema

2015-11-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11913.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9898
[https://github.com/apache/spark/pull/9898]

> support typed aggregate for complex buffer schema
> -
>
> Key: SPARK-11913
> URL: https://issues.apache.org/jira/browse/SPARK-11913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11894) Incorrect results are returned when using null

2015-11-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11894.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9904
[https://github.com/apache/spark/pull/9904]

> Incorrect results are returned when using null
> --
>
> Key: SPARK-11894
> URL: https://issues.apache.org/jira/browse/SPARK-11894
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
> Fix For: 1.6.0
>
>
> In DataSet APIs, the following two datasets are the same. 
>   Seq((new java.lang.Integer(0), "1"), (new java.lang.Integer(22), 
> "2")).toDS()
>   Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()
> Note: java.lang.Integer is Nullable. 
> It could generate an incorrect result. For example, 
> val ds1 = Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()
> val ds2 = Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()//toDF("key", "value").as('df2)
> val res1 = ds1.joinWith(ds2, lit(true)).collect()
> The expected result should be 
> ((null,1),(null,1))
> ((22,2),(null,1))
> ((null,1),(22,2))
> ((22,2),(22,2))
> The actual result is 
> ((0,1),(0,1))
> ((22,2),(0,1))
> ((0,1),(22,2))
> ((22,2),(22,2))



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11921) fix `nullable` of encoder schema

2015-11-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11921.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9906
[https://github.com/apache/spark/pull/9906]

> fix `nullable` of encoder schema
> 
>
> Key: SPARK-11921
> URL: https://issues.apache.org/jira/browse/SPARK-11921
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Michael Armbrust
Here is how I view the relationship between the various components of Spark:

 - *RDDs - *a low level API for expressing DAGs that will be executed in
parallel by Spark workers
 - *Catalyst -* an internal library for expressing trees that we use to
build relational algebra and expression evaluation.  There's also an
optimizer and query planner that turns these logical concepts into RDD
actions.
 - *Tungsten -* an internal optimized execution engine that can compile
catalyst expressions into efficient java bytecode that operates directly on
serialized binary data.  It also has nice low level data structures /
algorithms like hash tables and sorting that operate directly on serialized
data.  These are used by the physical nodes that are produced by the query
planner (and run inside of RDD operation on workers).
 - *DataFrames - *a user facing API that is similar to SQL/LINQ for
constructing dataflows that are backed by catalyst logical plans
 - *Datasets* - a user facing API that is similar to the RDD API for
constructing dataflows that are backed by catalyst logical plans

So everything is still operating on RDDs, but I anticipate most users will
eventually migrate to the higher-level APIs for convenience and automatic
optimization.
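
As a small concrete illustration of that layering (assuming a SQLContext named
sqlContext is in scope):

val df = sqlContext.range(0, 100).filter("id % 2 = 0")
println(df.queryExecution.optimizedPlan)  // Catalyst logical plan
println(df.queryExecution.executedPlan)   // physical plan produced by the query planner
val rows = df.rdd                         // still backed by RDDs underneath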

On Mon, Nov 23, 2015 at 4:18 PM, Jakob Odersky  wrote:

> Hi everyone,
>
> I'm doing some reading-up on all the newer features of Spark such as
> DataFrames, DataSets and Project Tungsten. This got me a bit confused on
> the relation between all these concepts.
>
> When starting to learn Spark, I read a book and the original paper on
> RDDs, this lead me to basically think "Spark == RDDs".
> Now, looking into DataFrames, I read that they are basically (distributed)
> collections with an associated schema, thus enabling declarative queries
> and optimization (through Catalyst). I am uncertain how DataFrames relate
> to RDDs: are DataFrames transformed to operations on RDDs once they have
> been optimized? Or are they completely different concepts? In case of the
> latter, do DataFrames still use the Spark scheduler and get broken down
> into a DAG of stages and tasks?
>
> Regarding project Tungsten, where does it fit in? To my understanding it
> is used to efficiently cache data in memory and may also be used to
> generate query code for specialized hardware. This sounds as though it
> would work on Spark's worker nodes, however it would also only work with
> schema-associated data (aka DataFrames), thus leading me to the conclusion
> that RDDs and DataFrames do not share a common backend which in turn
> contradicts my conception of "Spark == RDDs".
>
> Maybe I missed the obvious as these questions seem pretty basic, however I
> was unable to find clear answers in Spark documentation or related papers
> and talks. I would greatly appreciate any clarifications.
>
> thanks,
> --Jakob
>


[ANNOUNCE] Spark 1.6.0 Release Preview

2015-11-22 Thread Michael Armbrust
In order to facilitate community testing of Spark 1.6.0, I'm excited to
announce the availability of an early preview of the release. This is not a
release candidate, so there is no voting involved. However, it'd be awesome
if community members can start testing with this preview package and report
any problems they encounter.

This preview package contains all the commits to branch-1.6 till commit
308381420f51b6da1007ea09a02d740613a226e0.

The staging maven repository for this preview build can be found here:
https://repository.apache.org/content/repositories/orgapachespark-1162

Binaries for this preview build can be found here:
http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-preview2-bin/

A build of the docs can also be found here:
http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-preview2-docs/

The full change log for this release can be found on JIRA.

*== How can you help? ==*

If you are a Spark user, you can help us test this release by taking a
Spark workload and running on this preview release, then reporting any
regressions.

*== Major Features ==*

When testing, we'd appreciate it if users could focus on areas that have
changed in this release.  Some notable new features include:

SPARK-11787  *Parquet
Performance* - Improve Parquet scan performance when using flat schemas.
SPARK-10810  *Session *
*Management* - Multiple users of the thrift (JDBC/ODBC) server now have
isolated sessions including their own default database (i.e USE mydb) even
on shared clusters.
SPARK-   *Dataset API* -
A new, experimental type-safe API (similar to RDDs) that performs many
operations on serialized binary data and code generation (i.e. Project
Tungsten)
SPARK-1  *Unified
Memory Management* - Shared memory for execution and caching instead of
exclusive division of the regions.
SPARK-10978  *Datasource
API Avoid Double Filter* - When implementing a datasource with filter
pushdown, developers can now tell Spark SQL to avoid double evaluating a
pushed-down filter.
SPARK-2629   *New
improved state management* - trackStateByKey - a DStream transformation for
stateful stream processing, supersedes updateStateByKey in functionality
and performance.

Happy testing!

Michael


[jira] [Updated] (SPARK-7539) Perf tests for Python MLlib

2015-11-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7539:

Component/s: Tests

> Perf tests for Python MLlib
> ---
>
> Key: SPARK-7539
> URL: https://issues.apache.org/jira/browse/SPARK-7539
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, PySpark, Tests
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> As new perf-tests are added to Scala, we should added equivalent ones in 
> Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11819) nice error message for missing encoder

2015-11-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11819.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9810
[https://github.com/apache/spark/pull/9810]

> nice error message for missing encoder
> --
>
> Key: SPARK-11819
> URL: https://issues.apache.org/jira/browse/SPARK-11819
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11876) [SQL] Support PrintSchema in DataSet APIs

2015-11-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11876.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9855
[https://github.com/apache/spark/pull/9855]

> [SQL] Support PrintSchema in DataSet APIs
> -
>
> Key: SPARK-11876
> URL: https://issues.apache.org/jira/browse/SPARK-11876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
> Fix For: 1.6.0
>
>
> For DataSet APIs, prints the schema to the console in a nice tree format
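
Assumed usage once the method exists, mirroring the existing DataFrame.printSchema output (the expected output in the comments is illustrative):

{code}
val ds = Seq((1, "a"), (2, "b")).toDS()
ds.printSchema()
// root
//  |-- _1: integer (nullable = false)
//  |-- _2: string (nullable = true)
{code}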



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11873) Regression for TPC-DS query 63 when used with decimal datatype and windows function

2015-11-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11873:
-
Issue Type: Improvement  (was: Bug)

> Regression for TPC-DS query 63 when used with decimal datatype and windows 
> function
> ---
>
> Key: SPARK-11873
> URL: https://issues.apache.org/jira/browse/SPARK-11873
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Dileep Kumar
>  Labels: perfomance
> Attachments: 63.1.1, 63.1.5, 63.decimal_schema, 
> 63.decimal_schema_windows_function, 63.double_schema, 98.1.1, 98.1.5, 
> decimal_schema.sql, double_schema.sql
>
>
> When running the TPC-DS based queries to benchmark Spark, we found that query 
> 63 (after making it similar to the original query) shows different behavior 
> compared to other queries, e.g. q98, which uses a similar function.
> Here are the performance numbers (execution time in seconds):
> Query   1.1 Baseline   1.5   1.5 + Decimal
> q63     27             26    38
> q98     18             26    24
> As you can see, q63 shows a regression compared to a similar query. I am 
> attaching both versions of the queries and the affected schemas. With the 
> window function added back, this is the only query that seems to be slower in 
> 1.5 than in 1.1.
> I have attached both versions of the schema and queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11819) nice error message for missing encoder

2015-11-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11819:
-
Assignee: Wenchen Fan

> nice error message for missing encoder
> --
>
> Key: SPARK-11819
> URL: https://issues.apache.org/jira/browse/SPARK-11819
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11876) [SQL] Support PrintSchema in DataSet APIs

2015-11-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11876:
-
Assignee: Xiao Li

> [SQL] Support PrintSchema in DataSet APIs
> -
>
> Key: SPARK-11876
> URL: https://issues.apache.org/jira/browse/SPARK-11876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 1.6.0
>
>
> For DataSet APIs, prints the schema to the console in a nice tree format



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11889) Type inference in REPL broken for GroupedDataset.agg

2015-11-20 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11889:


 Summary: Type inference in REPL broken for GroupedDataset.agg
 Key: SPARK-11889
 URL: https://issues.apache.org/jira/browse/SPARK-11889
 Project: Spark
  Issue Type: Bug
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical


This works in compiled code, but fails in the REPL.
{code}
/** An `Aggregator` that adds up any numeric type returned by the given 
function. */
class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with 
Serializable {
  val numeric = implicitly[Numeric[N]]
  override def zero: N = numeric.zero
  override def reduce(b: N, a: I): N = numeric.plus(b, f(a))
  override def merge(b1: N,b2: N): N = numeric.plus(b1, b2)
  override def finish(reduction: N): N = reduction
}

def sum[I, N : Numeric : Encoder](f: I => N): TypedColumn[I, N] = new 
SumOf(f).toColumn

val ds = Seq((1, 1, 2L), (1, 2, 3L), (1, 3, 4L), (2, 1, 5L)).toDS()
ds.groupBy(_._1).agg(count("*"), sum(_._2), sum(_._3)).collect()
{code}

{code}
:38: error: missing parameter type for expanded function ((x$2) => 
x$2._2)
  ds.groupBy(_._1).agg(sum(_._2), sum(_._3)).collect()
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11889) Type inference in REPL broken for GroupedDataset.agg

2015-11-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11889:
-
Target Version/s: 1.6.0

> Type inference in REPL broken for GroupedDataset.agg
> 
>
> Key: SPARK-11889
> URL: https://issues.apache.org/jira/browse/SPARK-11889
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
>
> This works in compiled code, but fails in the REPL.
> {code}
> /** An `Aggregator` that adds up any numeric type returned by the given 
> function. */
> class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with 
> Serializable {
>   val numeric = implicitly[Numeric[N]]
>   override def zero: N = numeric.zero
>   override def reduce(b: N, a: I): N = numeric.plus(b, f(a))
>   override def merge(b1: N,b2: N): N = numeric.plus(b1, b2)
>   override def finish(reduction: N): N = reduction
> }
> def sum[I, N : Numeric : Encoder](f: I => N): TypedColumn[I, N] = new 
> SumOf(f).toColumn
> val ds = Seq((1, 1, 2L), (1, 2, 3L), (1, 3, 4L), (2, 1, 5L)).toDS()
> ds.groupBy(_._1).agg(count("*"), sum(_._2), sum(_._3)).collect()
> {code}
> {code}
> :38: error: missing parameter type for expanded function ((x$2) => 
> x$2._2)
>   ds.groupBy(_._1).agg(sum(_._2), sum(_._3)).collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11889) Type inference in REPL broken for GroupedDataset.agg

2015-11-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11889:
-
Component/s: SQL

> Type inference in REPL broken for GroupedDataset.agg
> 
>
> Key: SPARK-11889
> URL: https://issues.apache.org/jira/browse/SPARK-11889
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
>
> This works in compiled code, but fails in the REPL.
> {code}
> /** An `Aggregator` that adds up any numeric type returned by the given 
> function. */
> class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with 
> Serializable {
>   val numeric = implicitly[Numeric[N]]
>   override def zero: N = numeric.zero
>   override def reduce(b: N, a: I): N = numeric.plus(b, f(a))
>   override def merge(b1: N,b2: N): N = numeric.plus(b1, b2)
>   override def finish(reduction: N): N = reduction
> }
> def sum[I, N : Numeric : Encoder](f: I => N): TypedColumn[I, N] = new 
> SumOf(f).toColumn
> val ds = Seq((1, 1, 2L), (1, 2, 3L), (1, 3, 4L), (2, 1, 5L)).toDS()
> ds.groupBy(_._1).agg(count("*"), sum(_._2), sum(_._3)).collect()
> {code}
> {code}
> :38: error: missing parameter type for expanded function ((x$2) => 
> x$2._2)
>   ds.groupBy(_._1).agg(sum(_._2), sum(_._3)).collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11890) Encoder errors logic breaks on Scala 2.11

2015-11-20 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11890:


 Summary: Encoder errors logic breaks on Scala 2.11
 Key: SPARK-11890
 URL: https://issues.apache.org/jira/browse/SPARK-11890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11890) Encoder errors logic breaks on Scala 2.11

2015-11-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11890:
-
Description: 
{code}
[error] 
/home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:66:
 not found: value getClassNameFromType
[error] val className = getClassNameFromType(tpe)
[error] ^
[error] 
/home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:337:
 not found: value getClassNameFromType
[error] val clsName = getClassNameFromType(tpe)
[error]   ^
[error] 
/home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:353:
 not found: value silentSchemaFor
[error]   val Schema(catalystType, nullable) = silentSchemaFor(elementType)
[error]^
[error] 
/home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:360:
 not found: value getClassNameFromType
[error] val clsName = getClassNameFromType(elementType)
[error]   ^
[error] 
/home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:416:
 not found: value getClassNameFromType
[error]   val className = getClassNameFromType(optType)
[error]   ^
[error] 
/home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:424:
 not found: value silentSchemaFor
[error] expressions.Literal.create(null, 
silentSchemaFor(optType).dataType),
[error]  ^
[error] 
/home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:451:
 not found: value getClassNameFromType
[error] val clsName = getClassNameFromType(fieldType)
[error]   ^
[error] 7 errors found
{code}

> Encoder errors logic breaks on Scala 2.11
> -
>
> Key: SPARK-11890
> URL: https://issues.apache.org/jira/browse/SPARK-11890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:66:
>  not found: value getClassNameFromType
> [error] val className = getClassNameFromType(tpe)
> [error] ^
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:337:
>  not found: value getClassNameFromType
> [error] val clsName = getClassNameFromType(tpe)
> [error]   ^
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:353:
>  not found: value silentSchemaFor
> [error]   val Schema(catalystType, nullable) = 
> silentSchemaFor(elementType)
> [error]^
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:360:
>  not found: value getClassNameFromType
> [error] val clsName = getClassNameFromType(elementType)
> [error]   ^
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:416:
>  not found: value getClassNameFromType
> [error]   val className = getClassNameFromType(optType)
> [error]   ^
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:424:
>  not found: value silentSchemaFor
> [error] expressions.Literal.create(null, 
> silentSchemaFor(optType).dataType),
> [error]  ^
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:451:
>  not found: value getClassNameFromType
> [error] val clsName = getClassNameFro
