[jira] [Commented] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext

2014-11-03 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194376#comment-14194376
 ] 

Cheng Lian commented on SPARK-4199:
---

Hi [~huangjs], which version/commit are you using? Could you please provide, 
for example, a {{spark-shell}} session snippet that helps reproduce this issue? 
Just tried both 1.1.0 and the most recent master 
(https://github.com/apache/spark/tree/76386e1a23c55a58c0aeea67820aab2bac71b24b) 
with this under {{spark-shell}}:
{code}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.catalyst.types._
import java.sql.Date

val sparkContext = sc
import sparkContext._

val hiveContext = new HiveContext(sparkContext)
import hiveContext._

sql("DROP TABLE IF EXISTS xxx")
{code}
The only ERROR log (which is expected) I found is:
{code}
14/11/03 17:12:56 ERROR metadata.Hive: 
NoSuchObjectException(message:default.xxx table not found)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1560)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
at com.sun.proxy.$Proxy16.get_table(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
...
{code}
And the DROP statement itself completed successfully.

 Drop table if exists raises table not found exception in HiveContext
 --

 Key: SPARK-4199
 URL: https://issues.apache.org/jira/browse/SPARK-4199
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jianshi Huang

 Try this:
   sql("DROP TABLE IF EXISTS some_table")
 The exception looks like this:
 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS 
 some_table
 14/11/02 19:55:29 INFO ParseDriver: Parse Completed
 14/11/02 19:55:29 INFO Driver: </PERFLOG method=parse start=1414986929678 
 end=1414986929678 duration=0>
 14/11/02 19:55:29 INFO Driver: <PERFLOG method=semanticAnalyze>
 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation 
 class:org.apache.hadoop.hive.metastore.ObjectStore
 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called
 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: 
 Table not found some_table
 org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249)
 at 
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
 at 
 org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294)
 at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273)
 at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
 at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
 at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:44)
 at 
 org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353)
 at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:104)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-1442) Add Window function support

2014-11-03 Thread guowei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guowei updated SPARK-1442:
--
Attachment: (was: Window Function.pdf)

 Add Window function support
 ---

 Key: SPARK-1442
 URL: https://issues.apache.org/jira/browse/SPARK-1442
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Chengxiang Li
 Attachments: Window Function.pdf


 Similar to Hive, add window function support for Catalyst.
 https://issues.apache.org/jira/browse/HIVE-4197
 https://issues.apache.org/jira/browse/HIVE-896
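To make the requested feature concrete, here is a hedged illustration (the {{sales}} table and its columns are hypothetical) of the kind of HiveQL window query, already supported by Hive via HIVE-896/HIVE-4197, that Catalyst would need to handle:
{code}
// Illustrative only: ranking rows within a partition.
val windowQuery = """
  SELECT dept, name, salary,
         rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
  FROM sales
"""
{code}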



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1442) Add Window function support

2014-11-03 Thread guowei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guowei updated SPARK-1442:
--
Attachment: Window Function.pdf

 Add Window function support
 ---

 Key: SPARK-1442
 URL: https://issues.apache.org/jira/browse/SPARK-1442
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Chengxiang Li
 Attachments: Window Function.pdf


 Similar to Hive, add window function support for Catalyst.
 https://issues.apache.org/jira/browse/HIVE-4197
 https://issues.apache.org/jira/browse/HIVE-896



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4203) Partition directories in random order when inserting into hive table

2014-11-03 Thread Matthew Taylor (JIRA)
Matthew Taylor created SPARK-4203:
-

 Summary: Partition directories in random order when inserting into 
hive table
 Key: SPARK-4203
 URL: https://issues.apache.org/jira/browse/SPARK-4203
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: Matthew Taylor


When doing an insert into a Hive table with partitions, the partition folders are 
written to the file system in a random order instead of the order defined at 
table creation. It seems that the loadPartition method in Hive.java takes a 
Map<String,String> parameter but expects to be called with a map that has a 
defined ordering, such as a LinkedHashMap. I have a patch which I will submit as a PR. 
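As a hedged illustration of the ordering point above (the partition values are made up), a LinkedHashMap preserves the insertion order of the partition columns while a plain HashMap does not:
{code}
import scala.collection.mutable

// Hypothetical partition spec for a table partitioned by (year, month, day).
val ordered   = mutable.LinkedHashMap("year" -> "2014", "month" -> "11", "day" -> "03")
val unordered = mutable.HashMap("year" -> "2014", "month" -> "11", "day" -> "03")

// LinkedHashMap iterates in insertion order, matching the table definition.
println(ordered.keys.mkString("/"))   // year/month/day
// HashMap iteration order is arbitrary, which is what produces the randomly
// ordered partition directories described above.
println(unordered.keys.mkString("/"))
{code}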




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4204) Utils.exceptionString only return the information for the outermost exception

2014-11-03 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-4204:
---

 Summary: Utils.exceptionString only return the information for the 
outermost exception
 Key: SPARK-4204
 URL: https://issues.apache.org/jira/browse/SPARK-4204
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Shixiong Zhu
Priority: Minor


An exception may contain some inner exceptions, but Utils.exceptionString only 
returns the information for the outermost exception.
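A minimal sketch (a hypothetical helper, not the actual Utils code) of how the full chain could be captured, relying on the fact that printStackTrace already walks the "Caused by" chain:
{code}
import java.io.{PrintWriter, StringWriter}

// Formats an exception together with all of its nested causes.
def fullExceptionString(e: Throwable): String = {
  val sw = new StringWriter()
  e.printStackTrace(new PrintWriter(sw)) // includes nested "Caused by:" sections
  sw.toString
}
{code}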



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4204) Utils.exceptionString only return the information for the outermost exception

2014-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194507#comment-14194507
 ] 

Apache Spark commented on SPARK-4204:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/3073

 Utils.exceptionString only return the information for the outermost exception
 -

 Key: SPARK-4204
 URL: https://issues.apache.org/jira/browse/SPARK-4204
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Shixiong Zhu
Priority: Minor

 An exception may contain some inner exceptions, but Utils.exceptionString 
 only returns the information for the outermost exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4205) Timestamp and Date objects with comparison operators

2014-11-03 Thread Marc Culler (JIRA)
Marc Culler created SPARK-4205:
--

 Summary: Timestamp and Date objects with comparison operators
 Key: SPARK-4205
 URL: https://issues.apache.org/jira/browse/SPARK-4205
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Marc Culler
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Chris Heller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Heller updated SPARK-2691:

Attachment: spark-docker.patch

Here is the patch for the changes needed to support docker images in the fine 
grained backend.

The approach taken here was to just populate the DockerInfo of the container 
info if some properties were set in the properties file. This has no support 
for versions of mesos which do not support docker, so it is very incomplete.

Additionally, there is only support for image name and volumes.

For volumes, you just provide a string, which takes a value similar in form to 
the argument to 'docker run -v', i.e. it is a comma-separated list of 
[host:]container[:mode] options. I think this is sufficient; it parallels the 
command line, and so should be familiar.
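For reference, a hedged sketch (not taken from the attached patch) of how a volume string in the [host:]container[:mode] form described above could be parsed:
{code}
// Parses e.g. "/host/logs:/logs:ro,/scratch" into (hostPath, containerPath, mode)
// triples; hostPath and mode are optional, mirroring 'docker run -v'.
def parseVolumes(spec: String): Seq[(Option[String], String, Option[String])] =
  spec.split(",").toSeq.map { vol =>
    vol.split(":") match {
      case Array(container)             => (None, container, None)
      case Array(host, container)       => (Some(host), container, None)
      case Array(host, container, mode) => (Some(host), container, Some(mode))
      case _                            => sys.error(s"Malformed volume spec: $vol")
    }
  }
{code}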

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Chris Heller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194513#comment-14194513
 ] 

Chris Heller edited comment on SPARK-2691 at 11/3/14 1:08 PM:
--

Here is the patch for the changes needed to support docker images in the fine 
grained backend.

The approach taken here was to just populate the DockerInfo of the container 
info if some properties were set in the properties file. This has no support 
for versions of mesos which do not support docker, so it is very incomplete.

Additionally, there is only support for image name and volumes.

For volumes, you just provide a string, which takes a value similar in form to 
the argument to 'docker run -v', i.e. it is a comma-separated list of 
[host:]container[:mode] options. I think this is sufficient; it parallels the 
command line, and so should be familiar.
 
I would suggest, for all options of the DockerInfo exposed,  to mirror how 
those options are set on the docker command line.


was (Author: chrisheller):
Here is the patch for the changes needed to support docker images in the fine 
grained backend.

The approach taken here was to just populate the DockerInfo of the container 
info if some properties were set in the properties file. This has no support 
for versions of mesos which do not support docker, so it is very incomplete.

Additionally, there is only support for image name and volumes.

For volumes, you just provide a string, which takes a value similar in form to 
the argument to 'docker run -v', i.e. it is a comma-separated list of 
[host:]container[:mode] options. I think this is sufficient; it parallels the 
command line, and so should be familiar.

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4206) BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it

2014-11-03 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-4206:
---

 Summary: BlockManager warnings in local mode: Block $blockId 
already exists on this machine; not re-adding it
 Key: SPARK-4206
 URL: https://issues.apache.org/jira/browse/SPARK-4206
 Project: Spark
  Issue Type: Bug
Reporter: Imran Rashid
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4206) BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it

2014-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194550#comment-14194550
 ] 

Sean Owen commented on SPARK-4206:
--

I think there was a discussion about this and the consensus was that these 
aren't anything to worry about and can be info-level messages?

 BlockManager warnings in local mode: Block $blockId already exists on this 
 machine; not re-adding it
 -

 Key: SPARK-4206
 URL: https://issues.apache.org/jira/browse/SPARK-4206
 Project: Spark
  Issue Type: Bug
Reporter: Imran Rashid
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4206) BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it

2014-11-03 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-4206:

Description: 
When running in local mode, you often get log warning messages like:

WARN storage.BlockManager: Block input-0-1415022975000 already exists on this 
machine; not re-adding it

(e.g., try running the TwitterPopularTags example in local mode)

I think these warning messages are pretty unsettling for a new user, and should 
be removed.  If they are truly innocuous, they should be changed to logInfo, or 
maybe even logDebug.  Or if they might actually indicate a problem, we should 
find the root cause and fix it.


I *think* the problem is caused by a replication level > 1 when running in 
local mode.  In BlockManager.doPut, first the block is put locally:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L692

and then if the replication level > 1, a request is sent out to replicate the 
block:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L827

However, in local mode, there isn't anywhere else to replicate the block; the 
request comes back to the same node, which then issues the warning that the 
block has already been added.

If that analysis is right, the easy fix would be to make sure replicationLevel 
= 1 in local mode.  But it's a little disturbing that a replication request 
could result in an attempt to replicate on the same node -- and that if 
something is wrong, we only issue a warning and keep going.

If this really is the culprit, then it might be worth taking a closer look at the 
logic of replication.
Environment: local mode, branch-1.1 & master

 BlockManager warnings in local mode: Block $blockId already exists on this 
 machine; not re-adding it
 -

 Key: SPARK-4206
 URL: https://issues.apache.org/jira/browse/SPARK-4206
 Project: Spark
  Issue Type: Bug
 Environment: local mode, branch-1.1 & master
Reporter: Imran Rashid
Priority: Minor

 When running in local mode, you often get log warning messages like:
 WARN storage.BlockManager: Block input-0-1415022975000 already exists on this 
 machine; not re-adding it
 (e.g., try running the TwitterPopularTags example in local mode)
 I think these warning messages are pretty unsettling for a new user, and 
 should be removed.  If they are truly innocuous, they should be changed to 
 logInfo, or maybe even logDebug.  Or if they might actually indicate a 
 problem, we should find the root cause and fix it.
 I *think* the problem is caused by a replication level > 1 when running in 
 local mode.  In BlockManager.doPut, first the block is put locally:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L692
 and then if the replication level > 1, a request is sent out to replicate the 
 block:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L827
 However, in local mode, there isn't anywhere else to replicate the block; the 
 request comes back to the same node, which then issues the warning that the 
 block has already been added.
 If that analysis is right, the easy fix would be to make sure 
 replicationLevel = 1 in local mode.  But it's a little disturbing that a 
 replication request could result in an attempt to replicate on the same node 
 -- and that if something is wrong, we only issue a warning and keep going.
 If this really is the culprit, then it might be worth taking a closer look at 
 the logic of replication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4206) BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it

2014-11-03 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194574#comment-14194574
 ] 

Imran Rashid commented on SPARK-4206:
-

Thanks Sean -- sorry I accidentally created the issue before fleshing it out, I 
added more info now.

Apologies if I missed the previous discussion, I didn't find anything in JIRA 
or spark-dev -- do you mind pointing me at it if you do find something?  I had 
always assumed they were no big deal earlier as well, but after taking a closer 
look, I'm not so sure.  If my explanation is correct, at the very least we 
ought to be able to eliminate the warning entirely by setting the replication 
level = 1 in local mode.

 BlockManager warnings in local mode: Block $blockId already exists on this 
 machine; not re-adding it
 -

 Key: SPARK-4206
 URL: https://issues.apache.org/jira/browse/SPARK-4206
 Project: Spark
  Issue Type: Bug
 Environment: local mode, branch-1.1 & master
Reporter: Imran Rashid
Priority: Minor

 When running in local mode, you often get log warning messages like:
 WARN storage.BlockManager: Block input-0-1415022975000 already exists on this 
 machine; not re-adding it
 (e.g., try running the TwitterPopularTags example in local mode)
 I think these warning messages are pretty unsettling for a new user, and 
 should be removed.  If they are truly innocuous, they should be changed to 
 logInfo, or maybe even logDebug.  Or if they might actually indicate a 
 problem, we should find the root cause and fix it.
 I *think* the problem is caused by a replication level > 1 when running in 
 local mode.  In BlockManager.doPut, first the block is put locally:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L692
 and then if the replication level > 1, a request is sent out to replicate the 
 block:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L827
 However, in local mode, there isn't anywhere else to replicate the block; the 
 request comes back to the same node, which then issues the warning that the 
 block has already been added.
 If that analysis is right, the easy fix would be to make sure 
 replicationLevel = 1 in local mode.  But it's a little disturbing that a 
 replication request could result in an attempt to replicate on the same node 
 -- and that if something is wrong, we only issue a warning and keep going.
 If this really is the culprit, then it might be worth taking a closer look at 
 the logic of replication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2014-11-03 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194576#comment-14194576
 ] 

Andre Schumacher commented on SPARK-2620:
-

I also bumped into this issue (on Spark 1.1.0) and it is kind of extremely 
annoying although it only affects the REPL. Is anybody actively working on 
resolving this? Given it's already a few months old: are there any blockers for 
making this work? Matei mentioned the way code is wrapped inside the REPL.
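Since the comment above notes that only the REPL is affected, here is a hedged sketch of the reproduction from the description below as a compiled application, where the case class key behaves as expected:
{code}
// Standalone application version of the REPL reproduction below; assumes a
// plain local[4] setup rather than the spark-shell.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

case class P(name: String)

object CaseClassKeyApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("case-class-key").setMaster("local[4]"))
    val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
    // Outside the REPL, equal case class instances hash to the same key,
    // so "bob" is counted twice as expected.
    val counts = sc.parallelize(ps).map(x => (x, 1)).reduceByKey(_ + _).collect()
    println(counts.mkString(", "))
    sc.stop()
  }
}
{code}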

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.1.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name:String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x,1)).reduceByKey((x,y) => x+y).collect
 [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), 
 (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name,1)).reduceByKey((x,y) => x+y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4207) Query which has syntax like 'not like' is not working in Spark SQL

2014-11-03 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created SPARK-4207:
--

 Summary: Query which has syntax like 'not like' is not working in 
Spark SQL
 Key: SPARK-4207
 URL: https://issues.apache.org/jira/browse/SPARK-4207
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala


Queries which have 'not like' are not working in Spark SQL. The same works in Spark 
HiveQL.

{code}
sql("SELECT * FROM records where value not like 'val%'")
{code}

The above query fails with the exception below:

{code}
Exception in thread "main" java.lang.RuntimeException: [1.39] failure: ``IN'' 
expected but `like' found

SELECT * FROM records where value not like 'val%'
  ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75)
at 
org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:186)

{code}
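For reference, a hedged illustration of the contrast described above, assuming a SQLContext and a HiveContext named as below ({{records}} is the reporter's table):
{code}
// Through the Spark SQL parser the statement fails as shown above; through
// HiveQL the same statement is reported to work.
sqlContext.sql("SELECT * FROM records where value not like 'val%'")
hiveContext.sql("SELECT * FROM records where value not like 'val%'")
{code}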



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Eduardo Jimenez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194625#comment-14194625
 ] 

Eduardo Jimenez commented on SPARK-2691:


Thanks! docker-cli format it is then, as I agree it's better. There might be 
some fields that are required by Mesos.proto but not required by Docker, and in 
those cases I'll stick to the Mesos requirements.

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Chris Heller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194632#comment-14194632
 ] 

Chris Heller commented on SPARK-2691:
-

That seems reasonable. In fact, the volumes field of a ContainerInfo is not 
part of the DockerInfo structure, but since there is only a DOCKER type of 
ContainerInfo at the moment, and since the volumes field is described perfectly 
by the 'docker run -v' syntax, it seems OK to repurpose it here.

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Tom Arnfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194634#comment-14194634
 ] 

Tom Arnfeld edited comment on SPARK-2691 at 11/3/14 3:42 PM:
-

Thanks so much for the patches here [~ChrisHeller]! We'd literally just sat 
down to implement this. Is there a github pull request with this patch?

It would also be really great if it were possible to specify extra environment 
variables to be given to the executor container.


was (Author: tarnfeld):
Thanks so much for the patches here [~ChrisHeller]! We'd literally just sat 
down to implement this. Is there a github pull request with this patch?

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Tom Arnfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194634#comment-14194634
 ] 

Tom Arnfeld commented on SPARK-2691:


Thanks so much for the patches here [~ChrisHeller]! We'd literally just sat 
down to implement this. Is there a github pull request with this patch?

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Chris Heller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194637#comment-14194637
 ] 

Chris Heller commented on SPARK-2691:
-

+1 on passing in environment. Great idea.

There isn't a pull request at the moment; I didn't feel the patch was complete enough 
for that (the lack of support for coarse mode and the total disregard for mesos 
pre 0.20 make the patch a little fragile) -- but I'll happily create one if 
you'd like.

What is there has been in use on our cluster for a while now, and I would 
really love to have this be part of upstream.

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Tom Arnfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194646#comment-14194646
 ] 

Tom Arnfeld commented on SPARK-2691:


Awesome. We're keen to get spark up and running on our cluster to share with 
Hadoop so are going to be working on this now. Would you mind if we took this 
patch and made it a bit more fully featured (env variables, pre 0.20 support, 
coarse mode) and opened a pull request to spark?

Just wondering how we can make this as frictionless as possible and not overlap 
with any work you're doing on the patch. We're very keen to get this ready and 
merged into the spark master branch.

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Tom Arnfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194649#comment-14194649
 ] 

Tom Arnfeld commented on SPARK-2691:


Also [~ChrisHeller] if you're not shipping spark as an executor URI it'd be 
awesome if you were able to share the {{Dockerfile}} (or the rough outline of 
it) you're using to build the spark docker image. That'd help us (and others 
I'm sure) greatly.

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Tom Arnfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194634#comment-14194634
 ] 

Tom Arnfeld edited comment on SPARK-2691 at 11/3/14 3:57 PM:
-

Thanks so much for the patches here [~ChrisHeller]/[~yoeduardoj]! We'd 
literally just sat down to implement this. Is there a github pull request with 
this patch?

It would also be really great if it were possible to specify extra environment 
variables to be given to the executor container.


was (Author: tarnfeld):
Thanks so much for the patches here [~ChrisHeller]! We'd literally just sat 
down to implement this. Is there a github pull request with this patch?

It would also be really great if it were possible to specify extra environment 
variables to be given to the executor container.

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Tom Arnfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194646#comment-14194646
 ] 

Tom Arnfeld edited comment on SPARK-2691 at 11/3/14 3:56 PM:
-

Awesome. We're keen to get spark up and running on our cluster to share with 
Hadoop so are going to be working on this now. Would anyone ([~yoeduardoj]?) 
mind if we took this patch and made it a bit more fully featured (env 
variables, pre 0.20 support, coarse mode) and opened a pull request to spark?

Just wondering how we can make this as frictionless as possible and not overlap 
with any work you're doing on the patch. We're very keen to get this ready and 
merged into the spark master branch.


was (Author: tarnfeld):
Awesome. We're keen to get spark up and running on our cluster to share with 
Hadoop so are going to be working on this now. Would you mind if we took this 
patch and made it a bit more fully featured (env variables, pre 0.20 support, 
coarse mode) and opened a pull request to spark?

Just wondering how we can make this as frictionless as possible and not overlap 
with any work you're doing on the patch. We're very keen to get this ready and 
merged into the spark master branch.

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Eduardo Jimenez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194655#comment-14194655
 ] 

Eduardo Jimenez commented on SPARK-2691:


Go for it. It would be helpful if most of the possible mesos arguments are 
supported (see ContainerInfo in 
https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto)

One thing I was trying to do is create a set of common Mesos primitives, as the 
code to create ContainerInfo is going to be very similar for both fine-grained 
and coarse-grained mode. Just my $.02.



 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Eduardo Jimenez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194669#comment-14194669
 ] 

Eduardo Jimenez commented on SPARK-2691:


I would also say don't only support Docker. I was doing this in a way that 
makes it possible to mount volumes using a Mesos container as well (cgroups I 
think), but I haven't actually tried it yet (want to use the mesos sandbox for 
spark files, but at the same time I want to use multiple work directories to 
span across several disks).

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4208) stack over flow error while using sqlContext.sql

2014-11-03 Thread milq (JIRA)
milq created SPARK-4208:
---

 Summary: stack over flow error while using sqlContext.sql
 Key: SPARK-4208
 URL: https://issues.apache.org/jira/browse/SPARK-4208
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.1.0
 Environment: windows 7 , prebuilt spark-1.1.0-bin-hadoop2.3
Reporter: milq


The error happens when using sqlContext.sql:

14/11/03 18:54:43 INFO BlockManager: Removing block broadcast_1
14/11/03 18:54:43 INFO MemoryStore: Block broadcast_1 of size 2976 dropped from 
memory (free 28010260
14/11/03 18:54:43 INFO ContextCleaner: Cleaned broadcast 1
root
 |--  firstName : string (nullable = true)
 |-- lastNameX: string (nullable = true)

Exception in thread "main" java.lang.StackOverflowError
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-11-03 Thread Chris Heller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194679#comment-14194679
 ] 

Chris Heller commented on SPARK-2691:
-

OK, here is the patch as a PR: https://github.com/apache/spark/pull/3074

[~tarnfeld] feel free to expand on this patch. I was looking at the code today 
and realized the coarse mode support should be trivial (just setting a 
ContainerInfo inside the TaskInfo created) -- it just cannot reuse the 
fine-grained code path in its current form since that assumes passing of an 
ExecutorInfo, but it could easily be generalized over a ContainerInfo instead.
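A hedged sketch (not part of the attached patch or the PR) of the coarse-mode point above: attaching the Docker image directly to the TaskInfo via a ContainerInfo, using the Mesos protobuf API:
{code}
import org.apache.mesos.Protos.{ContainerInfo, TaskInfo}

// Illustrative only: sets a DOCKER-type ContainerInfo with the given image
// on a TaskInfo builder, which is roughly what coarse mode would need.
def withDockerImage(task: TaskInfo.Builder, image: String): TaskInfo.Builder =
  task.setContainer(
    ContainerInfo.newBuilder()
      .setType(ContainerInfo.Type.DOCKER)
      .setDocker(ContainerInfo.DockerInfo.newBuilder().setImage(image)))
{code}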

We are not shipping the spark image as an executor URI; instead, spark is 
bundled in the image. Only a stock spark is needed in the image. A simple 
Dockerfile would look like this (assuming you have a spark tarball and libmesos in 
your directory with the Dockerfile):

{noformat}
FROM ubuntu

RUN apt-get -y update
RUN apt-get -y install default-jre-headless
RUN apt-get -y install python2.7

ADD spark-1.1.0-bin-hadoop1.tgz /
RUN mv /spark-1.1.0-bin-hadoop1 /spark
COPY libmesos-0.20.1.so /usr/lib/libmesos.so

ENV SPARK_HOME /spark
ENV MESOS_JAVA_NATIVE_LIBRARY /usr/lib/libmesos.so

CMD ps -ef
{noformat}

[~yoeduardoj] one awesome thing, which is actually beyond the scope of docker 
support but still related to mesos, would be the ability to support 
configuration of what role and attributes in a mesos offer are filtered by 
spark -- but this is not relevant here; just wanted to bring it up while folks are 
digging into the mesos backend code.

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos
 Attachments: spark-docker.patch


 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is to be downloaded on each slave or 
 even on each execution, depending on whether coarse mode is used.
 We want to make Spark able to support launching Executors via a Docker image 
 that utilizes the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and 
 the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4207) Query which has syntax like 'not like' is not working in Spark SQL

2014-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194714#comment-14194714
 ] 

Apache Spark commented on SPARK-4207:
-

User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/3075

 Query which has syntax like 'not like' is not working in Spark SQL
 --

 Key: SPARK-4207
 URL: https://issues.apache.org/jira/browse/SPARK-4207
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala
Assignee: Ravindra Pesala

 Queries which have 'not like' are not working in Spark SQL. The same works in Spark 
 HiveQL.
 {code}
 sql("SELECT * FROM records where value not like 'val%'")
 {code}
 The above query fails with the exception below:
 {code}
 Exception in thread "main" java.lang.RuntimeException: [1.39] failure: ``IN'' 
 expected but `like' found
 SELECT * FROM records where value not like 'val%'
   ^
   at scala.sys.package$.error(package.scala:27)
   at 
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75)
   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75)
   at 
 org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:186)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4205) Timestamp and Date objects with comparison operators

2014-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194777#comment-14194777
 ] 

Apache Spark commented on SPARK-4205:
-

User 'culler' has created a pull request for this issue:
https://github.com/apache/spark/pull/3066

 Timestamp and Date objects with comparison operators
 

 Key: SPARK-4205
 URL: https://issues.apache.org/jira/browse/SPARK-4205
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Marc Culler
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4209) Support UDT in UDF

2014-11-03 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4209:


 Summary: Support UDT in UDF
 Key: SPARK-4209
 URL: https://issues.apache.org/jira/browse/SPARK-4209
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Xiangrui Meng


UDF doesn't recognize functions defined with UDTs. Before execution, an SQL 
internal datum should be converted to Scala types, and after execution, the 
result should be converted back to internal format (maybe this part is already 
done).
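A hedged, self-contained sketch of the conversion being described (the trait below is a stand-in, not Spark's actual UDT API):
{code}
// Stand-in for a UDT's two conversion directions.
trait SimpleUdt[T] {
  def deserialize(datum: Any): T // internal SQL datum -> user type
  def serialize(obj: T): Any     // user type -> internal SQL datum
}

// Wraps a Scala function so it can be applied to internal data: convert the
// input to the user type before the UDF runs, and convert the result back
// to the internal format afterwards.
def wrapUdf[A, B](in: SimpleUdt[A], out: SimpleUdt[B])(f: A => B): Any => Any =
  (datum: Any) => out.serialize(f(in.deserialize(datum)))
{code}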



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4206) BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it

2014-11-03 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194793#comment-14194793
 ] 

Imran Rashid commented on SPARK-4206:
-

A little more confirmation: if I change this line

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala#L56

to use a storage level without replication, the warnings go away.

I'd like to change it so that
1) when a storage level is initially requested, the user gets a warning if they 
request some impossible amount of replication b/c there aren't that many nodes 
in the cluster, and the storage level is auto-downgraded
2) the warning turns into an exception.

(2) is a little scary / ambitious ... but if there is *another* cause for this, 
I'd like to find out rather than just have it get ignored again.
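A hedged sketch of point (1) above (a hypothetical helper, not actual Spark code): cap the requested replication at what the cluster can satisfy and warn about the downgrade:
{code}
import org.apache.spark.storage.StorageLevel

// Illustrative only: limits the replication of a requested storage level to the
// number of nodes available, which would silence the warning in local mode.
def downgradeIfNeeded(level: StorageLevel, numNodes: Int): StorageLevel =
  if (level.replication > numNodes) {
    // a real implementation would also logWarning about the downgrade
    StorageLevel(level.useDisk, level.useMemory, level.useOffHeap, level.deserialized, numNodes)
  } else {
    level
  }
{code}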

 BlockManager warnings in local mode: Block $blockId already exists on this 
 machine; not re-adding it
 -

 Key: SPARK-4206
 URL: https://issues.apache.org/jira/browse/SPARK-4206
 Project: Spark
  Issue Type: Bug
 Environment: local mode, branch-1.1 & master
Reporter: Imran Rashid
Priority: Minor

 When running in local mode, you often get log warning messages like:
 WARN storage.BlockManager: Block input-0-1415022975000 already exists on this 
 machine; not re-adding it
 (e.g., try running the TwitterPopularTags example in local mode)
 I think these warning messages are pretty unsettling for a new user, and 
 should be removed.  If they are truly innocuous, they should be changed to 
 logInfo, or maybe even logDebug.  Or if they might actually indicate a 
 problem, we should find the root cause and fix it.
 I *think* the problem is caused by a replication level > 1 when running in 
 local mode.  In BlockManager.doPut, first the block is put locally:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L692
 and then if the replication level > 1, a request is sent out to replicate the 
 block:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L827
 However, in local mode, there isn't anywhere else to replicate the block; the 
 request comes back to the same node, which then issues the warning that the 
 block has already been added.
 If that analysis is right, the easy fix would be to make sure 
 replicationLevel = 1 in local mode.  But it's a little disturbing that a 
 replication request could result in an attempt to replicate on the same node 
 -- and that if something is wrong, we only issue a warning and keep going.
 If this really is the culprit, then it might be worth taking a closer look at 
 the logic of replication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4203) Partition directories in random order when inserting into hive table

2014-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194843#comment-14194843
 ] 

Apache Spark commented on SPARK-4203:
-

User 'tbfenet' has created a pull request for this issue:
https://github.com/apache/spark/pull/3076

 Partition directories in random order when inserting into hive table
 

 Key: SPARK-4203
 URL: https://issues.apache.org/jira/browse/SPARK-4203
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: Matthew Taylor

 When doing an insert into a hive table with partitions, the folders written to 
 the file system are in a random order instead of the order defined in table 
 creation. Seems that the loadPartition method in Hive.java has a 
 Map<String,String> parameter but expects to be called with a map that has a 
 defined ordering, such as LinkedHashMap. Have a patch which I will do a PR 
 for. 
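
A small, self-contained illustration of the ordering problem (not the actual
Hive.loadPartition call): an unordered map can scramble the partition directory
order, while a LinkedHashMap preserves the order the table declares:
{code}
import scala.collection.mutable

// Partition spec as declared in the table: year, then month, then day.
val declared = Seq("year" -> "2014", "month" -> "11", "day" -> "03")

// An unordered HashMap forgets insertion order, so iterating it (as the
// loadPartition path effectively does) may yield month=/day=/year= directories.
val unordered = mutable.HashMap(declared: _*)
println(unordered.map { case (k, v) => s"$k=$v" }.mkString("/"))

// A LinkedHashMap keeps insertion order, giving year=2014/month=11/day=03.
val ordered = mutable.LinkedHashMap(declared: _*)
println(ordered.map { case (k, v) => s"$k=$v" }.mkString("/"))
{code}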



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4201) Can't use concat() on partition column in where condition (Hive compatibility problem)

2014-11-03 Thread Venkata Ramana G (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194847#comment-14194847
 ] 

Venkata Ramana G commented on SPARK-4201:
-

I found that the same query works on the latest master; please confirm.

 Can't use concat() on partition column in where condition (Hive compatibility 
 problem)
 --

 Key: SPARK-4201
 URL: https://issues.apache.org/jira/browse/SPARK-4201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0, 1.1.0
 Environment: Hive 0.12+hadoop 2.4/hadoop 2.2 +spark 1.1
Reporter: dongxu
Priority: Minor
  Labels: com

 The team used Hive for queries; we are trying to move to Spark SQL. 
 When I run a query like 
 select count(1) from gulfstream_day_driver_base_2 where 
 concat(year,month,day) = '20140929';
 it doesn't work, but it works well in Hive.
 I have to rewrite the SQL as select count(1) from 
 gulfstream_day_driver_base_2 where year = 2014 and month = 09 and day = 29.
 Here is the error log:
 14/11/03 15:05:03 ERROR SparkSQLDriver: Failed in [select count(1) from  
 gulfstream_day_driver_base_2 where  concat(year,month,day) = '20140929']
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
 Aggregate false, [], [SUM(PartialCount#1390L) AS c_0#1337L]
  Exchange SinglePartition
   Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
HiveTableScan [], (MetastoreRelation default, 
 gulfstream_day_driver_base_2, None), 
 Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341)
  = 20140929))
   at 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
   at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
   at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
   at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
   at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
 execute, tree:
 Exchange SinglePartition
  Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
   HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, 
 None), 
 Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341)
  = 20140929))
   at 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
   at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
   at 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
   ... 16 more
 Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
 execute, tree:
 Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
  HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, 
 None), 
 Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341)
  = 20140929))
   at 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
   at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
   at 
 org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86)
   at 
 org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
   at 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
 

[jira] [Created] (SPARK-4210) Add Extra-Trees algorithm to MLlib

2014-11-03 Thread Vincent Botta (JIRA)
Vincent Botta created SPARK-4210:


 Summary: Add Extra-Trees algorithm to MLlib
 Key: SPARK-4210
 URL: https://issues.apache.org/jira/browse/SPARK-4210
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Vincent Botta


This task will add Extra-Trees support to Spark MLlib. The implementation could 
be inspired by the current Random Forest algorithm. This algorithm is 
expected to be particularly well suited since sorting of attributes is not 
required, as opposed to the original Random Forest approach (with similar 
and/or better predictive power). 

The task involves:
- Code implementation
- Unit tests
- Functional tests
- Performance tests
- Documentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-11-03 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-2426:

Affects Version/s: (was: 1.0.0)
   1.2.0

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS
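
For readers unfamiliar with the bound-constrained formulation, here is a rough,
illustrative sketch of minimizing 0.5*x'Hx + c'x subject to box constraints via
projected gradient. It is not the proposed QuadraticMinimizer/Proximal API; the
object name, step size, and iteration count are arbitrary:
{code}
// Illustrative projected-gradient solver for min 0.5*x'Hx + c'x, l <= x <= u.
object BoundedQpSketch {
  def solve(h: Array[Array[Double]], c: Array[Double],
            l: Array[Double], u: Array[Double],
            step: Double = 0.01, iters: Int = 500): Array[Double] = {
    val n = c.length
    val x = Array.fill(n)(0.0)
    for (_ <- 0 until iters) {
      // gradient = H*x + c, computed from the current iterate
      val g = Array.tabulate(n)(i => (0 until n).map(j => h(i)(j) * x(j)).sum + c(i))
      for (i <- 0 until n) {
        // gradient step followed by projection onto the box [l, u]
        x(i) = math.min(u(i), math.max(l(i), x(i) - step * g(i)))
      }
    }
    x
  }
}

// Tiny 2x2 example with nonnegativity bounds.
val h = Array(Array(2.0, 0.0), Array(0.0, 2.0))
val x = BoundedQpSketch.solve(h, c = Array(-2.0, 1.0), l = Array(0.0, 0.0), u = Array(10.0, 10.0))
println(x.mkString(", "))  // approaches (1.0, 0.0)
{code}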



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-11-03 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-2426:

Affects Version/s: (was: 1.2.0)
   1.3.0

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2938) Support SASL authentication in Netty network module

2014-11-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194944#comment-14194944
 ] 

Reynold Xin commented on SPARK-2938:


I think we are still going to add it to 1.2 (assuming the change is not too 
invasive - which it shouldn't be). It would be great if you could review it 
once the pull request is ready in the next couple of days.

 Support SASL authentication in Netty network module
 ---

 Key: SPARK-2938
 URL: https://issues.apache.org/jira/browse/SPARK-2938
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1070) Add check for JIRA ticket in the Github pull request title/summary with CI

2014-11-03 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194952#comment-14194952
 ] 

Nicholas Chammas commented on SPARK-1070:
-

[~hsaputra] - The [Spark PR Board | https://spark-prs.appspot.com/] 
automatically parses JIRA ticket IDs in the PR titles. Does that address the 
need behind this request?

cc [~joshrosen]

 Add check for JIRA ticket in the Github pull request title/summary with CI
 --

 Key: SPARK-1070
 URL: https://issues.apache.org/jira/browse/SPARK-1070
 Project: Spark
  Issue Type: Task
  Components: Build
Reporter: Henry Saputra
Assignee: Mark Hamstra
Priority: Minor

 As part of the discussion in the dev@ list about adding an audit trail from 
 Spark's Github pull requests (PRs) to JIRA, we need to add a check, maybe in 
 the Jenkins CI, to verify that PRs contain a JIRA ticket number in the 
 title/summary.
  
 There may be some PRs that do not need a ticket, so probably add support for 
 some magic keyword to bypass the check. But this should only be done in rare 
 cases.
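
A sketch of the kind of title check Jenkins could run; the regex and the bypass
keyword ("[NOJIRA]") are illustrative assumptions, not an agreed convention:
{code}
// Accept a PR title if it references a JIRA ticket (e.g. "[SPARK-1070] ...")
// or contains a bypass keyword for the rare PRs that need no ticket.
def titleIsAcceptable(title: String): Boolean = {
  val jiraId = """SPARK-\d+""".r
  jiraId.findFirstIn(title).isDefined || title.contains("[NOJIRA]")
}

println(titleIsAcceptable("[SPARK-1070] Add JIRA check to Jenkins"))  // true
println(titleIsAcceptable("[NOJIRA] Fix typo in README"))             // true
println(titleIsAcceptable("Fix typo in README"))                      // false
{code}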



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1070) Add check for JIRA ticket in the Github pull request title/summary with CI

2014-11-03 Thread Henry Saputra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194961#comment-14194961
 ] 

Henry Saputra commented on SPARK-1070:
--

[~nchammas], way back when Patrick proposed the right way to send PRs, there 
was a discussion about forcing PRs to have a JIRA ticket prefix in the summary.
This ticket was filed to address that issue/idea.

 Add check for JIRA ticket in the Github pull request title/summary with CI
 --

 Key: SPARK-1070
 URL: https://issues.apache.org/jira/browse/SPARK-1070
 Project: Spark
  Issue Type: Task
  Components: Build
Reporter: Henry Saputra
Assignee: Mark Hamstra
Priority: Minor

 As part of discussion in the dev@ list to add audit trail of Spark's Github 
 pull requests (PR) to JIRA, need to add check maybe in the Jenkins CI to 
 verify that the PRs contain JIRA ticket number in the title/ summary.
  
 There are maybe some PRs that may not need ticket so probably add support for 
 some magic keyword to bypass the check. But this should be done in rare 
 cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4211) Spark POM hive-0.13.1 profile sets incorrect hive version property

2014-11-03 Thread Fi (JIRA)
Fi created SPARK-4211:
-

 Summary: Spark POM hive-0.13.1 profile sets incorrect hive version 
property
 Key: SPARK-4211
 URL: https://issues.apache.org/jira/browse/SPARK-4211
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Fi


The fix in SPARK-3826 added a new maven profile 'hive-0.13.1'.
By default, it sets the maven property to `hive.version=0.13.1a`.
This special hive version resolves dependency issues with Hive 0.13+

However, when explicitly specifying the hive-0.13.1 maven profile, the 
'hive.version=0.13.1' property would be set instead of 'hive.version=0.13.1a'

e.g. mvn -Phive -Phive-0.13.1

Also see: https://github.com/apache/spark/pull/2685



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4211) Spark POM hive-0.13.1 profile sets incorrect hive version property

2014-11-03 Thread Fi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fi updated SPARK-4211:
--
Fix Version/s: 1.2.0

 Spark POM hive-0.13.1 profile sets incorrect hive version property
 --

 Key: SPARK-4211
 URL: https://issues.apache.org/jira/browse/SPARK-4211
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Fi
 Fix For: 1.2.0


 The fix in SPARK-3826 added a new maven profile 'hive-0.13.1'.
 By default, it sets the maven property to `hive.version=0.13.1a`.
 This special hive version resolves dependency issues with Hive 0.13+
 However, when explicitly specifying the hive-0.13.1 maven profile, the 
 'hive.version=0.13.1' property would be set instead of 'hive.version=0.13.1a'
 e.g. mvn -Phive -Phive-0.13.1
 Also see: https://github.com/apache/spark/pull/2685



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4211) Spark POM hive-0.13.1 profile sets incorrect hive version property

2014-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194973#comment-14194973
 ] 

Apache Spark commented on SPARK-4211:
-

User 'coderfi' has created a pull request for this issue:
https://github.com/apache/spark/pull/3072

 Spark POM hive-0.13.1 profile sets incorrect hive version property
 --

 Key: SPARK-4211
 URL: https://issues.apache.org/jira/browse/SPARK-4211
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Fi
 Fix For: 1.2.0


 The fix in SPARK-3826 added a new maven profile 'hive-0.13.1'.
 By default, it sets the maven property to `hive.version=0.13.1a`.
 This special hive version resolves dependency issues with Hive 0.13+
 However, when explicitly specifying the hive-0.13.1 maven profile, the 
 'hive.version=0.13.1' property would be set instead of 'hive.version=0.13.1a'
 e.g. mvn -Phive -Phive-0.13.1
 Also see: https://github.com/apache/spark/pull/2685



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2938) Support SASL authentication in Netty network module

2014-11-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2938:
---
Priority: Blocker  (was: Major)

 Support SASL authentication in Netty network module
 ---

 Key: SPARK-2938
 URL: https://issues.apache.org/jira/browse/SPARK-2938
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Aaron Davidson
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext

2014-11-03 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang closed SPARK-4199.

Resolution: Invalid

 Drop table if exists raises table not found exception in HiveContext
 --

 Key: SPARK-4199
 URL: https://issues.apache.org/jira/browse/SPARK-4199
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jianshi Huang

 Try this:
   sql("DROP TABLE IF EXISTS some_table")
 The exception looks like this:
 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS 
 some_table
 14/11/02 19:55:29 INFO ParseDriver: Parse Completed
 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 
 end=1414986929678 duration=0
 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze
 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation 
 class:org.apache.hadoop.hive.metastore.ObjectStore
 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called
 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: 
 Table not found some_table
 org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249)
 at 
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
 at 
 org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294)
 at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273)
 at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
 at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
 at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:44)
 at 
 org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353)
 at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:104)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext

2014-11-03 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195045#comment-14195045
 ] 

Jianshi Huang commented on SPARK-4199:
--

Turned out it was caused by the wrong version of the datanucleus jars in my 
Spark build directory. Somehow I had two versions of datanucleus...

After removing the wrong version, everything works now.

Thanks!

Jianshi

 Drop table if exists raises table not found exception in HiveContext
 --

 Key: SPARK-4199
 URL: https://issues.apache.org/jira/browse/SPARK-4199
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jianshi Huang

 Try this:
   sql("DROP TABLE IF EXISTS some_table")
 The exception looks like this:
 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS 
 some_table
 14/11/02 19:55:29 INFO ParseDriver: Parse Completed
 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 
 end=1414986929678 duration=0
 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze
 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation 
 class:org.apache.hadoop.hive.metastore.ObjectStore
 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called
 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: 
 Table not found some_table
 org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249)
 at 
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
 at 
 org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294)
 at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273)
 at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
 at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
 at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:44)
 at 
 org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353)
 at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:104)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4212) Actor not found

2014-11-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-4212:
-

 Summary: Actor not found
 Key: SPARK-4212
 URL: https://issues.apache.org/jira/browse/SPARK-4212
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Davies Liu


Tried to run a PySpark test, but it hung:

NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
ahead of assembly.
14/11/03 12:32:58 WARN Remoting: Tried to associate with unreachable remote 
address [akka.tcp://sparkDriver@dm:7077]. Address is now gated for 5000 ms, all 
messages to this address will be delivered to dead letters. Reason: Connection 
refused: dm/192.168.1.11:7077
14/11/03 12:32:58 ERROR OneForOneStrategy: Actor not found for: 
ActorSelection[Anchor(akka.tcp://sparkDriver@dm:7077/), 
Path(/user/HeartbeatReceiver)]
akka.actor.ActorInitializationException: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
at akka.actor.ActorCell.create(ActorCell.scala:596)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.actor.ActorNotFound: Actor not found for: 
ActorSelection[Anchor(akka.tcp://sparkDriver@dm:7077/), 
Path(/user/HeartbeatReceiver)]
at 
akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
at 
akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at 
scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at 
akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at 
akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
at 
akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
at 
akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)
at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)
at 
akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
at 
akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
at akka.actor.ActorCell.terminate(ActorCell.scala:369)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
... 8 more




^CTraceback (most recent call last):
  File "python/pyspark/tests.py", line 1627, in <module>
    unittest.main()
  File "//anaconda/lib/python2.7/unittest/main.py", line 95, in __init__
    self.runTests()
  File "//anaconda/lib/python2.7/unittest/main.py", line 232, in runTests
    self.result = testRunner.run(self.test)
  File "//anaconda/lib/python2.7/unittest/runner.py", line 151, in run
    test(result)
  File "//anaconda/lib/python2.7/unittest/suite.py", line 70, in __call__
    return self.run(*args, **kwds)
  File "//anaconda/lib/python2.7/unittest/suite.py", line 108, in run
    test(result)
  File "//anaconda/lib/python2.7/unittest/suite.py", line 70, in __call__
    return self.run(*args, **kwds)
  File "//anaconda/lib/python2.7/unittest/suite.py", line 

[jira] [Commented] (SPARK-4186) Support binaryFiles and binaryRecords API in Python

2014-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195054#comment-14195054
 ] 

Apache Spark commented on SPARK-4186:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3078

 Support binaryFiles and binaryRecords API in Python
 ---

 Key: SPARK-4186
 URL: https://issues.apache.org/jira/browse/SPARK-4186
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core
Reporter: Matei Zaharia
Assignee: Davies Liu

 After SPARK-2759, we should expose these methods in Python. Shouldn't be too 
 hard to add.
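
For context, the Scala API added by SPARK-2759 that this issue asks to expose
in Python looks roughly like the following; the paths and record length are
placeholders:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.input.PortableDataStream

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("binary-io"))

// Whole binary files: one (path, stream) pair per file.
val files = sc.binaryFiles("hdfs:///tmp/blobs")  // RDD[(String, PortableDataStream)]
val sizes = files.map { case (path, stream) => (path, stream.toArray().length) }

// Fixed-length binary records: one Array[Byte] per record.
val records = sc.binaryRecords("hdfs:///tmp/records.bin", recordLength = 16)

println(sizes.count() + " files, " + records.count() + " records")
{code}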



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4211) Spark POM hive-0.13.1 profile sets incorrect hive version property

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4211.
-
Resolution: Fixed

Issue resolved by pull request 3072
[https://github.com/apache/spark/pull/3072]

 Spark POM hive-0.13.1 profile sets incorrect hive version property
 --

 Key: SPARK-4211
 URL: https://issues.apache.org/jira/browse/SPARK-4211
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Fi
 Fix For: 1.2.0


 The fix in SPARK-3826 added a new maven profile 'hive-0.13.1'.
 By default, it sets the maven property to `hive.version=0.13.1a`.
 This special hive version resolves dependency issues with Hive 0.13+
 However, when explicitly specifying the hive-0.13.1 maven profile, the 
 'hive.version=0.13.1' property would be set instead of 'hive.version=0.13.1a'
 e.g. mvn -Phive -Phive-0.13.1
 Also see: https://github.com/apache/spark/pull/2685



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3541) Improve ALS internal storage

2014-11-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3541:
-
Target Version/s: 1.3.0  (was: 1.2.0)

 Improve ALS internal storage
 

 Key: SPARK-3541
 URL: https://issues.apache.org/jira/browse/SPARK-3541
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
   Original Estimate: 96h
  Remaining Estimate: 96h

 The internal storage of ALS uses many small objects, which increases GC 
 pressure and makes it difficult to scale ALS to very large datasets, e.g., 50 
 billion ratings. In such cases, the full GC may take more than 10 minutes to 
 finish. That is longer than the default heartbeat timeout and hence executors 
 will be removed under default settings.
 We can use primitive arrays to reduce the number of objects significantly. 
 This requires a big change to the ALS implementation.
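
An illustrative comparison of the two layouts (not the actual ALS data
structures):
{code}
// Illustrative only: the same block of ratings stored as many small objects
// vs. as three flat primitive arrays.
case class Rating(user: Int, item: Int, rating: Float)

// Object-heavy layout: one heap object per rating for the GC to trace.
class RatingBlock(val ratings: Array[Rating]) {
  def rating(i: Int): Float = ratings(i).rating
}

// GC-friendly layout: three primitive arrays, regardless of how many ratings there are.
class RatingBlockFlat(val users: Array[Int],
                      val items: Array[Int],
                      val ratings: Array[Float]) {
  def size: Int = users.length
  def rating(i: Int): Float = ratings(i)
}
{code}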



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4207) Query which has syntax like 'not like' is not working in Spark SQL

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4207.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3075
[https://github.com/apache/spark/pull/3075]

 Query which has syntax like 'not like' is not working in Spark SQL
 --

 Key: SPARK-4207
 URL: https://issues.apache.org/jira/browse/SPARK-4207
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala
Assignee: Ravindra Pesala
 Fix For: 1.2.0


 Queries which have 'not like' are not working in Spark SQL. The same works in 
 Spark HiveQL.
 {code}
 sql("SELECT * FROM records where value not like 'val%'")
 {code}
 The above query fails with the exception below:
 {code}
 Exception in thread "main" java.lang.RuntimeException: [1.39] failure: ``IN'' 
 expected but `like' found
 SELECT * FROM records where value not like 'val%'
   ^
   at scala.sys.package$.error(package.scala:27)
   at 
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75)
   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75)
   at 
 org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:186)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3594) try more rows during inferSchema

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3594.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2716
[https://github.com/apache/spark/pull/2716]

 try more rows during inferSchema
 

 Key: SPARK-3594
 URL: https://issues.apache.org/jira/browse/SPARK-3594
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.2.0


 If there are some empty values in the first row of an RDD of Row, 
 inferSchema will fail.
 It's better to try more rows and combine them together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4210) Add Extra-Trees algorithm to MLlib

2014-11-03 Thread Manish Amde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195108#comment-14195108
 ] 

Manish Amde commented on SPARK-4210:


[~0asa] Thanks for creating the JIRA.

From the scikit-learn documentation: "As in random forests, a random subset of 
candidate features is used, but instead of looking for the most discriminative 
thresholds, thresholds are drawn at random for each candidate feature and the 
best of these randomly-generated thresholds is picked as the splitting rule. 
This usually allows to reduce the variance of the model a bit more, at the 
expense of a slightly greater increase in bias." This might lead to 
interesting implementation tradeoffs. Could you please discuss how you plan to 
implement the findBestSplit method for this?
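
A rough sketch of the randomized split selection described in the quote above:
one random threshold per candidate feature, keep the best by impurity gain. The
impurity measure (variance reduction) and the local data layout are purely
illustrative, not a proposal for the actual findBestSplit signature:
{code}
import scala.util.Random

case class Split(feature: Int, threshold: Double, gain: Double)

def variance(ys: Seq[Double]): Double =
  if (ys.isEmpty) 0.0
  else {
    val mean = ys.sum / ys.size
    ys.map(y => (y - mean) * (y - mean)).sum / ys.size
  }

// points: (featureVector, label) pairs reaching the current node.
def findRandomizedSplit(points: Seq[(Array[Double], Double)],
                        candidateFeatures: Seq[Int],
                        rng: Random): Option[Split] = {
  val total = variance(points.map(_._2))
  val splits = candidateFeatures.flatMap { f =>
    val values = points.map(_._1(f))
    val (lo, hi) = (values.min, values.max)
    if (lo == hi) None                                   // constant feature: no split
    else {
      val t = lo + rng.nextDouble() * (hi - lo)          // one random threshold
      val (left, right) = points.partition(_._1(f) <= t)
      val weighted = (left.size * variance(left.map(_._2)) +
                      right.size * variance(right.map(_._2))) / points.size
      Some(Split(f, t, total - weighted))                // impurity gain of this split
    }
  }
  if (splits.isEmpty) None else Some(splits.maxBy(_.gain))
}
{code}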

Also, please note down the related literature (it's a relatively new algorithm) 
so that people not familiar with this algorithm can understand the suitability 
of this algorithm for MLlib.

[~mengxr] Could you please assign the ticket to [~0asa]?

 Add Extra-Trees algorithm to MLlib
 --

 Key: SPARK-4210
 URL: https://issues.apache.org/jira/browse/SPARK-4210
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Vincent Botta

 This task will add Extra-Trees support to Spark MLlib. The implementation 
 could be inspired by the current Random Forest algorithm. This algorithm is 
 expected to be particularly well suited since sorting of attributes is not 
 required, as opposed to the original Random Forest approach (with similar 
 and/or better predictive power). 
 The task involves:
 - Code implementation
 - Unit tests
 - Functional tests
 - Performance tests
 - Documentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-11-03 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195146#comment-14195146
 ] 

Cody Koeninger commented on SPARK-3146:
---

I think this PR is an elegant way to solve SPARK-2388, which is an otherwise 
blocking bug for our usage of kafka.

Absent a concrete design for doing something equivalent for all InputDStreams, 
I'd encourage merging it. 

 Improve the flexibility of Spark Streaming Kafka API to offer user the 
 ability to process message before storing into BM
 

 Key: SPARK-3146
 URL: https://issues.apache.org/jira/browse/SPARK-3146
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2, 1.1.0
Reporter: Saisai Shao

 Currently the Spark Streaming Kafka API stores the key and value of each 
 message into BM for processing; this may lose flexibility for different 
 requirements:
 1. Currently topic/partition/offset information for each message is discarded 
 by KafkaInputDStream. In some scenarios people may need this information to 
 better filter the message, as SPARK-2388 described.
 2. People may need to add a timestamp to each message when feeding it into 
 Spark Streaming, which can better measure the system latency.
 3. Checkpointing the partition/offsets or others...
 So here we add a messageHandler to the interface to give people the 
 flexibility to preprocess the message before storing it into BM. At the same 
 time, this improvement stays compatible with the current API.
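
A sketch of what such a messageHandler hook could look like; the record type
and field names are illustrative, and the actual API in the PR may differ:
{code}
import kafka.message.MessageAndMetadata

// A user-supplied record type that keeps the metadata SPARK-2388 needs,
// plus an ingestion timestamp (field names here are illustrative).
case class KafkaRecord[K, V](topic: String, partition: Int, offset: Long,
                             key: K, value: V, receivedAtMs: Long)

// The handler runs before the message is stored in the BlockManager, so the
// DStream can carry KafkaRecord instead of just (key, value).
def messageHandler[K, V](mmd: MessageAndMetadata[K, V]): KafkaRecord[K, V] =
  KafkaRecord(mmd.topic, mmd.partition, mmd.offset,
              mmd.key(), mmd.message(), System.currentTimeMillis())
{code}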



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

2014-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195168#comment-14195168
 ] 

Apache Spark commented on SPARK-1021:
-

User 'erikerlandson' has created a pull request for this issue:
https://github.com/apache/spark/pull/3079

 sortByKey() launches a cluster job when it shouldn't
 

 Key: SPARK-1021
 URL: https://issues.apache.org/jira/browse/SPARK-1021
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 0.8.0, 0.9.0, 1.0.0, 1.1.0
Reporter: Andrew Ash
Assignee: Erik Erlandson
  Labels: starter
 Fix For: 1.2.0


 The sortByKey() method is listed as a transformation, not an action, in the 
 documentation.  But it launches a cluster job regardless.
 http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
 Some discussion on the mailing list suggested that this is a problem with the 
 rdd.count() call inside Partitioner.scala's rangeBounds method.
 https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
 Josh Rosen suggests that rangeBounds should be made into a lazy variable:
 {quote}
 I wonder whether making RangePartitioner.rangeBounds into a lazy val would 
 fix this 
 (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
   We'd need to make sure that rangeBounds() is never called before an action 
 is performed.  This could be tricky because it's called in the 
 RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
 number of partitions, the ids of the RDDs used to create the 
 RangePartitioner, and the sort ordering.  This still supports the case where 
 I range-partition one RDD and pass the same partitioner to a different RDD.  
 It breaks support for the case where two range partitioners created on 
 different RDDs happened to have the same rangeBounds(), but it seems unlikely 
 that this would really harm performance since it's probably unlikely that the 
 range partitioners are equal by chance.
 {quote}
 Can we please make this happen?  I'll send a PR on GitHub to start the 
 discussion and testing.
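
A simplified sketch of that suggestion (not the real Partitioner.scala code):
rangeBounds becomes a lazy val so constructing the partitioner no longer runs a
job, and equals() compares cheap identifiers instead of forcing the bounds. The
key type, sampling fraction, and bucket lookup are arbitrary:
{code}
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

class LazyRangePartitioner(partitions: Int, rdd: RDD[(Double, String)])
  extends Partitioner {

  private val sourceRddId = rdd.id

  // Sampling the RDD runs a job; defer it until a partition is actually requested.
  private lazy val rangeBounds: Array[Double] =
    rdd.map(_._1).sample(withReplacement = false, 0.01).collect().sorted

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Double]
    val idx = rangeBounds.indexWhere(k <= _)
    if (idx < 0) numPartitions - 1 else math.min(idx, numPartitions - 1)
  }

  // Compare partition count and source RDD id rather than rangeBounds itself,
  // so equals() never triggers the sampling job.
  override def equals(other: Any): Boolean = other match {
    case o: LazyRangePartitioner =>
      o.numPartitions == numPartitions && o.sourceRddId == sourceRddId
    case _ => false
  }
  override def hashCode(): Int = 31 * numPartitions + sourceRddId
}
{code}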



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4152) Avoid data change in CTAS while table already existed

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4152.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3013
[https://github.com/apache/spark/pull/3013]

 Avoid data change in CTAS while table already existed
 -

 Key: SPARK-4152
 URL: https://issues.apache.org/jira/browse/SPARK-4152
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor
 Fix For: 1.2.0


 CREATE TABLE t1 (a String);
 CREATE TABLE t1 AS SELECT key FROM src; -- throws an exception
 CREATE TABLE if not exists t1 AS SELECT key FROM src; -- expected to do 
 nothing, but actually overwrites t1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4213) SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators

2014-11-03 Thread Terry Siu (JIRA)
Terry Siu created SPARK-4213:


 Summary: SparkSQL - ParquetFilters - No support for LT, LTE, GT, 
GTE operators
 Key: SPARK-4213
 URL: https://issues.apache.org/jira/browse/SPARK-4213
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
 Environment: CDH5.2, Hive 0.13.1, Spark 1.2 snapshot (commit hash 
76386e1a23c)
Reporter: Terry Siu
 Fix For: 1.2.0


When I issue an HQL query against a HiveContext where my predicate uses a column 
of string type with one of the LT, LTE, GT, or GTE operators, I get the following 
error:

scala.MatchError: StringType (of class 
org.apache.spark.sql.catalyst.types.StringType$)

Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType 
is absent from the corresponding functions for creating these filters.

To reproduce, in a Hive 0.13.1 shell, I created the following table (at a 
specified DB):

create table sparkbug (
  id int,
  event string
) stored as parquet;

Insert some sample data:

insert into table sparkbug select 1, '2011-06-18' from <some table> limit 1;
insert into table sparkbug select 2, '2012-01-01' from <some table> limit 1;

Launch a spark shell and create a HiveContext to the metastore where the table 
above is located.

import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
hc.setConf("spark.sql.shuffle.partitions", "10")
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
hc.setConf("spark.sql.parquet.compression.codec", "snappy")
import hc._
hc.hql("select * from db.sparkbug where event >= '2011-12-01'")

A scala.MatchError will appear in the output.
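
A self-contained illustration of the failure mode, using made-up types rather
than Spark's real ParquetFilters code: a filter builder whose pattern match
omits StringType throws scala.MatchError as soon as a string column shows up in
a predicate, and the fix is to add the missing case:
{code}
sealed trait ColumnType
case object IntType extends ColumnType
case object DoubleType extends ColumnType
case object StringType extends ColumnType

// Non-exhaustive match: no StringType case, so string predicates blow up.
def makeGtEqFilter(tpe: ColumnType, column: String, value: Any): String = tpe match {
  case IntType    => s"$column >= ${value.asInstanceOf[Int]}"
  case DoubleType => s"$column >= ${value.asInstanceOf[Double]}"
  // the fix is to add a StringType case here
}

makeGtEqFilter(IntType, "id", 1)                   // fine
makeGtEqFilter(StringType, "event", "2011-12-01")  // scala.MatchError: StringType
{code}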



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4213) SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4213:

Priority: Blocker  (was: Major)

 SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators
 -

 Key: SPARK-4213
 URL: https://issues.apache.org/jira/browse/SPARK-4213
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
 Environment: CDH5.2, Hive 0.13.1, Spark 1.2 snapshot (commit hash 
 76386e1a23c)
Reporter: Terry Siu
Priority: Blocker
 Fix For: 1.2.0


 When I issue a hql query against a HiveContext where my predicate uses a 
 column of string type with one of LT, LTE, GT, or GTE operator, I get the 
 following error:
 scala.MatchError: StringType (of class 
 org.apache.spark.sql.catalyst.types.StringType$)
 Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, 
 StringType is absent from the corresponding functions for creating these 
 filters.
 To reproduce, in a Hive 0.13.1 shell, I created the following table (at a 
 specified DB):
 create table sparkbug (
   id int,
   event string
 ) stored as parquet;
 Insert some sample data:
 insert into table sparkbug select 1, '2011-06-18' from <some table> limit 1;
 insert into table sparkbug select 2, '2012-01-01' from <some table> limit 1;
 Launch a spark shell and create a HiveContext to the metastore where the 
 table above is located.
 import org.apache.spark.sql._
 import org.apache.spark.sql.SQLContext
 import org.apache.spark.sql.hive.HiveContext
 val hc = new HiveContext(sc)
 hc.setConf("spark.sql.shuffle.partitions", "10")
 hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
 hc.setConf("spark.sql.parquet.compression.codec", "snappy")
 import hc._
 hc.hql("select * from db.sparkbug where event >= '2011-12-01'")
 A scala.MatchError will appear in the output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3238) Commas/spaces/dashes are not escaped properly when transferring schema information to parquet readers

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3238.
-
Resolution: Fixed
  Assignee: Cheng Lian

I think this was fixed by the conversion to JSON for serializing the schema.

 Commas/spaces/dashes are not escaped properly when transferring schema 
 information to parquet readers
 -

 Key: SPARK-3238
 URL: https://issues.apache.org/jira/browse/SPARK-3238
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Michael Armbrust
Assignee: Cheng Lian





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2883:

Target Version/s: 1.3.0  (was: 1.2.0)

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Priority: Blocker
 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png, orc.diff


 Verify the support of OrcInputFormat in Spark, fix issues if any exist, and 
 add documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3618) Store analyzed plans for temp tables

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-3618:
---

Assignee: Michael Armbrust

 Store analyzed plans for temp tables
 

 Key: SPARK-3618
 URL: https://issues.apache.org/jira/browse/SPARK-3618
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust

 Right now we store unanalyzed logical plans for temporary tables.  However 
 this means that changes to session state (e.g., the current database) could 
 result in tables becoming inaccessible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3618) Store analyzed plans for temp tables

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3618.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

This was done as part of the caching overhaul.

 Store analyzed plans for temp tables
 

 Key: SPARK-3618
 URL: https://issues.apache.org/jira/browse/SPARK-3618
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
 Fix For: 1.2.0


 Right now we store unanalyzed logical plans for temporary tables.  However 
 this means that changes to session state (e.g., the current database) could 
 result in tables becoming inaccessible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3575:

Target Version/s: 1.3.0  (was: 1.2.0)

 Hive Schema is ignored when using convertMetastoreParquet
 -

 Key: SPARK-3575
 URL: https://issues.apache.org/jira/browse/SPARK-3575
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian

 This can cause problems when for example one of the columns is defined as 
 TINYINT.  A class cast exception will be thrown since the parquet table scan 
 produces INTs while the rest of the execution is expecting bytes.
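
For illustration, the runtime effect of that mismatch boils down to an unboxing
cast failure (generic JVM behavior, not Spark-specific code):
{code}
// The Parquet scan hands back a boxed Int for a column whose Hive schema says
// TINYINT, so downstream code tries to treat it as a Byte and fails.
val fromScan: Any = 1
val asByte = fromScan.asInstanceOf[Byte]
// => java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Byte
{code}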



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3575:

Priority: Critical  (was: Major)

 Hive Schema is ignored when using convertMetastoreParquet
 -

 Key: SPARK-3575
 URL: https://issues.apache.org/jira/browse/SPARK-3575
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Critical

 This can cause problems when for example one of the columns is defined as 
 TINYINT.  A class cast exception will be thrown since the parquet table scan 
 produces INTs while the rest of the execution is expecting bytes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3575:

Target Version/s: 1.2.0  (was: 1.3.0)

 Hive Schema is ignored when using convertMetastoreParquet
 -

 Key: SPARK-3575
 URL: https://issues.apache.org/jira/browse/SPARK-3575
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian

 This can cause problems when for example one of the columns is defined as 
 TINYINT.  A class cast exception will be thrown since the parquet table scan 
 produces INTs while the rest of the execution is expecting bytes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3440) HiveServer2 and CLI should retrieve Hive result set schema

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3440:

Target Version/s: 1.3.0  (was: 1.2.0)

 HiveServer2 and CLI should retrieve Hive result set schema
 --

 Key: SPARK-3440
 URL: https://issues.apache.org/jira/browse/SPARK-3440
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
Reporter: Cheng Lian

 When executing Hive native queries/commands with {{HiveContext.runHive}}, 
 Spark SQL only calls {{Driver.getResults}} and returns a {{Seq\[String\]}}. 
 The schema of the result set is not retrieved, and thus it is not possible to 
 split the row string into proper columns and assign column names to them. For 
 example, currently every {{NativeCommand}} only returns a single column named 
 {{result}}.
 For existing Hive applications that rely on result set schemas, this breaks 
 compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class)

2014-11-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195240#comment-14195240
 ] 

Michael Armbrust commented on SPARK-2710:
-

Now that it's been merged, it would be great if this feature could be 
implemented using the DataSource API.

 Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class)
 ---

 Key: SPARK-2710
 URL: https://issues.apache.org/jira/browse/SPARK-2710
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Teng Qiu

 Spark SQL can take Parquet files or JSON files as a table directly (without 
 being given a case class to define the schema).
 As a component named SQL, it should also be able to take a ResultSet from an 
 RDBMS easily.
 I find that there is a JdbcRDD in core: 
 core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
 So I want to make a small change in this file to allow SQLContext to read 
 the MetaData from the PreparedStatement (reading the metadata does not 
 require actually executing the query).
 Then, in Spark SQL, SQLContext can create a SchemaRDD from the JdbcRDD and 
 its MetaData.
 In the future, maybe we can add a feature to the sql-shell, so that users can 
 use the spark-thrift-server to join tables from different sources,
 such as:
 {code}
 CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password 
 initQuery bound ...
 CREATE TABLE parquet_files AS PARQUET hdfs://tmp/parquet_table/
 SELECT parquet_files.colX, jdbc_tbl1.colY
   FROM parquet_files
   JOIN jdbc_tbl1
 ON (parquet_files.id = jdbc_tbl1.id)
 {code}
 I think such a feature would be useful, like the Facebook Presto engine does.
 Oh, and there is a small bug in JdbcRDD, in compute()'s close() method:
 {code}
 if (null != conn && ! stmt.isClosed()) conn.close()
 {code}
 should be
 {code}
 if (null != conn && ! conn.isClosed()) conn.close()
 {code}
 Just a small typo :)
 but such a close method will never be able to close conn...
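
A sketch of the core idea: read the ResultSet schema from the PreparedStatement
metadata without executing the query. The connection URL, credentials, table,
and the JDBC-to-Spark type mapping are placeholders, and some drivers only
expose this metadata after execution:
{code}
import java.sql.{DriverManager, Types}

val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "password")
try {
  val stmt = conn.prepareStatement("SELECT id, name, created_at FROM some_table")
  val meta = stmt.getMetaData            // metadata only; no query execution (may be null
                                         // for drivers that require execution first)
  for (i <- 1 to meta.getColumnCount) {
    val sparkType = meta.getColumnType(i) match {   // rough JDBC -> SQL type mapping
      case Types.INTEGER | Types.BIGINT => "IntegerType/LongType"
      case Types.VARCHAR | Types.CHAR   => "StringType"
      case Types.TIMESTAMP              => "TimestampType"
      case _                            => "other"
    }
    println(s"${meta.getColumnName(i)} -> $sparkType")
  }
} finally {
  if (conn != null && !conn.isClosed()) conn.close()
}
{code}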



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class)

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2710:

Target Version/s: 1.3.0  (was: 1.2.0)

 Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class)
 ---

 Key: SPARK-2710
 URL: https://issues.apache.org/jira/browse/SPARK-2710
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Teng Qiu

 Spark SQL can take Parquet files or JSON files as a table directly (without 
 given a case class to define the schema)
 as a component named SQL, it should also be able to take a ResultSet from 
 RDBMS easily.
 i find that there is a JdbcRDD in core: 
 core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
 so i want to make some small change in this file to allow SQLContext to read 
 the MetaData from the PreparedStatement (read metadata do not need to execute 
 the query really).
 Then, in Spark SQL, SQLContext can create SchemaRDD with JdbcRDD and his 
 MetaData.
 In the further, maybe we can add a feature in sql-shell, so that user can 
 using spark-thrift-server join tables from different sources
 such as:
 {code}
 CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password 
 initQuery bound ...
 CREATE TABLE parquet_files AS PARQUET hdfs://tmp/parquet_table/
 SELECT parquet_files.colX, jdbc_tbl1.colY
   FROM parquet_files
   JOIN jdbc_tbl1
 ON (parquet_files.id = jdbc_tbl1.id)
 {code}
 I think such a feature will be useful, like facebook Presto engine does.
 oh, and there is a small bug in JdbcRDD
 in compute(), method close()
 {code}
 if (null != conn  ! stmt.isClosed()) conn.close()
 {code}
 should be
 {code}
 if (null != conn  ! conn.isClosed()) conn.close()
 {code}
 just a small write error :)
 but such a close method will never be able to close conn...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2902) Change default options to be more aggressive

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2902.
-
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Michael Armbrust  (was: Cheng Lian)

 Change default options to be more aggressive
 ---

 Key: SPARK-2902
 URL: https://issues.apache.org/jira/browse/SPARK-2902
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 1.0.1, 1.0.2
Reporter: Cheng Lian
Assignee: Michael Armbrust
 Fix For: 1.2.0


 Compression for in-memory columnar storage is disabled by default; it's time 
 to enable it. It also helps alleviate the OOM issue mentioned in SPARK-2650.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2775) HiveContext does not support dots in column names.

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2775:

Target Version/s: 1.3.0  (was: 1.2.0)

 HiveContext does not support dots in column names. 
 ---

 Key: SPARK-2775
 URL: https://issues.apache.org/jira/browse/SPARK-2775
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai

 When you try the following snippet in hive/console. 
 {code}
 val data = sc.parallelize(Seq("""{"key.number1": "value1", "key.number2": 
 "value2"}"""))
 jsonRDD(data).registerAsTable("jt")
 hql("select `key.number1` from jt")
 {code}
 You will find that the name key.number1 cannot be resolved.
 {code}
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
 attributes: 'key.number1, tree:
 Project ['key.number1]
  LowerCaseSchema 
   Subquery jt
SparkLogicalPlan (ExistingRdd [key.number1#8,key.number2#9], MappedRDD[17] 
 at map at JsonRDD.scala:37)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3720) support ORC in spark sql

2014-11-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195287#comment-14195287
 ] 

René X Parra commented on SPARK-3720:
-

[~zhazhan] should this JIRA ticket be closed (marked as a duplicate of 
SPARK-2883)?


 support ORC in spark sql
 

 Key: SPARK-3720
 URL: https://issues.apache.org/jira/browse/SPARK-3720
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Attachments: orc.diff


 The Optimized Row Columnar (ORC) file format provides a highly efficient way 
 to store data on HDFS. The ORC file format has many advantages, such as:
 1. a single file as the output of each task, which reduces the NameNode's load
 2. Hive type support, including datetime, decimal, and the complex types 
 (struct, list, map, and union)
 3. light-weight indexes stored within the file, used to skip row groups that 
 don't pass predicate filtering and to seek to a given row
 4. block-mode compression based on data type: run-length encoding for integer 
 columns and dictionary encoding for string columns
 5. concurrent reads of the same file using separate RecordReaders
 6. the ability to split files without scanning for markers
 7. a bound on the amount of memory needed for reading or writing
 8. metadata stored using Protocol Buffers, which allows addition and removal 
 of fields
 Spark SQL already supports Parquet; supporting ORC as well would give people 
 more options.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2360) CSV import to SchemaRDDs

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2360.
-
Resolution: Won't Fix

Hey Hossein, I'm going to close this since I think we have decided this feature 
would work best as a separate library using the new Data Source API.

 CSV import to SchemaRDDs
 

 Key: SPARK-2360
 URL: https://issues.apache.org/jira/browse/SPARK-2360
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Hossein Falaki

 I think the first step is to design the interface that we want to present to 
 users.  Mostly this is defining the options available when importing.  Off the 
 top of my head:
 - What is the separator?
 - Provide column names or infer them from the first row.
 - How do we handle multiple files with possibly different schemas?
 - Do we have a method to let users specify the datatypes of the columns, or 
 are they just strings?
 - What types of quoting / escaping do we want to support?
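To make the options above concrete, here is one purely hypothetical shape for such an interface; {{csvFile}} and {{CsvOptions}} are not real Spark APIs, they only illustrate the design questions listed:
{code}
// Hypothetical interface sketch only -- csvFile and CsvOptions are not real APIs.
case class CsvOptions(
  separator: Char = ',',         // what is the separator?
  header: Boolean = true,        // column names from the first row, or provided
  quote: Char = '"',             // quoting behavior
  escape: Char = '\\',           // escaping behavior
  inferTypes: Boolean = false)   // datatypes per column, or everything as strings

// def csvFile(path: String, options: CsvOptions = CsvOptions()): SchemaRDD = ...
{code}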



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2870:

Target Version/s: 1.3.0  (was: 1.2.0)

 Thorough schema inference directly on RDDs of Python dictionaries
 -

 Key: SPARK-2870
 URL: https://issues.apache.org/jira/browse/SPARK-2870
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Reporter: Nicholas Chammas

 h4. Background
 I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
 They process JSON text directly and infer a schema that covers the entire 
 source data set. 
 This is very important with semi-structured data like JSON since individual 
 elements in the data set are free to have different structures. Matching 
 fields across elements may even have different value types.
 For example:
 {code}
 {a: 5}
 {a: cow}
 {code}
 To get a queryable schema that covers the whole data set, you need to infer a 
 schema by looking at the whole data set. The aforementioned 
 {{SQLContext.json...()}} methods do this very well. 
 h4. Feature Request
 What we need is for {{SQLContext.inferSchema()}} to do this, too. 
 Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
 Python dictionaries and does something functionally equivalent to this:
 {code}
 SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
 {code}
 As of 1.0.2, 
 [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
  just looks at the first element in the data set. This won't help much when 
 the structure of the elements in the target RDD is variable.
 h4. Example Use Case
 * You have some JSON text data that you want to analyze using Spark SQL. 
 * You would use one of the {{SQLContext.json...()}} methods, but you need to 
 do some filtering on the data first to remove bad elements--basically, some 
 minimal schema validation.
 * You deserialize the JSON objects to Python {{dict}}s and filter out the 
 bad ones. You now have an RDD of dictionaries.
 * From this RDD, you want a SchemaRDD that captures the schema for the whole 
 data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0

2014-11-03 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195294#comment-14195294
 ] 

Josh Rosen commented on SPARK-4133:
---

Do you happen to be creating multiple running SparkContexts in the same JVM by 
any chance?

 PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
 --

 Key: SPARK-4133
 URL: https://issues.apache.org/jira/browse/SPARK-4133
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Antonio Jesus Navarro
Priority: Blocker

 Snappy-related problems were found when trying to upgrade an existing Spark 
 Streaming app from 1.0.2 to 1.1.0.
 We cannot run an existing 1.0.2 Spark app after upgrading to 1.1.0; an 
 IOException is thrown by snappy (PARSING_ERROR(2)).
 {code}
 Executor task launch worker-0 DEBUG storage.BlockManager - Getting local 
 block broadcast_0
 Executor task launch worker-0 DEBUG storage.BlockManager - Level for block 
 broadcast_0 is StorageLevel(true, true, false, true, 1)
 Executor task launch worker-0 DEBUG storage.BlockManager - Getting block 
 broadcast_0 from memory
 Executor task launch worker-0 DEBUG storage.BlockManager - Getting local 
 block broadcast_0
 Executor task launch worker-0 DEBUG executor.Executor - Task 0's epoch is 0
 Executor task launch worker-0 DEBUG storage.BlockManager - Block broadcast_0 
 not registered locally
 Executor task launch worker-0 INFO  broadcast.TorrentBroadcast - Started 
 reading broadcast variable 0
 sparkDriver-akka.actor.default-dispatcher-4 INFO  
 receiver.ReceiverSupervisorImpl - Registered receiver 0
 Executor task launch worker-0 INFO  util.RecurringTimer - Started timer for 
 BlockGenerator at time 1414656492400
 Executor task launch worker-0 INFO  receiver.BlockGenerator - Started 
 BlockGenerator
 Thread-87 INFO  receiver.BlockGenerator - Started block pushing thread
 Executor task launch worker-0 INFO  receiver.ReceiverSupervisorImpl - 
 Starting receiver
 sparkDriver-akka.actor.default-dispatcher-5 INFO  scheduler.ReceiverTracker - 
 Registered receiver for stream 0 from akka://sparkDriver
 Executor task launch worker-0 INFO  kafka.KafkaReceiver - Starting Kafka 
 Consumer Stream with group: stratioStreaming
 Executor task launch worker-0 INFO  kafka.KafkaReceiver - Connecting to 
 Zookeeper: node.stratio.com:2181
 sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] 
 received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 
 cap=0]) from Actor[akka://sparkDriver/deadLetters]
 sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] 
 received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 
 cap=0]) from Actor[akka://sparkDriver/deadLetters]
 sparkDriver-akka.actor.default-dispatcher-6 DEBUG local.LocalActor - [actor] 
 received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 
 cap=0]) from Actor[akka://sparkDriver/deadLetters]
 sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] 
 handled message (8.442354 ms) 
 StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from 
 Actor[akka://sparkDriver/deadLetters]
 sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] 
 handled message (8.412421 ms) 
 StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from 
 Actor[akka://sparkDriver/deadLetters]
 sparkDriver-akka.actor.default-dispatcher-6 DEBUG local.LocalActor - [actor] 
 handled message (8.385471 ms) 
 StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from 
 Actor[akka://sparkDriver/deadLetters]
 Executor task launch worker-0 INFO  utils.VerifiableProperties - Verifying 
 properties
 Executor task launch worker-0 INFO  utils.VerifiableProperties - Property 
 group.id is overridden to stratioStreaming
 Executor task launch worker-0 INFO  utils.VerifiableProperties - Property 
 zookeeper.connect is overridden to node.stratio.com:2181
 Executor task launch worker-0 INFO  utils.VerifiableProperties - Property 
 zookeeper.connection.timeout.ms is overridden to 1
 Executor task launch worker-0 INFO  broadcast.TorrentBroadcast - Reading 
 broadcast variable 0 took 0.033998997 s
 Executor task launch worker-0 INFO  consumer.ZookeeperConsumerConnector - 
 [stratioStreaming_ajn-stratio-1414656492293-8ecb3e3a], Connecting to 
 zookeeper instance at node.stratio.com:2181
 Executor task launch worker-0 DEBUG zkclient.ZkConnection - Creating new 
 ZookKeeper instance to connect to node.stratio.com:2181.
 ZkClient-EventThread-169-node.stratio.com:2181 INFO  zkclient.ZkEventThread - 
 Starting ZkClient event thread.
 Executor task launch worker-0 INFO  zookeeper.ZooKeeper - Initiating client 
 connection, connectString=node.stratio.com:2181 

[jira] [Updated] (SPARK-2863) Emulate Hive type coercion in native reimplementations of Hive functions

2014-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2863:

Target Version/s: 1.3.0  (was: 1.2.0)

 Emulate Hive type coercion in native reimplementations of Hive functions
 

 Key: SPARK-2863
 URL: https://issues.apache.org/jira/browse/SPARK-2863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton
Assignee: William Benton

 Native reimplementations of Hive functions no longer have the same 
 type-coercion behavior as they would if executed via Hive.  As [Michael 
 Armbrust points 
 out|https://github.com/apache/spark/pull/1750#discussion_r15790970], queries 
 like {{SELECT SQRT(2) FROM src LIMIT 1}} succeed in Hive but fail if 
 {{SQRT}} is implemented natively.
 Spark SQL should have Hive-compatible type coercions for arguments to 
 natively-implemented functions.
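As a hedged sketch of the kind of coercion being asked for, a Catalyst rule could widen integral arguments of the native {{Sqrt}} expression the way Hive would. This assumes the natively implemented {{Sqrt}} node from the PR linked above and is only an illustration, not the actual fix:
{code}
// Sketch only: widen integer arguments to Double before the native Sqrt runs,
// mimicking Hive's implicit coercion for SELECT SQRT(2).
import org.apache.spark.sql.catalyst.expressions.{Cast, Sqrt}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.types.{DoubleType, IntegerType, LongType}

object CoerceSqrtArguments extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Sqrt(child) if child.dataType == IntegerType || child.dataType == LongType =>
      Sqrt(Cast(child, DoubleType))
  }
}
{code}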



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2014-11-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195301#comment-14195301
 ] 

René X Parra commented on SPARK-2883:
-

[~marmbrus] I see this was changed from Version 1.2.0 to Version 1.3.0 ... what 
are the acceptance criteria to get this into 1.2.0 (or now, apparently 1.3.0)?  
Perhaps [~zhazhan] you can provide some guidance on what needs to be done?

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Priority: Blocker
 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png, orc.diff


 Verify the support of OrcInputFormat in Spark, fix issues if they exist, and 
 add documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2014-11-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195308#comment-14195308
 ] 

Michael Armbrust commented on SPARK-2883:
-

The merge deadline for 1.2.0 was on Saturday, so only critical bug fixes can go 
in at this point.  If there is a bug and a fix for using the ORC SerDe, I would 
still consider including that in 1.2.0.

Regarding native support similar to what is already done for Parquet, the 
existing PR needs to be updated to use the Data Sources API that was added in 
1.2.0.  I'll have time to do a more thorough review of that PR after we release 
1.2.0.

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Priority: Blocker
 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png, orc.diff


 Verify the support of OrcInputFormat in Spark, fix issues if they exist, and 
 add documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4178) Hadoop input metrics ignore bytes read in RecordReader instantiation

2014-11-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4178.

Resolution: Fixed
  Assignee: Sandy Ryza

 Hadoop input metrics ignore bytes read in RecordReader instantiation
 

 Key: SPARK-4178
 URL: https://issues.apache.org/jira/browse/SPARK-4178
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4214) With dynamic allocation, avoid outstanding requests for more executors than pending tasks need

2014-11-03 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4214:
-

 Summary: With dynamic allocation, avoid outstanding requests for 
more executors than pending tasks need
 Key: SPARK-4214
 URL: https://issues.apache.org/jira/browse/SPARK-4214
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza


Dynamic allocation tries to allocate more executors while we have pending tasks 
remaining.  Our current policy can end up with more outstanding executor 
requests than needed to fulfill all the pending tasks.  Capping the executor 
requests to the number of cores needed to fulfill all pending tasks would make 
dynamic allocation behavior less sensitive to settings for maxExecutors.
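One way to state the proposed cap, as a rough sketch (the names are illustrative, not the actual ExecutorAllocationManager fields):
{code}
// Illustrative arithmetic only: cap outstanding executor requests by what the
// pending tasks can actually use.
def maxExecutorsNeeded(pendingTasks: Int, coresPerExecutor: Int, coresPerTask: Int): Int = {
  val tasksPerExecutor = math.max(1, coresPerExecutor / coresPerTask)
  // round up so a partial executor's worth of tasks still gets an executor
  (pendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
}

// e.g. 37 pending tasks, 4 cores per executor, 1 core per task => at most 10
// executors requested, regardless of spark.dynamicAllocation.maxExecutors
{code}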



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3720) support ORC in spark sql

2014-11-03 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195402#comment-14195402
 ] 

Zhan Zhang commented on SPARK-3720:
---

[~neoword] Wangfei and I are working together to solve this issue. The initial 
consolidation is done, but some features are not available yet, e.g., predicate 
pushdown.  By the way, my original patch only works in local mode, and the 
predicate pushdown is not working as expected due to a bug.  After the current 
hive-0.13.1 support upstream is stable, I will patch the diff into this PR so 
that it has full feature support. 

 support ORC in spark sql
 

 Key: SPARK-3720
 URL: https://issues.apache.org/jira/browse/SPARK-3720
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Attachments: orc.diff


 The Optimized Row Columnar (ORC) file format provides a highly efficient way 
 to store data on HDFS. The ORC file format has many advantages, such as:
 1. a single file as the output of each task, which reduces the NameNode's load
 2. Hive type support, including datetime, decimal, and the complex types 
 (struct, list, map, and union)
 3. light-weight indexes stored within the file, used to skip row groups that 
 don't pass predicate filtering and to seek to a given row
 4. block-mode compression based on data type: run-length encoding for integer 
 columns and dictionary encoding for string columns
 5. concurrent reads of the same file using separate RecordReaders
 6. the ability to split files without scanning for markers
 7. a bound on the amount of memory needed for reading or writing
 8. metadata stored using Protocol Buffers, which allows addition and removal 
 of fields
 Spark SQL already supports Parquet; supporting ORC as well would give people 
 more options.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2014-11-03 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195427#comment-14195427
 ] 

Zhan Zhang commented on SPARK-2883:
---

[~neoword], as [~marmbrus] mentioned, the PR needs to be restructured to fit the 
Data Sources API.  Wangfei and I have consolidated our work into 
https://github.com/apache/spark/pull/2576. Now the major missing part in that 
patch is predicate pushdown, which had some problems in my old patch. After 
hive-0.13 support is stable, I will patch predicate pushdown into the PR so it 
has full feature support.

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Priority: Blocker
 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png, orc.diff


 Verify the support of OrcInputFormat in Spark, fix issues if they exist, and 
 add documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3797) Run the shuffle service inside the YARN NodeManager as an AuxiliaryService

2014-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195466#comment-14195466
 ] 

Apache Spark commented on SPARK-3797:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/3082

 Run the shuffle service inside the YARN NodeManager as an AuxiliaryService
 --

 Key: SPARK-3797
 URL: https://issues.apache.org/jira/browse/SPARK-3797
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Affects Versions: 1.1.0
Reporter: Patrick Wendell
Assignee: Andrew Or

 It's also worth considering running the shuffle service in a YARN container 
 beside the executor(s) on each node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4215) Allow requesting executors only on Yarn (for now)

2014-11-03 Thread Andrew Or (JIRA)
Andrew Or created SPARK-4215:


 Summary: Allow requesting executors only on Yarn (for now)
 Key: SPARK-4215
 URL: https://issues.apache.org/jira/browse/SPARK-4215
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical


Currently if the user attempts to call `sc.requestExecutors` on, say, 
standalone mode, it just fails silently. We must at the very least log a 
warning to say it's only available for Yarn currently, or maybe even throw an 
exception.
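A minimal sketch of the guard being proposed; the class and names here are hypothetical, not the actual SparkContext code:
{code}
// Illustrative guard only -- names are hypothetical, not real Spark internals.
class ExecutorRequestGuard(clusterManager: String) {
  def requestExecutors(numAdditionalExecutors: Int): Boolean = {
    if (clusterManager != "yarn") {
      // at minimum, tell the user instead of failing silently
      System.err.println(
        s"WARN Requesting executors is only supported on YARN; ignoring request for $numAdditionalExecutors executor(s).")
      false
    } else {
      true  // forward the request to the cluster manager here
    }
  }
}
{code}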



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4215) Allow requesting executors only on Yarn (for now)

2014-11-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4215:
-
Description: Currently if the user attempts to call `sc.requestExecutors` 
or enables dynamic allocation on, say, standalone mode, it just fails silently. 
We must at the very least log a warning to say it's only available for Yarn 
currently, or maybe even throw an exception.  (was: Currently if the user 
attempts to call `sc.requestExecutors` on, say, standalone mode, it just fails 
silently. We must at the very least log a warning to say it's only available 
for Yarn currently, or maybe even throw an exception.)

 Allow requesting executors only on Yarn (for now)
 -

 Key: SPARK-4215
 URL: https://issues.apache.org/jira/browse/SPARK-4215
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

 Currently if the user attempts to call `sc.requestExecutors` or enables 
 dynamic allocation on, say, standalone mode, it just fails silently. We must 
 at the very least log a warning to say it's only available for Yarn 
 currently, or maybe even throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4213) SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators

2014-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195480#comment-14195480
 ] 

Apache Spark commented on SPARK-4213:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/3083

 SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators
 -

 Key: SPARK-4213
 URL: https://issues.apache.org/jira/browse/SPARK-4213
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
 Environment: CDH5.2, Hive 0.13.1, Spark 1.2 snapshot (commit hash 
 76386e1a23c)
Reporter: Terry Siu
Priority: Blocker
 Fix For: 1.2.0


 When I issue a hql query against a HiveContext where my predicate uses a 
 column of string type with one of the LT, LTE, GT, or GTE operators, I get the 
 following error:
 scala.MatchError: StringType (of class 
 org.apache.spark.sql.catalyst.types.StringType$)
 Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, 
 StringType is absent from the corresponding functions for creating these 
 filters.
 To reproduce, in a Hive 0.13.1 shell, I created the following table (at a 
 specified DB):
 create table sparkbug (
   id int,
   event string
 ) stored as parquet;
 Insert some sample data:
 insert into table sparkbug select 1, '2011-06-18' from some table limit 1;
 insert into table sparkbug select 2, '2012-01-01' from some table limit 1;
 Launch a spark shell and create a HiveContext to the metastore where the 
 table above is located.
 import org.apache.spark.sql._
 import org.apache.spark.sql.SQLContext
 import org.apache.spark.sql.hive.HiveContext
 val hc = new HiveContext(sc)
 hc.setConf("spark.sql.shuffle.partitions", "10")
 hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
 hc.setConf("spark.sql.parquet.compression.codec", "snappy")
 import hc._
 hc.hql("select * from db.sparkbug where event >= '2011-12-01'")
 A scala.MatchError will appear in the output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4216) Eliminate Jenkins GitHub posts from AMPLab

2014-11-03 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195482#comment-14195482
 ] 

Nicholas Chammas commented on SPARK-4216:
-

cc [~shaneknapp] [~joshrosen]

 Eliminate Jenkins GitHub posts from AMPLab
 --

 Key: SPARK-4216
 URL: https://issues.apache.org/jira/browse/SPARK-4216
 Project: Spark
  Issue Type: Bug
  Components: Build, Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 * [Real Jenkins | 
 https://github.com/apache/spark/pull/2988#issuecomment-60873361]
 * [Imposter Jenkins | 
 https://github.com/apache/spark/pull/2988#issuecomment-60873366]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab

2014-11-03 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-4216:

Summary: Eliminate duplicate Jenkins GitHub posts from AMPLab  (was: 
Eliminate Jenkins GitHub posts from AMPLab)

 Eliminate duplicate Jenkins GitHub posts from AMPLab
 

 Key: SPARK-4216
 URL: https://issues.apache.org/jira/browse/SPARK-4216
 Project: Spark
  Issue Type: Bug
  Components: Build, Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 * [Real Jenkins | 
 https://github.com/apache/spark/pull/2988#issuecomment-60873361]
 * [Imposter Jenkins | 
 https://github.com/apache/spark/pull/2988#issuecomment-60873366]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4166) Display the executor ID in the Web UI when ExecutorLostFailure happens

2014-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195542#comment-14195542
 ] 

Apache Spark commented on SPARK-4166:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/3085

 Display the executor ID in the Web UI when ExecutorLostFailure happens
 --

 Key: SPARK-4166
 URL: https://issues.apache.org/jira/browse/SPARK-4166
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 1.1.0
Reporter: Shixiong Zhu
Priority: Minor
 Fix For: 1.2.0


 Now when ExecutorLostFailure happens, the Web UI only displays 
 ExecutorLostFailure (executor lost).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4210) Add Extra-Trees algorithm to MLlib

2014-11-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195552#comment-14195552
 ] 

Joseph K. Bradley commented on SPARK-4210:
--

[~0asa] For the API, do you plan for this to be a new algorithm, or a set of 
new parameters for RandomForest?

For the API, I could imagine a few options:
* RandomForest parameters: Provide ExtraTrees implicitly by allowing users to 
tweak parameters of RandomForest.  This seems best if users would want to tweak 
several parameters on their own.
* RandomForest.trainExtraTrees() method: Provide a new train() method which 
calls RandomForest but constrains parameters to fit the ExtraTrees algorithm.  
This seems best if people would look for the ExtraTrees name and if we can 
simplify the API by constraining the set of parameters they can call ExtraTrees 
with.  I vote for this choice, if possible.
* ExtraTrees class: Provide an entirely new class.  This seems non-ideal to me.

For the implementation, I would hope it could be implemented by generalizing 
RandomForest with a new splitting option.  I'm not sure how much change to the 
internals it would require; if it's too much, it might merit a new underlying 
implementation.  Hopefully not though!

What are your thoughts?
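A hedged sketch of the {{RandomForest.trainExtraTrees()}} option above, written as a plain wrapper: the method does not exist in MLlib, and the pinned parameters are only a guess at what an Extra-Trees variant would constrain.
{code}
// Hypothetical wrapper only: delegates to the existing RandomForest trainer
// while pinning the parameters an Extra-Trees variant would constrain.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

object ExtraTreesExample {
  def trainExtraTreesClassifier(
      input: RDD[LabeledPoint],
      numClasses: Int,
      numTrees: Int,
      maxDepth: Int = 5,
      seed: Int = 42): RandomForestModel = {
    // A real Extra-Trees implementation would replace the best-split search
    // with randomized cut points; here we can only reuse the existing splits.
    RandomForest.trainClassifier(
      input,
      numClasses,
      Map[Int, Int](),        // categoricalFeaturesInfo
      numTrees,
      "sqrt",                 // featureSubsetStrategy, as in typical Extra-Trees setups
      "gini",                 // impurity
      maxDepth,
      32,                     // maxBins
      seed)
  }
}
{code}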

 Add Extra-Trees algorithm to MLlib
 --

 Key: SPARK-4210
 URL: https://issues.apache.org/jira/browse/SPARK-4210
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Vincent Botta

 This task will add Extra-Trees support to Spark MLlib. The implementation 
 could be inspired by the current Random Forest algorithm. This algorithm is 
 expected to be particularly well suited because sorting of attributes is not 
 required, as opposed to the original Random Forest approach (with similar 
 and/or better predictive power). 
 The tasks involves:
 - Code implementation
 - Unit tests
 - Functional tests
 - Performance tests
 - Documentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4163) When fetching blocks unsuccessfully, Web UI only displays Fetch failure

2014-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195575#comment-14195575
 ] 

Apache Spark commented on SPARK-4163:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/3086

 When fetching blocks unsuccessfully, Web UI only displays Fetch failure
 -

 Key: SPARK-4163
 URL: https://issues.apache.org/jira/browse/SPARK-4163
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 1.0.0, 1.1.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor
 Fix For: 1.2.0


 When fetching blocks unsuccessfully, the Web UI only displays Fetch failure. 
 It's hard to find out the cause of the fetch failure. The Web UI should 
 display the stack trace for the fetch failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-611) Allow JStack to be run from web UI

2014-11-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-611:

Affects Version/s: 1.0.0

 Allow JStack to be run from web UI
 --

 Key: SPARK-611
 URL: https://issues.apache.org/jira/browse/SPARK-611
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Josh Rosen
 Fix For: 1.2.0


 Huge debugging improvement if the standalone mode dashboard can run jstack 
 and show it on the web page for a slave.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-611) Allow JStack to be run from web UI

2014-11-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-611.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

 Allow JStack to be run from web UI
 --

 Key: SPARK-611
 URL: https://issues.apache.org/jira/browse/SPARK-611
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Josh Rosen
 Fix For: 1.2.0


 Huge debugging improvement if the standalone mode dashboard can run jstack 
 and show it on the web page for a slave.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2938) Support SASL authentication in Netty network module

2014-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195599#comment-14195599
 ] 

Apache Spark commented on SPARK-2938:
-

User 'aarondav' has created a pull request for this issue:
https://github.com/apache/spark/pull/3087

 Support SASL authentication in Netty network module
 ---

 Key: SPARK-2938
 URL: https://issues.apache.org/jira/browse/SPARK-2938
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Aaron Davidson
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2938) Support SASL authentication in Netty network module

2014-11-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2938:
-
Target Version/s: 1.2.0

 Support SASL authentication in Netty network module
 ---

 Key: SPARK-2938
 URL: https://issues.apache.org/jira/browse/SPARK-2938
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Reynold Xin
Assignee: Aaron Davidson
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2938) Support SASL authentication in Netty network module

2014-11-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2938:
-
Affects Version/s: 1.1.0

 Support SASL authentication in Netty network module
 ---

 Key: SPARK-2938
 URL: https://issues.apache.org/jira/browse/SPARK-2938
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Reynold Xin
Assignee: Aaron Davidson
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2938) Support SASL authentication in Netty network module

2014-11-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2938:
-
Affects Version/s: (was: 1.1.0)
   1.2.0

 Support SASL authentication in Netty network module
 ---

 Key: SPARK-2938
 URL: https://issues.apache.org/jira/browse/SPARK-2938
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Reynold Xin
Assignee: Aaron Davidson
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4217) Result of SparkSQL is incorrect after a table join and group by operation

2014-11-03 Thread peter.zhang (JIRA)
peter.zhang created SPARK-4217:
--

 Summary: Result of SparkSQL is incorrect after a table join and 
group by operation
 Key: SPARK-4217
 URL: https://issues.apache.org/jira/browse/SPARK-4217
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: Hadoop 2.2.0
Spark1.1
Reporter: peter.zhang
Priority: Critical


I ran a test using the same SQL script in the SparkSQL, Shark, and Hive 
environments, as below
---
select c.theyear, sum(b.amount)
from tblstock a
join tblStockDetail b on a.ordernumber = b.ordernumber
join tbldate c on a.dateid = c.dateid
group by c.theyear;


result of hive/shark:
theyear  _c1
2004  1403018
2005  5557850
2006  7203061
2007  11300432
2008  12109328
2009  5365447
2010  188944

result of SparkSQL:
2010  210924
2004  3265696
2005  13247234
2006  13670416
2007  16711974
2008  14670698
2009  6322137

I'll attach test data soon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4217) Result of SparkSQL is incorrect after a table join and group by operation

2014-11-03 Thread peter.zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peter.zhang updated SPARK-4217:
---
Description: 
I ran a test using the same SQL script in the SparkSQL, Shark, and Hive 
environments, as below
---
select c.theyear, sum(b.amount)
from tblstock a
join tblStockDetail b on a.ordernumber = b.ordernumber
join tbldate c on a.dateid = c.dateid
group by c.theyear;


result of hive/shark:
theyear  _c1
2004  1403018
2005  5557850
2006  7203061
2007  11300432
2008  12109328
2009  5365447
2010  188944

result of SparkSQL:
2010  210924
2004  3265696
2005  13247234
2006  13670416
2007  16711974
2008  14670698
2009  6322137


  was:
I ran a test using the same SQL script in the SparkSQL, Shark, and Hive 
environments, as below
---
select c.theyear, sum(b.amount)
from tblstock a
join tblStockDetail b on a.ordernumber = b.ordernumber
join tbldate c on a.dateid = c.dateid
group by c.theyear;


result of hive/shark:
theyear  _c1
2004  1403018
2005  5557850
2006  7203061
2007  11300432
2008  12109328
2009  5365447
2010  188944

result of SparkSQL:
2010  210924
2004  3265696
2005  13247234
2006  13670416
2007  16711974
2008  14670698
2009  6322137

I'll attach test data soon


 Result of SparkSQL is incorrect after a table join and group by operation
 -

 Key: SPARK-4217
 URL: https://issues.apache.org/jira/browse/SPARK-4217
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: Hadoop 2.2.0
 Spark1.1
Reporter: peter.zhang
Priority: Critical
 Attachments: TestScript.sql, saledata.zip


 I ran a test using the same SQL script in the SparkSQL, Shark, and Hive 
 environments, as below
 ---
 select c.theyear, sum(b.amount)
 from tblstock a
 join tblStockDetail b on a.ordernumber = b.ordernumber
 join tbldate c on a.dateid = c.dateid
 group by c.theyear;
 result of hive/shark:
 theyear   _c1
 2004  1403018
 2005  5557850
 2006  7203061
 2007  11300432
 2008  12109328
 2009  5365447
 2010  188944
 result of SparkSQL:
 2010  210924
 2004  3265696
 2005  13247234
 2006  13670416
 2007  16711974
 2008  14670698
 2009  6322137



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4192) Support UDT in Python

2014-11-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4192.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3068
[https://github.com/apache/spark/pull/3068]

 Support UDT in Python
 -

 Key: SPARK-4192
 URL: https://issues.apache.org/jira/browse/SPARK-4192
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor
 Fix For: 1.2.0


 This is a sub-task of SPARK-3572 for UDT support in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab

2014-11-03 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195711#comment-14195711
 ] 

shane knapp commented on SPARK-4216:


actually, you got the 'real' and 'impostor' jenkins bots backwards.  amplab 
hosts not only spark, but many other projects as well.  :)

the github pull request builder (hereafter known as ghprb) only allows one bot 
per jenkins instance.  spark works around this by using their own bot, with 
injected oauth tokens, which uses the ghprb/github api to post additional 
messages.  the primary bot (amplab jenkins) also posts automagically.

two solutions:

1) we could just use amplab jenkins, instead of spark qa.  the other research 
projects do NOT want to use the sparkqa bot.
2) i'm sure that the ghprb folks wouldn't mind a PR to add multi-bot support.

thoughts?  personally, i'd lean towards (1).

 Eliminate duplicate Jenkins GitHub posts from AMPLab
 

 Key: SPARK-4216
 URL: https://issues.apache.org/jira/browse/SPARK-4216
 Project: Spark
  Issue Type: Bug
  Components: Build, Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 * [Real Jenkins | 
 https://github.com/apache/spark/pull/2988#issuecomment-60873361]
 * [Imposter Jenkins | 
 https://github.com/apache/spark/pull/2988#issuecomment-60873366]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


