[jira] [Commented] (SPARK-8021) DataFrameReader/Writer in Python does not match Scala
[ https://issues.apache.org/jira/browse/SPARK-8021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568672#comment-14568672 ]

Apache Spark commented on SPARK-8021:

User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6578

DataFrameReader/Writer in Python does not match Scala

Key: SPARK-8021
URL: https://issues.apache.org/jira/browse/SPARK-8021
Project: Spark
Issue Type: Sub-task
Affects Versions: 1.4.0
Reporter: Michael Armbrust
Assignee: Davies Liu
Priority: Blocker

When doing {{sqlContext.read.format("json").load(...)}} I get {{AttributeError: 'DataFrameReader' object has no attribute 'format'}}. These APIs should match up so that examples we give in documentation and slides can be used in any language.
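For reference, a minimal sketch (not Spark's actual implementation) of the chainable reader API the issue asks for; the class shape, the delegation to sqlContext.load() with a source= keyword, and the example path are illustrative assumptions only:

{code}
# Hypothetical sketch of a chainable DataFrameReader for PySpark.
class DataFrameReader(object):
    def __init__(self, sqlContext):
        self._sqlContext = sqlContext
        self._format = None

    def format(self, source):
        # Remember the data source name and return self so calls chain.
        self._format = source
        return self

    def load(self, path):
        # Hand off to the context's generic load (sketch only).
        return self._sqlContext.load(path, source=self._format)

# Intended usage, matching the Scala API:
#   df = sqlContext.read.format("json").load("people.json")
{code}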
[jira] [Created] (SPARK-8032) Make version checking in mllib/__init__.py more robust for NumPy 1.10
Manoj Kumar created SPARK-8032:

Summary: Make version checking in mllib/__init__.py more robust for NumPy 1.10
Key: SPARK-8032
URL: https://issues.apache.org/jira/browse/SPARK-8032
Project: Spark
Issue Type: Bug
Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Manoj Kumar

The current check compares version strings, verifying that `1.x` is less than `1.4`. This fails when x has more than one digit: for NumPy 1.10, x >= 4 numerically, yet the string `1.10` still compares less than `1.4`.
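A digit-aware comparison avoids the problem. The sketch below is illustrative (the helper name and the hard-coded minimum are assumptions, not the actual patch):

{code}
# Compare numeric version components instead of raw strings:
# '1.10' < '1.4' lexicographically, but (1, 10) >= (1, 4) as tuples.
def at_least(version, required=(1, 4)):
    parts = tuple(int(p) for p in version.split('.')[:2])
    return parts >= required

assert at_least('1.4')
assert at_least('1.10')       # the NumPy 1.10 case that breaks string checks
assert not at_least('1.3.1')
{code}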
[jira] [Assigned] (SPARK-8021) DataFrameReader/Writer in Python does not match Scala
[ https://issues.apache.org/jira/browse/SPARK-8021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8021:
Assignee: Apache Spark (was: Davies Liu)
[jira] [Assigned] (SPARK-8021) DataFrameReader/Writer in Python does not match Scala
[ https://issues.apache.org/jira/browse/SPARK-8021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8021:
Assignee: Davies Liu (was: Apache Spark)
[jira] [Created] (SPARK-8034) spark-sql security authorization bug
nilone created SPARK-8034:

Summary: spark-sql security authorization bug
Key: SPARK-8034
URL: https://issues.apache.org/jira/browse/SPARK-8034
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.2.1, 1.3.0, 1.3.1
Reporter: nilone

I tried to use beeline against the Thrift JDBC server for an authorization test, with these parameters added to hive-site.xml:

hive.security.authorization.enabled: true
hive.security.authorization.createtable.owner.grants: select,alter,drop

1. SELECT privileges cannot be controlled: anyone can select any table created by other users (applies to Spark 1.1, 1.2, and 1.3).
2. When tables are created from different beeline clients under different user names, the server writes the wrong owner name into the Hive metastore table 'TBLS': it always records the first user that performed a CREATE TABLE. DROP and ALTER privileges between users cannot be controlled either. (This bug affects versions after Spark 1.2; Spark 1.1 is fine.)
[jira] [Closed] (SPARK-8035) 2
[ https://issues.apache.org/jira/browse/SPARK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nilone closed SPARK-8035.
Resolution: Invalid

2

Key: SPARK-8035
URL: https://issues.apache.org/jira/browse/SPARK-8035
Project: Spark
Issue Type: Bug
Reporter: nilone
[jira] [Commented] (SPARK-7980) Support SQLContext.range(end)
[ https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568720#comment-14568720 ]

Animesh Baranawal commented on SPARK-7980:

Regarding the Python support for range, I am unable to check the functioning in pyspark. Even for the pre-defined range function in context.py, when I type the following in ./bin/pyspark:

{code}
sqlContext.range(1, 7, 2).collect()
{code}

I get the error:

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: range() takes exactly 2 arguments (4 given)
{code}

Support SQLContext.range(end)

Key: SPARK-7980
URL: https://issues.apache.org/jira/browse/SPARK-7980
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin

SQLContext.range should also allow only specifying the end position, similar to Python's own range.
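For illustration, here is how an end-only overload could normalize its arguments, mirroring Python's built-in range(); the helper name and the assumption about the existing signature are illustrative, not the final patch:

{code}
# Sketch: a single positional argument is treated as the exclusive end,
# like Python's built-in range(5) meaning range(0, 5).
def normalize_range_args(start, end=None, step=1):
    if end is None:
        start, end = 0, start
    return start, end, step

assert normalize_range_args(5) == (0, 5, 1)
assert normalize_range_args(1, 7, 2) == (1, 7, 2)
{code}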
[jira] [Comment Edited] (SPARK-6816) Add SparkConf API to configure SparkR
[ https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567187#comment-14567187 ]

Rick Moritz edited comment on SPARK-6816 at 6/2/15 8:55 AM:

One current drawback of SparkR's configuration options is the inability to set driver VM options. These are crucial when attempting to run SparkR on a Hortonworks HDP, as both the driver and the application master need to be aware of the hdp.version variable in order to resolve the classpath. While it is possible to pass this variable to the executors, there is no way to pass it to the driver, except for the following exploit/workaround: the SPARK_MEM variable can be abused to pass the required parameters to the driver's VM via string concatenation. Setting the variable to, e.g., "512m -Dhdp.version=NNN" appends the -D option to the -X option that is currently read from this environment variable. A far more obvious and less hacky approach would be a secondary variable in System.env that gets parsed for JVM options, or a separate environment list for the driver, extending what is currently available for executors. I'm adding this as a comment to this issue, since I believe it is sufficiently closely related not to warrant a separate issue.

was (Author: rpcmoritz): (identical to the text above)

Add SparkConf API to configure SparkR

Key: SPARK-6816
URL: https://issues.apache.org/jira/browse/SPARK-6816
Project: Spark
Issue Type: New Feature
Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

Right now the only way to configure SparkR is to pass in arguments to sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python to make configuration easier.
[jira] [Assigned] (SPARK-8038) PySpark SQL when function is broken on Column
[ https://issues.apache.org/jira/browse/SPARK-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8038:
Assignee: Apache Spark

PySpark SQL when function is broken on Column

Key: SPARK-8038
URL: https://issues.apache.org/jira/browse/SPARK-8038
Project: Spark
Issue Type: Bug
Affects Versions: 1.4.0
Environment: Spark 1.4.0 RC3
Reporter: Olivier Girardot
Assignee: Apache Spark
Priority: Blocker

{code}
In [1]: df = sqlCtx.createDataFrame([(1, 1), (2, 2), (1, 2), (1, 2)], ["key", "value"])

In [2]: from pyspark.sql import functions as F

In [8]: df.select(df.key, F.when(df.key > 1, 0).when(df.key == 0, 2).otherwise(1)).show()
+---+---------------------------------+
|key|CASE WHEN (key = 0) THEN 2 ELSE 1|
+---+---------------------------------+
|  1|                                1|
|  2|                                1|
|  1|                                1|
|  1|                                1|
+---+---------------------------------+
{code}

Whereas in Scala I get the expected expression and behaviour:

{code}
scala> val df = sqlContext.createDataFrame(List((1, 1), (2, 2), (1, 2), (1, 2))).toDF("key", "value")

scala> import org.apache.spark.sql.functions._

scala> df.select(df("key"), when(df("key") > 1, 0).when(df("key") === 2, 2).otherwise(1)).show()
+---+-------------------------------------------------------+
|key|CASE WHEN (key > 1) THEN 0 WHEN (key = 2) THEN 2 ELSE 1|
+---+-------------------------------------------------------+
|  1|                                                      1|
|  2|                                                      0|
|  1|                                                      1|
|  1|                                                      1|
+---+-------------------------------------------------------+
{code}

This is coming from the column.py file, with the Column class definition of **when**, and the fix is coming.
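To see why chaining must preserve earlier branches, here is a toy CASE WHEN builder (a sketch, not Spark's column.py code); the output above suggests PySpark's Column.when was starting a fresh expression instead of extending the branch list:

{code}
# Toy chainable CASE WHEN builder: when() appends to the branch list and
# returns self, so no earlier branch is lost.
class CaseWhen(object):
    def __init__(self):
        self.branches = []   # (condition, value) pairs, in order
        self.default = None

    def when(self, condition, value):
        self.branches.append((condition, value))
        return self

    def otherwise(self, value):
        self.default = value
        return self

expr = CaseWhen().when('key > 1', 0).when('key = 2', 2).otherwise(1)
assert expr.branches == [('key > 1', 0), ('key = 2', 2)]
assert expr.default == 1
{code}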
[jira] [Assigned] (SPARK-8004) Spark does not enclose column names when fetching from jdbc sources
[ https://issues.apache.org/jira/browse/SPARK-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8004:
Assignee: Apache Spark

Spark does not enclose column names when fetching from jdbc sources

Key: SPARK-8004
URL: https://issues.apache.org/jira/browse/SPARK-8004
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Rene Treffer
Assignee: Apache Spark

Spark fails to load tables that have a keyword as a column name. Sample error:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage 157.0 (TID 4322, localhost): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'key,value FROM [XX]'
{code}

A correct query would have been:

{code}
SELECT `key`,`value` FROM ...
{code}
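The fix amounts to quoting each identifier when building the projection. A minimal sketch, assuming MySQL-style backtick quoting (the dialect-specific quote character and the helper name are assumptions):

{code}
# Build "SELECT `key`,`value` FROM kv" from a column list, escaping any
# embedded backticks by doubling them.
def quoted_select(columns, table):
    quoted = ','.join('`%s`' % c.replace('`', '``') for c in columns)
    return 'SELECT %s FROM %s' % (quoted, table)

assert quoted_select(['key', 'value'], 'kv') == 'SELECT `key`,`value` FROM kv'
{code}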
[jira] [Commented] (SPARK-8004) Spark does not enclose column names when fetching from jdbc sources
[ https://issues.apache.org/jira/browse/SPARK-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568666#comment-14568666 ]

Apache Spark commented on SPARK-8004:

User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/6577
[jira] [Assigned] (SPARK-8004) Spark does not enclose column names when fetching from jdbc sources
[ https://issues.apache.org/jira/browse/SPARK-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8004:
Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-8035) 2
nilone created SPARK-8035:

Summary: 2
Key: SPARK-8035
URL: https://issues.apache.org/jira/browse/SPARK-8035
Project: Spark
Issue Type: Bug
Reporter: nilone
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568726#comment-14568726 ]

Saisai Shao commented on SPARK-4352:

Hi [~sandyr], I have a proposal based on ratios to calculate node locality, which covers all situations, even at run time under dynamic allocation. Say we have 300 tasks: 200 tasks preferring nodes a, b, c and 100 tasks preferring nodes a, b, d. The node locality ratio for a : b : c : d is therefore 300 : 300 : 200 : 100.

Now suppose we need to allocate 10 executors. According to that distribution, we calculate the best placement of the 10 executors from the ratio above: 300 * 10 / 300 : 300 * 10 / 300 : 200 * 10 / 300 : 100 * 10 / 300 = 10 : 10 : 7 : 4, rounding up to integers. We then request:

4 executors with preference (a, b, c, d)
3 executors with preference (a, b, c)
3 executors with preference (a, b)

The probability for a and b is highest and for d is lowest, basically following the distribution of the data. If we request 1 executor, this becomes {{300 * 1 / 300 : 300 * 1 / 300 : 200 * 1 / 300 : 100 * 1 / 300 = 1 : 1 : 1 : 1}}, so each node has an equal chance of receiving the executor.

If {{task number <= executor number * cores}}, which means resources exceed demand, both the method above and this ratio-based method are OK, since they will by chance produce the same result; but the ratio-based implementation does not need to treat this as a special case, as the algorithm is the same in every situation.

If some nodes already have executors allocated, say 3 : 3 : 0 : 0 on nodes a, b, c, d, and we still need to request 10 executors, the original ratio is 3 : 3 : 2 : 1, so with equal probability we would end up with 10 executors distributed 3 : 3 : 2 : 2 across a, b, c, d. Since we already have 3 executors on a and b, we actually only need 4 executors on c and d to satisfy the ratio, leaving 6 for a, b, c, d to increase the executor count equally (since at that point the desired distribution is already satisfied).

What do you think about this algorithm? It is fairly general; one concern is that it does not take core counts into consideration.

Incorporate locality preferences in dynamic allocation requests

Key: SPARK-4352
URL: https://issues.apache.org/jira/browse/SPARK-4352
Project: Spark
Issue Type: Improvement
Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
Attachments: Supportpreferrednodelocationindynamicallocation.pdf

Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager.
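A worked sketch of the arithmetic in the proposal (the function is illustrative only): task counts per node are scaled by the requested executor count over the largest per-node count, rounding up:

{code}
import math

# Reproduce the 300 : 300 : 200 : 100 example from the comment above.
def locality_ratio(task_counts, num_executors):
    top = max(task_counts.values())
    return dict((node, int(math.ceil(float(n) * num_executors / top)))
                for node, n in task_counts.items())

counts = {'a': 300, 'b': 300, 'c': 200, 'd': 100}
assert locality_ratio(counts, 10) == {'a': 10, 'b': 10, 'c': 7, 'd': 4}
assert locality_ratio(counts, 1) == {'a': 1, 'b': 1, 'c': 1, 'd': 1}
{code}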
[jira] [Commented] (SPARK-7980) Support SQLContext.range(end)
[ https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568722#comment-14568722 ]

Animesh Baranawal commented on SPARK-7980:

(Duplicate of the comment above; deleted in the following message.)
[jira] [Issue Comment Deleted] (SPARK-7980) Support SQLContext.range(end)
[ https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Animesh Baranawal updated SPARK-7980:
Comment: was deleted (was: the duplicate comment quoted above)
[jira] [Created] (SPARK-8037) Ignores files whose name starts with . while enumerating files in HadoopFsRelation
Cheng Lian created SPARK-8037:

Summary: Ignores files whose name starts with . while enumerating files in HadoopFsRelation
Key: SPARK-8037
URL: https://issues.apache.org/jira/browse/SPARK-8037
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor

Temporary files like {{.DS_Store}} generated by the Mac OS X Finder may cause trouble for partition discovery. A directory whose layout looks like the following:

{noformat}
$ find parquet_partitioned
parquet_partitioned
parquet_partitioned/._common_metadata.crc
parquet_partitioned/._metadata.crc
parquet_partitioned/._SUCCESS.crc
parquet_partitioned/_common_metadata
parquet_partitioned/_metadata
parquet_partitioned/_SUCCESS
parquet_partitioned/year=2014/.DS_Store
parquet_partitioned/year=2014/month=9
parquet_partitioned/year=2014/month=9/.DS_Store
parquet_partitioned/year=2014/month=9/day=1/.DS_Store
parquet_partitioned/year=2014/month=9/day=1/.part-r-8.gz.parquet.crc
parquet_partitioned/year=2014/month=9/day=1/part-r-8.gz.parquet
parquet_partitioned/year=2015
parquet_partitioned/year=2015/month=10
parquet_partitioned/year=2015/month=10/day=25
parquet_partitioned/year=2015/month=10/day=25/.part-r-2.gz.parquet.crc
parquet_partitioned/year=2015/month=10/day=25/.part-r-4.gz.parquet.crc
parquet_partitioned/year=2015/month=10/day=25/part-r-2.gz.parquet
parquet_partitioned/year=2015/month=10/day=25/part-r-4.gz.parquet
parquet_partitioned/year=2015/month=10/day=26
parquet_partitioned/year=2015/month=10/day=26/.part-r-5.gz.parquet.crc
parquet_partitioned/year=2015/month=10/day=26/part-r-5.gz.parquet
parquet_partitioned/year=2015/month=9
parquet_partitioned/year=2015/month=9/day=1
parquet_partitioned/year=2015/month=9/day=1/.part-r-7.gz.parquet.crc
parquet_partitioned/year=2015/month=9/day=1/part-r-7.gz.parquet
{noformat}

causes an exception like this:

{noformat}
scala> val df = sqlContext.read.parquet("parquet_partitioned")
java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
ArrayBuffer(year, month)
ArrayBuffer(year)
ArrayBuffer(year, month, day)
  at scala.Predef$.assert(Predef.scala:179)
  at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:189)
  at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:87)
  at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:492)
  at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:449)
  at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:448)
{noformat}

This is because {{.DS_Store}} files are treated as data files.
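The filtering rule itself is simple; a sketch in Python terms (the actual patch lives in HadoopFsRelation's Scala file enumeration, and treating underscore-prefixed files as non-data follows the existing Hadoop convention, an assumption here):

{code}
import os

# A path is a data file only if its base name is not hidden (leading '.')
# and not a Hadoop marker/summary file (leading '_').
def is_data_file(path):
    name = os.path.basename(path)
    return not name.startswith('.') and not name.startswith('_')

assert not is_data_file('year=2014/.DS_Store')
assert not is_data_file('parquet_partitioned/_SUCCESS')
assert is_data_file('year=2014/month=9/day=1/part-r-8.gz.parquet')
{code}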
[jira] [Commented] (SPARK-6816) Add SparkConf API to configure SparkR
[ https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568807#comment-14568807 ]

Rick Moritz commented on SPARK-6816:

[~shivaram], I am integrating SparkR into an RStudio server (which I would expect to be a rather common use case), so using bin/sparkR won't work in this case, as far as I can tell. Thanks for the suggestion nonetheless.
[jira] [Assigned] (SPARK-8032) Make NumPy version checking in mllib/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8032:
Assignee: (was: Apache Spark)
[jira] [Closed] (SPARK-8034) spark-sql security authorization bug
[ https://issues.apache.org/jira/browse/SPARK-8034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nilone closed SPARK-8034.
Resolution: Invalid
[jira] [Created] (SPARK-8036) Ignores files whose name starts with . while enumerating files in HadoopFsRelation
Cheng Lian created SPARK-8036:

Summary: Ignores files whose name starts with . while enumerating files in HadoopFsRelation
Key: SPARK-8036
URL: https://issues.apache.org/jira/browse/SPARK-8036
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor

(Description identical to SPARK-8037 above.)
[jira] [Resolved] (SPARK-8033) spark-sql thriftserver security authorization bugs!
[ https://issues.apache.org/jira/browse/SPARK-8033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-8033.
Resolution: Duplicate

spark-sql thriftserver security authorization bugs!

Key: SPARK-8033
URL: https://issues.apache.org/jira/browse/SPARK-8033
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.2.1, 1.3.0, 1.3.1
Reporter: nilone

(Description identical to SPARK-8034 above.)
[jira] [Commented] (SPARK-6988) Fix Spark SQL documentation for 1.3.x
[ https://issues.apache.org/jira/browse/SPARK-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568705#comment-14568705 ]

Saurabh Santhosh commented on SPARK-6988:

Hey, can someone update the Spark documentation as well (for correct usage of DataFrames)? https://spark.apache.org/docs/latest/sql-programming-guide.html

E.g.:

{code}
DataFrame teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19");
List<String> teenagerNames = teenagers.map(new Function<Row, String>() {
  public String call(Row row) {
    return "Name: " + row.getString(0);
  }
}).collect();
{code}

needs teenagers.map changed to teenagers.javaRDD().map.

Fix Spark SQL documentation for 1.3.x

Key: SPARK-6988
URL: https://issues.apache.org/jira/browse/SPARK-6988
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Olivier Girardot
Assignee: Olivier Girardot
Priority: Minor
Fix For: 1.3.2, 1.4.0

There are a few glitches regarding the DataFrame API usage in Java, the most important one being how to map a DataFrame result using the javaRDD method.
[jira] [Commented] (SPARK-7993) Improve DataFrame.show() output
[ https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568717#comment-14568717 ]

Akhil Thatipamula commented on SPARK-7993:

I am planning to check whether the data type of a given column is primitive, and if it turns out to be non-primitive, to modify the string value produced by cell.toString. Is that legitimate?

Improve DataFrame.show() output

Key: SPARK-7993
URL: https://issues.apache.org/jira/browse/SPARK-7993
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Priority: Blocker
Labels: starter

1. Each column should be at minimum 3 characters wide. Right now if the widest value is 1, the column is just 1 char wide, which looks ugly. Example below.

2. If a DataFrame has more than N rows (N = 20 by default for show), we should display a message at the end like "only showing top 20 rows".

{code}
+--+--+-+
| a| b|c|
+--+--+-+
| 1| 2|3|
| 1| 2|1|
| 1| 2|3|
| 3| 6|3|
| 1| 2|3|
| 5|10|1|
| 1| 2|3|
| 7|14|3|
| 1| 2|3|
| 9|18|1|
| 1| 2|3|
|11|22|3|
| 1| 2|3|
|13|26|1|
| 1| 2|3|
|15|30|3|
| 1| 2|3|
|17|34|1|
| 1| 2|3|
|19|38|3|
+--+--+-+
only showing top 20 rows   <- add this at the end
{code}

3. For array values, instead of printing ArrayBuffer, we should just print square brackets:

{code}
+------------------+------------------+-----------------+
|       a_freqItems|       b_freqItems|      c_freqItems|
+------------------+------------------+-----------------+
|ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
+------------------+------------------+-----------------+
{code}

should be

{code}
+-----------+-----------+-----------+
|a_freqItems|b_freqItems|c_freqItems|
+-----------+-----------+-----------+
|    [11, 1]|    [2, 22]|     [1, 3]|
+-----------+-----------+-----------+
{code}
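A sketch of the cell formatting items 1 and 3 describe (illustrative Python, not the Scala showString implementation): bracketed rendering for sequence values plus a 3-character minimum column width:

{code}
# Render sequence cells as "[a, b]" and pad every cell to >= 3 chars.
def format_cell(value, width=3):
    if isinstance(value, (list, tuple)):
        text = '[' + ', '.join(str(v) for v in value) + ']'
    else:
        text = str(value)
    return text.rjust(max(width, len(text)))

assert format_cell([11, 1]) == '[11, 1]'
assert format_cell(1) == '  1'   # 3 chars wide instead of 1
{code}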
[jira] [Commented] (SPARK-8011) DecimalType is not a datatype
[ https://issues.apache.org/jira/browse/SPARK-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568823#comment-14568823 ]

Liang-Chi Hsieh commented on SPARK-8011:

Try DecimalType.Unlimited?

DecimalType is not a datatype

Key: SPARK-8011
URL: https://issues.apache.org/jira/browse/SPARK-8011
Project: Spark
Issue Type: Bug
Components: Java API, Spark Core
Affects Versions: 1.3.1
Reporter: Bipin Roshan Nag

When I run the following in spark-shell:

{code}
StructType(StructField("ID", IntegerType, true), StructField("Value", DecimalType, true))
{code}

I get:

{code}
<console>:50: error: type mismatch;
 found   : org.apache.spark.sql.types.DecimalType.type
 required: org.apache.spark.sql.types.DataType
       StructType(StructField("ID", IntegerType, true), StructField("Value", DecimalType, true))
{code}
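The same distinction applies in PySpark; a small sketch for reference (assuming pyspark.sql.types as shipped in the 1.3 line, where DecimalType is instantiated rather than used as a bare class):

{code}
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, DecimalType)

# Note the parentheses: an instance of the type is required, not the class.
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Value", DecimalType(), True),
])
{code}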
[jira] [Assigned] (SPARK-8038) PySpark SQL when function is broken on Column
[ https://issues.apache.org/jira/browse/SPARK-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8038:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-8038) PySpark SQL when function is broken on Column
[ https://issues.apache.org/jira/browse/SPARK-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568836#comment-14568836 ]

Apache Spark commented on SPARK-8038:

User 'ogirardot' has created a pull request for this issue: https://github.com/apache/spark/pull/6580
[jira] [Assigned] (SPARK-8037) Ignores files whose name starts with . while enumerating files in HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8037:
Assignee: Cheng Lian (was: Apache Spark)
[jira] [Assigned] (SPARK-8037) Ignores files whose name starts with . while enumerating files in HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8037:
Assignee: Apache Spark (was: Cheng Lian)
[jira] [Commented] (SPARK-8037) Ignores files whose name starts with . while enumerating files in HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568857#comment-14568857 ]

Apache Spark commented on SPARK-8037:

User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/6581
[jira] [Commented] (SPARK-6988) Fix Spark SQL documentation for 1.3.x
[ https://issues.apache.org/jira/browse/SPARK-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568704#comment-14568704 ]

Saurabh Santhosh commented on SPARK-6988:

(Near-verbatim duplicate of the comment above.)
[jira] [Reopened] (SPARK-8033) spark-sql thriftserver security authorization bugs!
[ https://issues.apache.org/jira/browse/SPARK-8033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen reopened SPARK-8033:
[jira] [Resolved] (SPARK-8033) spark-sql thriftserver security authorization bugs!
[ https://issues.apache.org/jira/browse/SPARK-8033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-8033.
Resolution: Fixed

[~nilone] Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark. You opened this twice, and the JIRA isn't quite correct.
[jira] [Created] (SPARK-8038) PySpark SQL when function is broken on Column
Olivier Girardot created SPARK-8038:

Summary: PySpark SQL when function is broken on Column
Key: SPARK-8038
URL: https://issues.apache.org/jira/browse/SPARK-8038
Project: Spark
Issue Type: Bug
Affects Versions: 1.4.0
Environment: Spark 1.4.0 RC3
Reporter: Olivier Girardot
Priority: Blocker

(Description identical to the one quoted under the assignment message above.)
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Huang updated SPARK-7893:

Description:

Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs*_ or _*merging a small graph into a huge graph*_, higher-level operators on graphs can help users focus and think in graphs; performance optimization can be done internally and be transparent to them. A complex graph operator list is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]

* Union of Graphs (G ∪ H)
* Intersection of Graphs (G ∩ H)
* Graph Join
* Difference of Graphs (G – H)
* Graph Complement
* Line Graph (L(G))

This issue will be the index of all these operators.

was: (same text as above, except the final paragraph read: "This issue will focus on two frequently-used operators first: *union* and *join*.")

Complex Operators between Graphs

Key: SPARK-7893
URL: https://issues.apache.org/jira/browse/SPARK-7893
Project: Spark
Issue Type: Improvement
Components: GraphX
Reporter: Andy Huang
Labels: complex, graph, join, operators, union
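As a set-level illustration of the first operator (independent of the GraphX API, which would operate on vertex and edge RDDs): the union of two graphs unions their vertex and edge sets, merging the attributes of vertices present in both graphs:

{code}
# Toy graph union: a graph is (vertex-dict, edge-set). Vertices present in
# both graphs have their attributes merged; edges are a plain set union.
def graph_union(g, h, merge=lambda a, b: a):
    vertices = dict(h[0])
    for v, attr in g[0].items():
        vertices[v] = merge(attr, vertices[v]) if v in vertices else attr
    edges = set(g[1]) | set(h[1])
    return vertices, edges

g = ({1: 'a', 2: 'b'}, {(1, 2)})
h = ({2: 'B', 3: 'c'}, {(2, 3)})
assert graph_union(g, h) == ({1: 'a', 2: 'b', 3: 'c'}, {(1, 2), (2, 3)})
{code}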
[jira] [Commented] (SPARK-7122) KafkaUtils.createDirectStream - unreasonable processing time in absence of load
[ https://issues.apache.org/jira/browse/SPARK-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568854#comment-14568854 ] Nicolas PHUNG commented on SPARK-7122: -- For _KafkaUtils.createStream_, jobs take between 13 ms and 0.3 s. In detail, stages take between 13 ms and 0.3 s and are split into 1 to 3 tasks. From the streaming page in the Spark UI, the processing time at the 75th percentile is 112 ms and the maximum is 358 ms. For _KafkaUtils.createDirectStream_, jobs take between 13 ms and 7 s. In detail, stages take between 13 ms and 7 s and are split into 275 to 400 tasks. My Kafka topic has 400 partitions, which may explain the task split with _KafkaUtils.createDirectStream_. But I don't understand why it falls behind whereas _KafkaUtils.createStream_ can keep up with the same _foreachrdd_ processing (I mean reprocessing everything from the beginning + keeping up with newer/recent events in Kafka). Of course, I'm using the same executor Spark configuration (cores/RAM) for both. Or maybe I'm doing something wrong somewhere. KafkaUtils.createDirectStream - unreasonable processing time in absence of load --- Key: SPARK-7122 URL: https://issues.apache.org/jira/browse/SPARK-7122 Project: Spark Issue Type: Question Components: Streaming Affects Versions: 1.3.1 Environment: Spark Streaming 1.3.1, standalone mode running on just 1 box: Ubuntu 14.04.2 LTS, 4 cores, 8GB RAM, java version 1.8.0_40 Reporter: Platon Potapov Priority: Minor Attachments: 10.second.window.fast.job.txt, 5.second.window.slow.job.txt, SparkStreamingJob.scala Attached is the complete source code of a test Spark job. No external data generators are run - just the presence of a Kafka topic named raw suffices. The Spark job is run with no load whatsoever. http://localhost:4040/streaming is checked to obtain the job processing duration. * in case the test contains the following transformation: {code}
// dummy transformation
val temperature = bytes.filter(_._1 == "abc")
val abc = temperature.window(Seconds(40), Seconds(5))
abc.print()
{code} the median processing time is 3 seconds 80 ms * in case the test contains the following transformation: {code}
// dummy transformation
val temperature = bytes.filter(_._1 == "abc")
val abc = temperature.map(x => (1, x))
abc.print()
{code} the median processing time is just 50 ms. Please explain why the window transformation introduces such a growth in job duration. note: the result is the same regardless of the number of Kafka topic partitions (I've tried 1 and 8) note2: the result is the same regardless of the window parameters (I've tried (20, 2) and (40, 5)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
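[Editor's note] On the partition-to-task relationship discussed above: with the direct stream, each Kafka partition maps to one Spark partition, so a 400-partition topic yields roughly 400 tasks per stage. A hedged sketch (assumes the Spark 1.4 PySpark Kafka API and an existing SparkContext {{sc}}; broker and topic names are placeholders):
{code}
# Sketch only: "broker:9092" and "raw" are placeholder names.
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 5)
direct = KafkaUtils.createDirectStream(
    ssc, ["raw"], {"metadata.broker.list": "broker:9092"})
# When most partitions are empty, coalescing each batch can cut the
# per-stage task count the reporter observed (275 to 400 tasks).
fewer = direct.transform(lambda rdd: rdd.coalesce(8))
fewer.count().pprint()
{code}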
[jira] [Updated] (SPARK-8032) Make NumPy version checking in mllib/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-8032: --- Summary: Make NumPy version checking in mllib/__init__.py (was: Make version checking in mllib/__init__.py) Make NumPy version checking in mllib/__init__.py Key: SPARK-8032 URL: https://issues.apache.org/jira/browse/SPARK-8032 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Manoj Kumar The current check tests whether the version string `1.x` is less than `1.4`; this will fail once x has more than one digit, since then x > 4 numerically but the string `1.x` < `1.4`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
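[Editor's note] The fix being discussed amounts to comparing numeric components instead of raw strings. A minimal sketch of such a check (standard library only; the function name and the (1, 4) minimum are illustrative, not the actual patch):
{code}
# "1.10" < "1.4" is True as strings, but (1, 10) < (1, 4) is correctly
# False as tuples -- which is the whole bug in miniature.
def _numpy_at_least(version, minimum=(1, 4)):
    major, minor = version.split('.')[:2]
    return (int(major), int(minor)) >= minimum

assert _numpy_at_least("1.4")
assert _numpy_at_least("1.10")      # the case the string compare gets wrong
assert not _numpy_at_least("1.3")
{code}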
[jira] [Commented] (SPARK-8032) Make NumPy version checking in mllib/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568679#comment-14568679 ] Apache Spark commented on SPARK-8032: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6579 Make NumPy version checking in mllib/__init__.py Key: SPARK-8032 URL: https://issues.apache.org/jira/browse/SPARK-8032 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Manoj Kumar The current check tests whether the version string `1.x` is less than `1.4`; this will fail once x has more than one digit, since then x > 4 numerically but the string `1.x` < `1.4`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8032) Make NumPy version checking in mllib/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8032: --- Assignee: Apache Spark Make NumPy version checking in mllib/__init__.py Key: SPARK-8032 URL: https://issues.apache.org/jira/browse/SPARK-8032 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Manoj Kumar Assignee: Apache Spark The current check tests whether the version string `1.x` is less than `1.4`; this will fail once x has more than one digit, since then x > 4 numerically but the string `1.x` < `1.4`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8023) Random Number Generation inconsistent in projections in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-8023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8023. Resolution: Fixed Fix Version/s: 1.4.0 Random Number Generation inconsistent in projections in DataFrame - Key: SPARK-8023 URL: https://issues.apache.org/jira/browse/SPARK-8023 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Burak Yavuz Assignee: Yin Huai Priority: Blocker Fix For: 1.4.0 to reproduce (in python): {code}
df = sqlContext.range(0, 10).withColumn('uniform', rand(seed=10))
df.select('uniform', 'uniform' + 1)
{code} You should see that the first column + 1 doesn't equal the second column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
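[Editor's note] As quoted, the second line of the snippet is shorthand (adding 1 to the string 'uniform' would not run in plain Python). A runnable variant of the reproduction, plus a possible mitigation, might look like this (PySpark 1.4 names; the cache() step is an assumption about a workaround, not the actual fix):
{code}
from pyspark.sql.functions import rand

df = sqlContext.range(0, 10).withColumn('uniform', rand(seed=10))
# Reproduction: before the fix the two projections could re-evaluate
# rand() independently, so these columns may disagree:
df.select(df.uniform, (df.uniform + 1).alias('uniform_plus_1')).show()
# Possible mitigation (assumption): materialise the random column once.
df.cache().count()
df.select(df.uniform, (df.uniform + 1).alias('uniform_plus_1')).show()
{code}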
[jira] [Created] (SPARK-8033) spark-sql thriftserver security authorization bugs!
nilone created SPARK-8033: - Summary: spark-sql thriftserver security authorization bugs! Key: SPARK-8033 URL: https://issues.apache.org/jira/browse/SPARK-8033 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.1, 1.3.0, 1.2.1 Reporter: nilone I tried to use Beeline to access the Thrift JDBC server for an authorization test, and these params have been added to hive-site.xml: -- hive.security.authorization.enabled : true hive.security.authorization.createtable.owner.grants : select,alter,drop -- 1. Cannot control the select privilege: anyone can select any table created by other users (true for Spark 1.1, 1.2, and 1.3). 2. When creating tables from different Beeline clients under different user names, the server writes the wrong owner name into the Hive metastore table 'TBLS': it always writes the name of the first user who performed a create table operation. Drop and alter privileges also cannot be controlled between users. (This bug applies to versions after Spark 1.2; Spark 1.1 is OK.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
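[Editor's note] For reference, the reported settings in hive-site.xml form (property names and values exactly as listed in the report):
{code}
<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>select,alter,drop</value>
</property>
{code}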
[jira] [Comment Edited] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568726#comment-14568726 ] Saisai Shao edited comment on SPARK-4352 at 6/2/15 9:23 AM: Hi [~sandyr], I have a proposal based on ratios to calculate node locality which can cover all situations, even at run-time under dynamic allocation. Say we have 300 tasks: 200 tasks prefer nodes a, b, c and 100 tasks prefer nodes a, b, d, so the node locality ratio for a, b, c, d is 300 : 300 : 200 : 100. Now we need to allocate 10 executors, so according to this ratio distribution we can calculate the best placement of the 10 executors: 300 * 10 / 300 : 300 * 10 / 300 : 200 * 10 / 300 : 100 * 10 / 300 = 10 : 10 : 7 : 4, rounding up to get integers, and request: 4 executors: a, b, c, d 3 executors: a, b, c 3 executors: a, b The probability of a and b is highest, and d is lowest, basically following the distribution of the data. If we request 1 executor, this would be {{300 * 1 / 300 : 300 * 1 / 300 : 200 * 1 / 300 : 100 * 1 / 300 = 1 : 1 : 1 : 1}}, so each node has an equal chance to get the executor. If {{task number <= executor number * cores}}, which means more resources are requested than the tasks demand, both the above method and this ratio-based method are OK, since they will by chance be the same; but the ratio-based implementation does not need to consider this special case - the algorithm is the same in every situation. If we already have some nodes with executors allocated, say the current allocation on nodes a, b, c, d is 3 : 3 : 0 : 0, and we still need to request 10 executors, then ideally the ratio changes to 1 : 1 : 7 : 4 by equal probability. And since we already have 3 executors on a and b, we actually only need 4 executors; rounding the ratio to a base of 4 (1 : 1 : 4 : 3), the executor allocation changes to: 1 executor: a, b, c, d 2 executors: c, d 1 executor: c and the remaining 6 executor requests go to a, b, c, d with equal chance. This will keep the ratio close to the optimal 3 : 3 : 2 : 1. What do you think about this algorithm? It's fairly general; one concern is that it does not take core numbers into consideration. was (Author: jerryshao): Hi [~sandyr], I have a proposal based on ratios to calculate node locality which can cover all situations, even at run-time under dynamic allocation. Say we have 300 tasks: 200 tasks prefer nodes a, b, c and 100 tasks prefer nodes a, b, d, so the node locality ratio for a, b, c, d is 300 : 300 : 200 : 100. Now we need to allocate 10 executors, so according to this ratio distribution we can calculate the best placement of the 10 executors: 300 * 10 / 300 : 300 * 10 / 300 : 200 * 10 / 300 : 100 * 10 / 300 = 10 : 10 : 7 : 4, rounding up to get integers, and request: 4 executors: a, b, c, d 3 executors: a, b, c 3 executors: a, b The probability of a and b is highest, and d is lowest, basically following the distribution of the data. If we request 1 executor, this would be {{300 * 1 / 300 : 300 * 1 / 300 : 200 * 1 / 300 : 100 * 1 / 300 = 1 : 1 : 1 : 1}}, so each node has an equal chance to get the executor. If {{task number <= executor number * cores}}, which means more resources are requested than the tasks demand, both the above method and this ratio-based method are OK, since they will by chance be the same; but the ratio-based implementation does not need to consider this special case - the algorithm is the same in every situation. 
If we already have some nodes with executors allocated, say for example the current allocation on nodes a, b, c, d is 3 : 3 : 0 : 0, and we still need to request 10 executors: originally the ratio is 3 : 3 : 2 : 1, so we would get 10 executors on nodes a, b, c, d as 3 : 3 : 2 : 2 by equal probability. And since we already have 3 executors on a and b, we actually only need 4 executors on c and d to satisfy the ratio, and the remaining 6 are finally left for a, b, c, d to increase the executor numbers equally (since by now the probability is already satisfied). What do you think about this algorithm? It's fairly general; one concern is that it does not take core numbers into consideration. Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
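[Editor's note] To make the arithmetic in the proposal concrete, here is the 10-executor example computed directly (plain Python; the weights and counts are the ones from the comment, and the rounding is the round-up that the 10 : 10 : 7 : 4 figures imply):
{code}
import math

weights = {'a': 300, 'b': 300, 'c': 200, 'd': 100}   # task-locality counts
executors = 10
top = max(weights.values())                           # 300

alloc = {n: int(math.ceil(w * executors / float(top)))
         for n, w in weights.items()}
print(alloc)   # {'a': 10, 'b': 10, 'c': 7, 'd': 4}
{code}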
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568839#comment-14568839 ] Saisai Shao commented on SPARK-4352: Hi [~steve_l], thanks a lot for your suggestions. I don't have a strong background in YARN, so I will try to understand your suggestions and change the code accordingly :). Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8032) Make version checking in mllib/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-8032: --- Summary: Make version checking in mllib/__init__.py (was: Make version checking in mllib/__init__.py more robust for version NumPy 1.10) Make version checking in mllib/__init__.py -- Key: SPARK-8032 URL: https://issues.apache.org/jira/browse/SPARK-8032 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Manoj Kumar The current check tests whether the version string `1.x` is less than `1.4`; this will fail once x has more than one digit, since then x > 4 numerically but the string `1.x` < `1.4`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7680) Add a fake Receiver that generates random strings, useful for prototyping
[ https://issues.apache.org/jira/browse/SPARK-7680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564815#comment-14564815 ] Rohith Yeravothula edited comment on SPARK-7680 at 6/2/15 8:03 AM: --- Written a dummy receiver whose onStart and onStop methods do nothing and whose receive method returns a random string on every recursive call. Please mention if anything else needs to be added. was (Author: rohith): can you please give some more details about it? Add a fake Receiver that generates random strings, useful for prototyping - Key: SPARK-7680 URL: https://issues.apache.org/jira/browse/SPARK-7680 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6988) Fix Spark SQL documentation for 1.3.x
[ https://issues.apache.org/jira/browse/SPARK-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568798#comment-14568798 ] Sean Owen commented on SPARK-6988: -- [~Saurabh Santhosh] This isn't how you report new issues. This issue is closed. Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark You also need to report changes against master. This is already fixed. Fix Spark SQL documentation for 1.3.x - Key: SPARK-6988 URL: https://issues.apache.org/jira/browse/SPARK-6988 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Olivier Girardot Assignee: Olivier Girardot Priority: Minor Fix For: 1.3.2, 1.4.0 There are a few glitches regarding the DataFrame API usage in Java. The most important one being how to map a DataFrame result, using the javaRDD method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8015) flume-sink should not depend on Guava.
[ https://issues.apache.org/jira/browse/SPARK-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-8015. -- Resolution: Fixed Fix Version/s: 1.4.0 flume-sink should not depend on Guava. -- Key: SPARK-8015 URL: https://issues.apache.org/jira/browse/SPARK-8015 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.4.0 The flume-sink module, due to the shared shading code in our build, ends up depending on the {{org.spark-project}} Guava classes. That means users who deploy the sink in Flume will also need to provide those classes somehow, generally by also adding the Spark assembly, which means adding a whole bunch of other libraries to Flume, which may or may not cause other unforeseen problems. It's better to not have that dependency in the flume-sink module instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7894: - Target Version/s: (was: 1.5.0) Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph, union Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px,align=center! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur along the borders of the graphs. For vertices, it's quite natural to just take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
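[Editor's note] Since GraphX is Scala-only, here is a language-neutral toy of the proposed semantics in plain Python, just to illustrate the role of mergeEdges (the dict/set representation is purely illustrative):
{code}
# Toy model: a graph is {'V': set_of_vertex_ids, 'E': {(src, dst): attr}}.
def union(g, h, merge_edges):
    vertices = g['V'] | h['V']                       # VG ∪ VH
    edges = dict(g['E'])                             # start from EG
    for key, attr in h['E'].items():                 # fold in EH
        edges[key] = merge_edges(edges[key], attr) if key in edges else attr
    return {'V': vertices, 'E': edges}

G = {'V': {1, 2}, 'E': {(1, 2): 1.0}}
H = {'V': {2, 3}, 'E': {(1, 2): 2.0, (2, 3): 1.0}}
print(union(G, H, max))   # duplicate edge (1, 2) merged via max -> 2.0
{code}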
[jira] [Created] (SPARK-8048) Explicit partitionning of an RDD with 0 partition will yield empty outer join
Olivier Toupin created SPARK-8048: - Summary: Explicit partitioning of an RDD with 0 partitions will yield an empty outer join Key: SPARK-8048 URL: https://issues.apache.org/jira/browse/SPARK-8048 Project: Spark Issue Type: Bug Reporter: Olivier Toupin Priority: Minor Check this code => https://gist.github.com/anonymous/0f935915f2bc182841f0 Because of this => {{.partitionBy(new HashPartitioner(0))}} the join will return an empty result. The normal expected behaviour here would be for the join to crash, raise an error, or return unjoined results; instead it yields an empty RDD. This is a trivial example, but imagine: {{.partitionBy(new HashPartitioner(previous.partitions.length))}}. You join on an empty previous RDD: the lookup table is empty, and Spark loses all your results instead of returning unjoined results, without warnings or errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
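[Editor's note] A hedged sketch of a defensive pattern that avoids the trap (PySpark; variable names are illustrative, and the clamp is a workaround suggestion, not the eventual fix):
{code}
# Clamp the partition count so a possibly-empty upstream RDD can never
# produce a 0-partition partitioner, which would silently empty the join.
previous = sc.parallelize([(1, 'x')]).filter(lambda kv: False)  # ends up empty
num_parts = max(previous.getNumPartitions(), 1)
lookup = previous.partitionBy(num_parts)
joined = sc.parallelize([(1, 'a'), (3, 'c')]).leftOuterJoin(lookup)
print(joined.collect())   # unjoined keys survive as (key, (value, None))
{code}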
[jira] [Commented] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569495#comment-14569495 ] Joseph K. Bradley commented on SPARK-7893: -- [~andyyehoo] I'm removing the target version; that should really be set by committers, and I think a case needs to be made for each operation separately since they could have very different utility and complexity. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Umbrella Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, but few of them deal with operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators on graphs can help users focus and think in terms of graphs, while performance optimization can be done internally and stay transparent to them. The list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/] * Union of Graphs ( G ∪ H ) * Intersection of Graphs ( G ∩ H ) * Graph Join * Difference of Graphs ( G – H ) * Graph Complement * Line Graph ( L(G) ) This issue will be an index of all these operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7958) Failed StreamingContext.start() can leak active actors
[ https://issues.apache.org/jira/browse/SPARK-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7958: - Affects Version/s: (was: 1.4.0) 1.1.1 1.2.2 1.3.1 Failed StreamingContext.start() can leak active actors -- Key: SPARK-7958 URL: https://issues.apache.org/jira/browse/SPARK-7958 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.1, 1.2.2, 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.4.0 StreamingContext.start() can throw an exception because DStream.validateAtStart() fails (say, the checkpoint directory is not set for a StateDStream). But by then the JobScheduler, JobGenerator, and ReceiverTracker have already started, along with their actors, and those cannot be shut down because the only way to do so is to call StreamingContext.stop(), which cannot be called since the context has not been marked as ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
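[Editor's note] A sketch of the sequence being described, in PySpark terms (the leaked actors live on the Scala side; host/port are placeholders):
{code}
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
counts = (ssc.socketTextStream("localhost", 9999)
             .map(lambda word: (word, 1))
             .updateStateByKey(lambda new, old: sum(new) + (old or 0)))
counts.pprint()
try:
    ssc.start()    # fails: no checkpoint directory for the stateful stream
except Exception as e:
    # By now JobScheduler/JobGenerator/ReceiverTracker have started, and
    # stop() can't clean them up since the context never became ACTIVE.
    print(e)
{code}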
[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569496#comment-14569496 ] Joseph K. Bradley commented on SPARK-7541: -- Oh, I see. That sounds good to do, thanks! Check model save/load for MLlib 1.4 --- Key: SPARK-7541 URL: https://issues.apache.org/jira/browse/SPARK-7541 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley Assignee: yuhao yang For each model which supports save/load methods, we need to verify: * These methods are tested in unit tests in Scala and Python (if save/load is supported in Python). * If a model's name, data members, or constructors have changed _at all_, then we likely need to support a new save/load format version. Different versions must be tested in unit tests to ensure backwards compatibility (i.e., verify we can load old model formats). * Examples in the programming guide should include save/load when available. It's important to try running each example in the guide whenever it is modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8048) Explicit partitioning of an RDD with 0 partitions will yield an empty outer join
[ https://issues.apache.org/jira/browse/SPARK-8048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Toupin updated SPARK-8048: -- Affects Version/s: 1.3.1 Explicit partitioning of an RDD with 0 partitions will yield an empty outer join - Key: SPARK-8048 URL: https://issues.apache.org/jira/browse/SPARK-8048 Project: Spark Issue Type: Bug Affects Versions: 1.3.1 Reporter: Olivier Toupin Priority: Minor Check this code => https://gist.github.com/anonymous/0f935915f2bc182841f0 Because of this => {{.partitionBy(new HashPartitioner(0))}} the join will return an empty result. The normal expected behaviour here would be for the join to crash, raise an error, or return unjoined results; instead it yields an empty RDD. This is a trivial example, but imagine: {{.partitionBy(new HashPartitioner(previous.partitions.length))}}. You join on an empty previous RDD: the lookup table is empty, and Spark loses all your results instead of returning unjoined results, without warnings or errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7958) Failed StreamingContext.start() can leak active actors
[ https://issues.apache.org/jira/browse/SPARK-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7958: - Target Version/s: 1.4.0 (was: 1.4.1) Failed StreamingContext.start() can leak active actors -- Key: SPARK-7958 URL: https://issues.apache.org/jira/browse/SPARK-7958 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.1, 1.2.2, 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.4.0 StreamingContext.start() can throw an exception because DStream.validateAtStart() fails (say, the checkpoint directory is not set for a StateDStream). But by then the JobScheduler, JobGenerator, and ReceiverTracker have already started, along with their actors, and those cannot be shut down because the only way to do so is to call StreamingContext.stop(), which cannot be called since the context has not been marked as ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7985) Remove fittingParamMap references. Update ML Doc Estimator, Transformer, and Param examples.
[ https://issues.apache.org/jira/browse/SPARK-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7985. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6514 [https://github.com/apache/spark/pull/6514] Remove fittingParamMap references. Update ML Doc Estimator, Transformer, and Param examples. Key: SPARK-7985 URL: https://issues.apache.org/jira/browse/SPARK-7985 Project: Spark Issue Type: Bug Components: Documentation, ML Reporter: Mike Dusenberry Priority: Minor Fix For: 1.4.0 Update the ML Doc's Estimator, Transformer, and Param Scala and Java examples to use model.extractParamMap instead of model.fittingParamMap, which no longer exists. Remove all other references to fittingParamMap throughout Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569623#comment-14569623 ] Sandy Ryza commented on SPARK-4352: --- In the case where the task number <= executor number * cores, I think my earlier argument still stands. Any executor requests beyond the ones needed to satisfy our preferences should be submitted without locality preferences. This means we will be less likely to bunch up requests on particular nodes where executors are not needed. Consider the extreme case where we want to request 100 executors but only have a single task with locality preferences, for data on 3 nodes. Going purely by the ratio approach, we would end up requesting all 100 executors on those three nodes. For the other cases, your approach makes sense to me. Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569490#comment-14569490 ] Joseph K. Bradley commented on SPARK-7893: -- I guess I'm OK with keeping an umbrella JIRA since I like organization in JIRA. But we should make sure to justify each operation so that we prioritize the ones which many users really need. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, but few of them deal with operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators on graphs can help users focus and think in terms of graphs, while performance optimization can be done internally and stay transparent to them. The list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/] * Union of Graphs ( G ∪ H ) * Intersection of Graphs ( G ∩ H ) * Graph Join * Difference of Graphs ( G – H ) * Graph Complement * Line Graph ( L(G) ) This issue will be an index of all these operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7893: - Issue Type: Umbrella (was: Improvement) Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Umbrella Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, but few of them deal with operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators on graphs can help users focus and think in terms of graphs, while performance optimization can be done internally and stay transparent to them. The list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/] * Union of Graphs ( G ∪ H ) * Intersection of Graphs ( G ∩ H ) * Graph Join * Difference of Graphs ( G – H ) * Graph Complement * Line Graph ( L(G) ) This issue will be an index of all these operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7890) Document that Spark 2.11 now supports Kafka
[ https://issues.apache.org/jira/browse/SPARK-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7890: - Fix Version/s: (was: 1.4.1) (was: 1.5.0) 1.4.0 Document that Spark 2.11 now supports Kafka --- Key: SPARK-7890 URL: https://issues.apache.org/jira/browse/SPARK-7890 Project: Spark Issue Type: Sub-task Components: Documentation Reporter: Patrick Wendell Assignee: Sean Owen Priority: Critical Fix For: 1.4.0 The building-spark.html page needs to be updated. It's a simple fix, just remove the caveat about Kafka. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8015) flume-sink should not depend on Guava.
[ https://issues.apache.org/jira/browse/SPARK-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-8015: - Assignee: Marcelo Vanzin flume-sink should not depend on Guava. -- Key: SPARK-8015 URL: https://issues.apache.org/jira/browse/SPARK-8015 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.4.0 The flume-sink module, due to the shared shading code in our build, ends up depending on the {{org.spark-project}} Guava classes. That means users who deploy the sink in Flume will also need to provide those classes somehow, generally by also adding the Spark assembly, which means adding a whole bunch of other libraries to Flume, which may or may not cause other unforeseen problems. It's better to not have that dependency in the flume-sink module instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7985) Remove fittingParamMap references. Update ML Doc Estimator, Transformer, and Param examples.
[ https://issues.apache.org/jira/browse/SPARK-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7985: - Assignee: Mike Dusenberry Remove fittingParamMap references. Update ML Doc Estimator, Transformer, and Param examples. Key: SPARK-7985 URL: https://issues.apache.org/jira/browse/SPARK-7985 Project: Spark Issue Type: Bug Components: Documentation, ML Reporter: Mike Dusenberry Assignee: Mike Dusenberry Priority: Minor Fix For: 1.4.0 Update the ML Doc's Estimator, Transformer, and Param Scala and Java examples to use model.extractParamMap instead of model.fittingParamMap, which no longer exists. Remove all other references to fittingParamMap throughout Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7893: - Target Version/s: (was: 1.5.0) Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Umbrella Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, but few of them deal with operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators on graphs can help users focus and think in terms of graphs, while performance optimization can be done internally and stay transparent to them. The list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/] * Union of Graphs ( G ∪ H ) * Intersection of Graphs ( G ∩ H ) * Graph Join * Difference of Graphs ( G – H ) * Graph Complement * Line Graph ( L(G) ) This issue will be an index of all these operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5784) Add StatsDSink to MetricsSystem
[ https://issues.apache.org/jira/browse/SPARK-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams closed SPARK-5784. Resolution: Not A Problem Add StatsDSink to MetricsSystem --- Key: SPARK-5784 URL: https://issues.apache.org/jira/browse/SPARK-5784 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Priority: Minor [StatsD|https://github.com/etsy/statsd/] is a common wrapper for Graphite; it would be useful to support sending metrics to StatsD in addition to [the existing Graphite support|https://github.com/apache/spark/blob/6a1be026cf37e4c8bf39133dfb4a73f7caedcc26/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala]. [readytalk/metrics-statsd|https://github.com/readytalk/metrics-statsd] is a StatsD adapter for the [dropwizard/metrics|https://github.com/dropwizard/metrics] library that Spark uses. The Maven repository at http://dl.bintray.com/readytalk/maven/ serves {{metrics-statsd}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5784) Add StatsDSink to MetricsSystem
[ https://issues.apache.org/jira/browse/SPARK-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569655#comment-14569655 ] Ryan Williams commented on SPARK-5784: -- [~varvind] seems like no; Spark packages would be a reasonable place to index / host a built version of this if someone wanted to do that! I didn't end up doing much with this myself. Add StatsDSink to MetricsSystem --- Key: SPARK-5784 URL: https://issues.apache.org/jira/browse/SPARK-5784 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Priority: Minor [StatsD|https://github.com/etsy/statsd/] is a common wrapper for Graphite; it would be useful to support sending metrics to StatsD in addition to [the existing Graphite support|https://github.com/apache/spark/blob/6a1be026cf37e4c8bf39133dfb4a73f7caedcc26/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala]. [readytalk/metrics-statsd|https://github.com/readytalk/metrics-statsd] is a StatsD adapter for the [dropwizard/metrics|https://github.com/dropwizard/metrics] library that Spark uses. The Maven repository at http://dl.bintray.com/readytalk/maven/ serves {{metrics-statsd}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8050) Make Savable and Loader Java-friendly.
[ https://issues.apache.org/jira/browse/SPARK-8050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8050: --- Assignee: Xiangrui Meng (was: Apache Spark) Make Savable and Loader Java-friendly. -- Key: SPARK-8050 URL: https://issues.apache.org/jira/browse/SPARK-8050 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0, 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Should overload save/load to accept JavaSparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7991) Python DataFrame: support passing a list into describe
[ https://issues.apache.org/jira/browse/SPARK-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569650#comment-14569650 ] Amey Chaugule commented on SPARK-7991: -- [~rxin] : I'd like to work on this in case nobody else is. Python DataFrame: support passing a list into describe -- Key: SPARK-7991 URL: https://issues.apache.org/jira/browse/SPARK-7991 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter DataFrame.describe in Python takes a vararg, i.e. it can be invoked this way: {code} df.describe('col1', 'col2', 'col3') {code} Most of our DataFrame functions accept a list in addition to varargs. describe should do the same, i.e. it should also accept a Python list: {code} df.describe(['col1', 'col2', 'col3']) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
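[Editor's note] The usual PySpark idiom for supporting both call styles is a small vararg normalisation; a sketch of the pattern (illustrative only, not the actual patch):
{code}
def _to_list(*cols):
    # Accept describe('a', 'b') and describe(['a', 'b']) alike.
    if len(cols) == 1 and isinstance(cols[0], list):
        cols = cols[0]
    return list(cols)

assert _to_list('c1', 'c2') == _to_list(['c1', 'c2']) == ['c1', 'c2']
{code}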
[jira] [Assigned] (SPARK-8049) OneVsRest's output includes a temp column
[ https://issues.apache.org/jira/browse/SPARK-8049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8049: --- Assignee: Apache Spark (was: Xiangrui Meng) OneVsRest's output includes a temp column - Key: SPARK-8049 URL: https://issues.apache.org/jira/browse/SPARK-8049 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark The temp accumulator column mbc$acc is included in the output; it should be removed with withoutColumn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8049) OneVsRest's output includes a temp column
[ https://issues.apache.org/jira/browse/SPARK-8049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569708#comment-14569708 ] Apache Spark commented on SPARK-8049: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/6592 OneVsRest's output includes a temp column - Key: SPARK-8049 URL: https://issues.apache.org/jira/browse/SPARK-8049 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng The temp accumulator column mbc$acc is included in the output; it should be removed with withoutColumn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8049) OneVsRest's output includes a temp column
[ https://issues.apache.org/jira/browse/SPARK-8049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8049: --- Assignee: Xiangrui Meng (was: Apache Spark) OneVsRest's output includes a temp column - Key: SPARK-8049 URL: https://issues.apache.org/jira/browse/SPARK-8049 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng The temp accumulator column mbc$acc is included in the output; it should be removed with withoutColumn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8050) Make Savable and Loader Java-friendly.
[ https://issues.apache.org/jira/browse/SPARK-8050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8050: --- Assignee: Apache Spark (was: Xiangrui Meng) Make Savable and Loader Java-friendly. -- Key: SPARK-8050 URL: https://issues.apache.org/jira/browse/SPARK-8050 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0, 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark Priority: Minor Should overload save/load to accept JavaSparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569671#comment-14569671 ] Greg Senia commented on SPARK-5159: --- The SparkSQL Thrift server does not adhere to hive.server2.enable.doAs even though it seems to implement HiveServer2's thrift service. Are there plans to implement this feature? Without it, the SparkSQL Thrift server seems to be of little use in a secure Kerberos environment where the spark/hive user does not have direct access to the data, due to audit reasons. Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8051) StringIndexerModel (and other models) shouldn't complain if the input column is missing.
Xiangrui Meng created SPARK-8051: Summary: StringIndexerModel (and other models) shouldn't complain if the input column is missing. Key: SPARK-8051 URL: https://issues.apache.org/jira/browse/SPARK-8051 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng If a transformer is not used during transformation, it should keep silent if the input column is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8014) DataFrame.write.mode("error").save(...) should not scan the output folder
[ https://issues.apache.org/jira/browse/SPARK-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-8014. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6583 [https://github.com/apache/spark/pull/6583] DataFrame.write.mode("error").save(...) should not scan the output folder - Key: SPARK-8014 URL: https://issues.apache.org/jira/browse/SPARK-8014 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jianshi Huang Assignee: Cheng Lian Fix For: 1.4.0 When saving a DataFrame with {{ErrorIfExists}} as the save mode, we shouldn't do metadata discovery if the destination folder exists. This also applies to {{SaveMode.Overwrite}} and {{SaveMode.Ignore}}. To reproduce this issue, we may make an empty directory {{/tmp/foo}} and leave an empty file {{bar}} there, then execute the following code in the Spark shell: {code}
import sqlContext._
import sqlContext.implicits._

Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo")
{code} From the exception stack trace we can see that the metadata discovery code path is executed: {noformat}
java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small)
  at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
  at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
  at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
  at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
  at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
  at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502)
  at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501)
  at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
  ...
Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small)
  at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:408)
  at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:228)
  at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:224)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
{noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
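[Editor's note] The requested behaviour can be summarised as: with mode "error", check the destination before any schema discovery. A tiny sketch of that ordering (plain Python; the names are illustrative, not Spark's internals):
{code}
import os

def discover_metadata_and_write(path):
    pass  # stands in for the expensive Parquet footer reading / writing

def save_error_if_exists(path):
    if os.path.exists(path):                  # cheap existence check first
        raise IOError("path already exists: %s" % path)
    discover_metadata_and_write(path)         # only reached for fresh paths
{code}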
[jira] [Created] (SPARK-8050) Make Savable and Loader Java-friendly.
Xiangrui Meng created SPARK-8050: Summary: Make Savable and Loader Java-friendly. Key: SPARK-8050 URL: https://issues.apache.org/jira/browse/SPARK-8050 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0, 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Should overload save/load to accept JavaSparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8049) OneVsRest's output includes a temp column
Xiangrui Meng created SPARK-8049: Summary: OneVsRest's output includes a temp column Key: SPARK-8049 URL: https://issues.apache.org/jira/browse/SPARK-8049 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng The temp accumulator column mbc$acc is included in the output which should be removed with withoutColumn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6164) CrossValidatorModel should keep stats from fitting
[ https://issues.apache.org/jira/browse/SPARK-6164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569714#comment-14569714 ] Leah McGuire commented on SPARK-6164: - I fixed the merge conflict. Should be good to go now. CrossValidatorModel should keep stats from fitting -- Key: SPARK-6164 URL: https://issues.apache.org/jira/browse/SPARK-6164 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Priority: Minor CrossValidator computes stats for each (model, fold) pair, but they are thrown out by the created model. CrossValidatorModel should keep this info and expose it to users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8057) Call TaskAttemptContext.getTaskAttemptID using Reflection
Shixiong Zhu created SPARK-8057: --- Summary: Call TaskAttemptContext.getTaskAttemptID using Reflection Key: SPARK-8057 URL: https://issues.apache.org/jira/browse/SPARK-8057 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu Someone may use the Spark core jar in the maven repo with hadoop 1. SPARK-2075 has already resolved the compatibility issue to support it. But SparkHadoopMapRedUtil.commitTask broke it recently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8057) Call TaskAttemptContext.getTaskAttemptID using Reflection
[ https://issues.apache.org/jira/browse/SPARK-8057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-8057: Affects Version/s: 1.3.1 Call TaskAttemptContext.getTaskAttemptID using Reflection - Key: SPARK-8057 URL: https://issues.apache.org/jira/browse/SPARK-8057 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Shixiong Zhu Someone may use the Spark core jar in the maven repo with hadoop 1. SPARK-2075 has already resolved the compatibility issue to support it. But SparkHadoopMapRedUtil.commitTask broke it recently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8057) Call TaskAttemptContext.getTaskAttemptID using Reflection
[ https://issues.apache.org/jira/browse/SPARK-8057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8057: --- Assignee: (was: Apache Spark) Call TaskAttemptContext.getTaskAttemptID using Reflection - Key: SPARK-8057 URL: https://issues.apache.org/jira/browse/SPARK-8057 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Shixiong Zhu Someone may use the Spark core jar in the maven repo with hadoop 1. SPARK-2075 has already resolved the compatibility issue to support it. But SparkHadoopMapRedUtil.commitTask broke it recently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8057) Call TaskAttemptContext.getTaskAttemptID using Reflection
[ https://issues.apache.org/jira/browse/SPARK-8057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570043#comment-14570043 ] Apache Spark commented on SPARK-8057: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/6599 Call TaskAttemptContext.getTaskAttemptID using Reflection - Key: SPARK-8057 URL: https://issues.apache.org/jira/browse/SPARK-8057 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Shixiong Zhu Someone may use the Spark core jar in the maven repo with hadoop 1. SPARK-2075 has already resolved the compatibility issue to support it. But SparkHadoopMapRedUtil.commitTask broke it recently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8057) Call TaskAttemptContext.getTaskAttemptID using Reflection
[ https://issues.apache.org/jira/browse/SPARK-8057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8057: --- Assignee: Apache Spark Call TaskAttemptContext.getTaskAttemptID using Reflection - Key: SPARK-8057 URL: https://issues.apache.org/jira/browse/SPARK-8057 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Shixiong Zhu Assignee: Apache Spark Someone may use the Spark core jar in the maven repo with hadoop 1. SPARK-2075 has already resolved the compatibility issue to support it. But SparkHadoopMapRedUtil.commitTask broke it recently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8058) Add tests for SPARK-7853 and SPARK-8020
Yin Huai created SPARK-8058: --- Summary: Add tests for SPARK-7853 and SPARK-8020 Key: SPARK-8058 URL: https://issues.apache.org/jira/browse/SPARK-8058 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai This jira is used to track the work of adding tests for SPARK-7853 (make sure {{spark-shell}} with and without {{--jars}} works with the isolated class loader) and SPARK-8020 (we are using correct metastore versions and jars setting to initialize {{metadataHive}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8059) Reduce latency between executor requests and RM heartbeat
Marcelo Vanzin created SPARK-8059: - Summary: Reduce latency between executor requests and RM heartbeat Key: SPARK-8059 URL: https://issues.apache.org/jira/browse/SPARK-8059 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Priority: Minor This is a follow-up to SPARK-7533. On top of the changes made as part of that issue, we could reduce allocation latency by waking up the allocation thread when the driver sends new requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
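For illustration only, the wake-up pattern described above might look like the following sketch; every name here is hypothetical, not Spark's actual allocator code:

{code}
// Hypothetical sketch: the allocation thread sleeps at most one heartbeat
// interval, but a new driver request wakes it immediately.
object AllocatorWakeup {
  private val lock = new Object
  private val heartbeatIntervalMs = 3000L  // assumed default interval

  // Called when the driver sends a new executor request.
  def onDriverRequest(): Unit = lock.synchronized { lock.notifyAll() }

  // Called in the allocation thread's loop before each allocation pass.
  def waitForNextAllocation(): Unit = lock.synchronized { lock.wait(heartbeatIntervalMs) }
}
{code}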
[jira] [Commented] (SPARK-8040) Remove Debian specific loopback address setting code
[ https://issues.apache.org/jira/browse/SPARK-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570093#comment-14570093 ] Yuta Kurosaki commented on SPARK-8040: -- I am sorry that I couldn't explain it well. In that situation, I wrote my-pc.local 127.0.0.1 into /etc/hosts, but it was ignored. Do you think this is correct behavior? I have now found issue [SPARK-4389], which seems related to this. Can I re-create this issue once that one is resolved? Remove Debian specific loopback address setting code Key: SPARK-8040 URL: https://issues.apache.org/jira/browse/SPARK-8040 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Yuta Kurosaki Priority: Minor This issue relates to core/src/main/scala/org/apache/spark/util/Utils.scala. The method findLocalInetAddress should not return a non-loopback address when SPARK_LOCAL_IP is not set. The current implementation may cause errors: mainly in development environments, the interface IP address occasionally changes while Spark is running, but the implementation does not follow the change. So I suggest simpler behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8055) Spark Launcher Improvements
Marcelo Vanzin created SPARK-8055: - Summary: Spark Launcher Improvements Key: SPARK-8055 URL: https://issues.apache.org/jira/browse/SPARK-8055 Project: Spark Issue Type: Umbrella Components: Spark Core Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Filing a bug to track different enhancements to the Spark launcher library. Please file sub-tasks for each particular enhancement instead of tagging PRs with this bug's number. After some discussion on the mailing list, people have requested different enhancements to the library. I'll try to capture those here but feel free to add more in the comments. - Missing information about the launched Spark application. Currently the library returns an opaque Process object that doesn't have a lot of Spark-related functionality. It would be useful to get at least some information about the underlying process; at the very least the application ID of the actual Spark job. Other useful information could be, for example, the current status of the submitted job. - Ability to control the underlying application. The Process object only allows you to kill the underlying application. It would be better to have application-level APIs that try to stop the application more cleanly (e.g. by asking the cluster manager to kill it, or by stopping the SparkContext in client mode). - Ability to run Spark applications in the same JVM. This could potentially be done today for cluster mode apps without getting bit by the limitations of SparkContext. In the long run, it would be nice to also support client mode apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8055) Spark Launcher Improvements
[ https://issues.apache.org/jira/browse/SPARK-8055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1457#comment-1457 ] Marcelo Vanzin commented on SPARK-8055: --- /cc [~klmarkey] [~chester.c...@webwarecorp.com] Spark Launcher Improvements --- Key: SPARK-8055 URL: https://issues.apache.org/jira/browse/SPARK-8055 Project: Spark Issue Type: Umbrella Components: Spark Core Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Filing a bug to track different enhancements to the Spark launcher library. Please file sub-tasks for each particular enhancement instead of tagging PRs with this bug's number. After some discussion on the mailing list, people have requested different enhancements to the library. I'll try to capture those here but feel free to add more in the comments. - Missing information about the launched Spark application. Currently the library returns an opaque Process object that doesn't have a lot of Spark-related functionality. It would be useful to get at least some information about the underlying process; at the very least the application ID of the actual Spark job. Other useful information could be, for example, the current status of the submitted job. - Ability to control the underlying application. The Process object only allows you to kill the underlying application. It would be better to have application-level APIs that try to stop the application more cleanly (e.g. by asking the cluster manager to kill it, or by stopping the SparkContext in client mode). - Ability to run Spark applications in the same JVM. This could potentially be done today for cluster mode apps without getting bit by the limitations of SparkContext. In the long run, it would be nice to also support client mode apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
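For context, here is roughly what using the launcher library looks like today (the builder methods and launch() are the existing API; the jar path and class name are placeholders). The caller gets back only a java.lang.Process, which is the limitation the first two bullets describe:

{code}
import org.apache.spark.launcher.SparkLauncher

// launch() returns a plain java.lang.Process: no application ID, no job
// status, and the only control available is Process.destroy().
val process = new SparkLauncher()
  .setAppResource("/path/to/app.jar")  // placeholder path
  .setMainClass("com.example.MyApp")   // placeholder class
  .setMaster("yarn-cluster")
  .launch()
process.waitFor()
{code}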
[jira] [Resolved] (SPARK-8026) Add Column.alias to Scala/Java API
[ https://issues.apache.org/jira/browse/SPARK-8026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8026. Resolution: Fixed Fix Version/s: 1.4.0 Add Column.alias to Scala/Java API -- Key: SPARK-8026 URL: https://issues.apache.org/jira/browse/SPARK-8026 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.4.0 To be consistent with the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
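Usage is a one-liner; a minimal sketch assuming an existing DataFrame df with a column named colA:

{code}
df.select(df("colA").alias("a"))  // equivalent to the existing .as("a")
{code}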
[jira] [Commented] (SPARK-8059) Reduce latency between executor requests and RM heartbeat
[ https://issues.apache.org/jira/browse/SPARK-8059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570073#comment-14570073 ] Apache Spark commented on SPARK-8059: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/6600 Reduce latency between executor requests and RM heartbeat - Key: SPARK-8059 URL: https://issues.apache.org/jira/browse/SPARK-8059 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Priority: Minor This is a follow-up to SPARK-7533. On top of the changes made as part of that issue, we could reduce allocation latency by waking up the allocation thread when the driver sends new requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8056) Design an easier way to construct schema for both Scala and Python
Reynold Xin created SPARK-8056: -- Summary: Design an easier way to construct schema for both Scala and Python Key: SPARK-8056 URL: https://issues.apache.org/jira/browse/SPARK-8056 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin StructType is fairly hard to construct, especially in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
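To illustrate the verbosity, here is what a two-field schema currently takes in Scala (field names are arbitrary); the Python equivalent is even more ceremony:

{code}
import org.apache.spark.sql.types._

// Every field needs an explicit StructField with a type object and a
// nullability flag.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))
{code}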
[jira] [Updated] (SPARK-8049) OneVsRest's output includes a temp column
[ https://issues.apache.org/jira/browse/SPARK-8049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8049: - Fix Version/s: 1.5.0 1.4.1 OneVsRest's output includes a temp column - Key: SPARK-8049 URL: https://issues.apache.org/jira/browse/SPARK-8049 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.1, 1.5.0 The temp accumulator column mbc$acc is included in the output; it should be removed with withoutColumn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7558) Log test name when starting and finishing each test
[ https://issues.apache.org/jira/browse/SPARK-7558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570033#comment-14570033 ] Apache Spark commented on SPARK-7558: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/6598 Log test name when starting and finishing each test --- Key: SPARK-7558 URL: https://issues.apache.org/jira/browse/SPARK-7558 Project: Spark Issue Type: Sub-task Components: Tests Reporter: Patrick Wendell Assignee: Andrew Or Fix For: 1.5.0 Right now it's really tough to interpret testing output because logs for different tests are interspersed in the same unit-tests.log file. This makes it particularly hard to diagnose flaky tests. This would be much easier if we logged the test name before and after every test (e.g. Starting test XX, Finished test XX). Then you could get right to the logs. I think one way to do this might be to create a custom test fixture that logs the test class name and then mix that into every test suite. /cc [~joshrosen] for his superb knowledge of ScalaTest. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
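One possible shape for such a fixture (an illustrative sketch, not necessarily the eventual Spark implementation), using ScalaTest's withFixture hook:

{code}
import org.scalatest.{FunSuite, Outcome}

// Mix this into a suite to bracket every test with start/finish log lines.
trait LogTestName extends FunSuite {
  override protected def withFixture(test: NoArgTest): Outcome = {
    val name = s"${getClass.getSimpleName}.${test.name}"
    println(s"Starting test $name")  // a real version would use the logging framework
    try super.withFixture(test)
    finally println(s"Finished test $name")
  }
}
{code}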
[jira] [Assigned] (SPARK-8059) Reduce latency between executor requests and RM heartbeat
[ https://issues.apache.org/jira/browse/SPARK-8059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8059: --- Assignee: (was: Apache Spark) Reduce latency between executor requests and RM heartbeat - Key: SPARK-8059 URL: https://issues.apache.org/jira/browse/SPARK-8059 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Priority: Minor This is a follow-up to SPARK-7533. On top of the changes made as part of that issue, we could reduce allocation latency by waking up the allocation thread when the driver sends new requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7879) KMeans API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-7879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570089#comment-14570089 ] Yu Ishikawa commented on SPARK-7879: I will implement it. KMeans API for spark.ml Pipelines - Key: SPARK-7879 URL: https://issues.apache.org/jira/browse/SPARK-7879 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Create a K-Means API for the spark.ml Pipelines API. This should wrap the existing KMeans implementation in spark.mllib. This should be the first clustering method added to Pipelines, and it will be important to consider [SPARK-7610] and think about designing the clustering API. We do not have to have abstractions from the beginning (and probably should not) but should think far enough ahead so we can add abstractions later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7991) Python DataFrame: support passing a list into describe
[ https://issues.apache.org/jira/browse/SPARK-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569788#comment-14569788 ] Reynold Xin commented on SPARK-7991: Please go ahead. This one should be simple. Python DataFrame: support passing a list into describe -- Key: SPARK-7991 URL: https://issues.apache.org/jira/browse/SPARK-7991 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter DataFrame.describe in Python takes a vararg, i.e. it can be invoked this way: {code} df.describe('col1', 'col2', 'col3') {code} Most of our DataFrame functions accept a list in addition to varargs. describe should do the same, i.e. it should also accept a Python list: {code} df.describe(['col1', 'col2', 'col3']) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8053) ElementwiseProduct scalingVec param name should match between ml,mllib
Joseph K. Bradley created SPARK-8053: Summary: ElementwiseProduct scalingVec param name should match between ml,mllib Key: SPARK-8053 URL: https://issues.apache.org/jira/browse/SPARK-8053 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor spark.mllib's ElementwiseProduct uses scalingVector. spark.ml's ElementwiseProduct uses scalingVec. We should make them match. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
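Concretely, the mismatch looks like this (a sketch assuming the 1.4 signatures the issue describes; the renamed imports only avoid the class-name clash):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.{ElementwiseProduct => MllibEP}
import org.apache.spark.ml.feature.{ElementwiseProduct => MlEP}

val v = Vectors.dense(0.0, 1.0, 2.0)
val mllibTransformer = new MllibEP(v)            // constructor param: scalingVector
val mlTransformer = new MlEP().setScalingVec(v)  // Param name: scalingVec
{code}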
[jira] [Commented] (SPARK-8054) Java compatibility fixes for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-8054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569900#comment-14569900 ] Apache Spark commented on SPARK-8054: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/6562 Java compatibility fixes for MLlib 1.4 -- Key: SPARK-8054 URL: https://issues.apache.org/jira/browse/SPARK-8054 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley See [SPARK-7529] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7529) Java compatibility check for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-7529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7529. -- Resolution: Fixed Fix Version/s: 1.4.0 I'm marking this Fixed. Since the PR with the fixes will go into 1.4.1 and 1.5, this JIRA is complete. Java compatibility check for MLlib 1.4 -- Key: SPARK-7529 URL: https://issues.apache.org/jira/browse/SPARK-7529 Project: Spark Issue Type: Sub-task Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Fix For: 1.4.0 Check Java compatibility for MLlib 1.4. We should create separate JIRAs for each possible issue. Checking compatibility means: * comparing with the Scala doc * verifying that Java docs are not messed up by Scala type incompatibilities. Some items to look out for are: ** Check for generic Object types where Java cannot understand complex Scala types. ** Check Scala objects (especially with nesting!) carefully. ** Check for uses of Scala and Java enumerations, which can show up oddly in the other language's doc. * If needed for complex issues, create small Java unit tests which execute each method. (The correctness can be checked in Scala.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2551) Cleanup FilteringParquetRowInputFormat
[ https://issues.apache.org/jira/browse/SPARK-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569912#comment-14569912 ] Thomas Omans commented on SPARK-2551: - Wanted to chime in since I upgraded Parquet re: SPARK-7743. After looking at the PARQUET-16 issue, it looks like pull request https://github.com/apache/parquet-mr/pull/17, opened by [~liancheng] (the reporter of PARQUET-16), was closed as resolved by https://github.com/apache/parquet-mr/pull/45 (which is included in the 1.7.0 upgrade). That means these reflection hacks should be ready for removal, or at the very least that the PARQUET-16 ticket should be closed ;) Cleanup FilteringParquetRowInputFormat -- Key: SPARK-2551 URL: https://issues.apache.org/jira/browse/SPARK-2551 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor To work around [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16] and fix [SPARK-2119|https://issues.apache.org/jira/browse/SPARK-2119], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8051) StringIndexerModel (and other models) shouldn't complain if the input column is missing.
[ https://issues.apache.org/jira/browse/SPARK-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8051: --- Assignee: Apache Spark (was: Xiangrui Meng) StringIndexerModel (and other models) shouldn't complain if the input column is missing. Key: SPARK-8051 URL: https://issues.apache.org/jira/browse/SPARK-8051 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark If a transformer is not used during transformation, it should keep silent if the input column is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8051) StringIndexerModel (and other models) shouldn't complain if the input column is missing.
[ https://issues.apache.org/jira/browse/SPARK-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569796#comment-14569796 ] Apache Spark commented on SPARK-8051: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/6595 StringIndexerModel (and other models) shouldn't complain if the input column is missing. Key: SPARK-8051 URL: https://issues.apache.org/jira/browse/SPARK-8051 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng If a transformer is not used during transformation, it should keep silent if the input column is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
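A hypothetical sketch of the proposed behavior (purely illustrative; the actual fix may differ): during schema transformation, pass the schema through unchanged when the input column is absent instead of throwing.

{code}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: validate only when the input column exists;
// otherwise keep silent and return the schema unchanged.
def transformSchemaIfPresent(schema: StructType, inputCol: String)(
    validate: StructType => StructType): StructType = {
  if (schema.fieldNames.contains(inputCol)) validate(schema) else schema
}
{code}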