[jira] [Created] (SPARK-12397) Improve error messages for data sources when they are not found

2015-12-16 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12397:
---

 Summary: Improve error messages for data sources when they are not 
found
 Key: SPARK-12397
 URL: https://issues.apache.org/jira/browse/SPARK-12397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


We can point them to spark-packages.org to find them.
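
A minimal sketch of what that could look like in the data source lookup path 
(the object name and message wording below are illustrative, not the actual 
Spark code):

{code}
object DataSourceLookup {
  // Try to load a data source class by its provider name; if it cannot be
  // found, point the user at spark-packages.org instead of surfacing a raw
  // ClassNotFoundException from the class loader.
  def lookupDataSource(provider: String, loader: ClassLoader): Class[_] = {
    try {
      loader.loadClass(provider)
    } catch {
      case _: ClassNotFoundException =>
        throw new ClassNotFoundException(
          s"Failed to find data source: $provider. " +
            "Please find packages at http://spark-packages.org")
    }
  }
}
{code}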






[jira] [Commented] (SPARK-12397) Improve error messages for data sources when they are not found

2015-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061659#comment-15061659
 ] 

Apache Spark commented on SPARK-12397:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10351

> Improve error messages for data sources when they are not found
> ---
>
> Key: SPARK-12397
> URL: https://issues.apache.org/jira/browse/SPARK-12397
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We can point them to spark-packages.org to find them.






[jira] [Created] (SPARK-12398) Smart truncation of DataFrame / Dataset toString

2015-12-16 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12398:
---

 Summary: Smart truncation of DataFrame / Dataset toString
 Key: SPARK-12398
 URL: https://issues.apache.org/jira/browse/SPARK-12398
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


When a DataFrame or Dataset has a long schema, we should intelligently truncate 
it to avoid flooding the screen with unreadable information.

{code}
// Standard output
[a: int, b: int]

// Truncate many top level fields
[a: int, b: string ... 10 more fields]

// Truncate long inner structs
[a: struct]
{code}
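
A rough sketch of the kind of helper this implies, using an arbitrary cutoff of 
two fields (the real cutoff and formatting would be up to the implementation):

{code}
// Render "name: type" pairs, keeping the first few and summarizing the rest.
def truncatedSchemaString(fields: Seq[(String, String)], maxFields: Int = 2): String = {
  val shown = fields.take(maxFields).map { case (name, dataType) => s"$name: $dataType" }
  val hidden = fields.length - maxFields
  if (hidden > 0) shown.mkString("[", ", ", s" ... $hidden more fields]")
  else shown.mkString("[", ", ", "]")
}

// truncatedSchemaString(Seq("a" -> "int", "b" -> "string", "c" -> "double"))
// returns "[a: int, b: string ... 1 more fields]"
{code}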






[jira] [Assigned] (SPARK-12393) Add read.text and write.text for SparkR

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12393:


Assignee: Apache Spark

> Add read.text and write.text for SparkR
> ---
>
> Key: SPARK-12393
> URL: https://issues.apache.org/jira/browse/SPARK-12393
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Add read.text and write.text for SparkR






[jira] [Assigned] (SPARK-12393) Add read.text and write.text for SparkR

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12393:


Assignee: (was: Apache Spark)

> Add read.text and write.text for SparkR
> ---
>
> Key: SPARK-12393
> URL: https://issues.apache.org/jira/browse/SPARK-12393
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Add read.text and write.text for SparkR






[jira] [Created] (SPARK-12399) Display correct error message when accessing REST API with an unknown app Id

2015-12-16 Thread Carson Wang (JIRA)
Carson Wang created SPARK-12399:
---

 Summary: Display correct error message when accessing REST API 
with an unknown app Id
 Key: SPARK-12399
 URL: https://issues.apache.org/jira/browse/SPARK-12399
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.5.2
Reporter: Carson Wang
Priority: Minor


I got an exception when accessing the REST API below with an unknown 
application Id.
/api/v1/applications/xxx/jobs

Instead of an exception, I expect an error message "no such app: xxx", similar 
to the error message returned when I access "/api/v1/applications/xxx".

{code}
org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
java.util.NoSuchElementException: no app with key xxx
at 
org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
at 
org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:116)
at 
org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:226)
at 
org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:46)
at 
org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
{code}
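
The fix presumably amounts to unwrapping the cache's UncheckedExecutionException 
and answering with the same kind of "no such app" message. A hedged sketch 
against a plain Guava LoadingCache (the real History Server code and the 
exception type it should throw may differ):

{code}
import com.google.common.cache.LoadingCache
import com.google.common.util.concurrent.UncheckedExecutionException

// Look up an application in the app cache, turning an unknown app id into a
// clear "no such app" error instead of the raw cache exception.
def lookupApp[T](appCache: LoadingCache[String, T], appKey: String): T = {
  try {
    appCache.get(appKey)
  } catch {
    case e: UncheckedExecutionException
        if e.getCause.isInstanceOf[NoSuchElementException] =>
      throw new NoSuchElementException(s"no such app: $appKey")
  }
}
{code}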






[jira] [Created] (SPARK-12400) Avoid writing a shuffle file if a partition has no output (empty)

2015-12-16 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12400:
---

 Summary: Avoid writing a shuffle file if a partition has no output 
(empty)
 Key: SPARK-12400
 URL: https://issues.apache.org/jira/browse/SPARK-12400
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Reporter: Reynold Xin


A Spark user was asking for automatic setting of the number of reducers. When I 
pushed for more detail, it turned out that the problem for them is that the 
default of 200 shuffle partitions creates too many files, since most partitions 
are empty.

A simple thing we can do is to avoid creating a shuffle file when a partition 
is empty.
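
Conceptually it is just a guard around file creation in the shuffle write path. 
A simplified sketch (plain java.io here; the real shuffle writer interface is 
more involved):

{code}
import java.io.{File, FileOutputStream}

// Write one shuffle partition's serialized records, but only create the file
// if the partition actually has records.
def writePartition(records: Iterator[Array[Byte]], out: File): Unit = {
  if (!records.hasNext) return  // empty partition: skip creating the file

  val stream = new FileOutputStream(out)
  try {
    records.foreach(bytes => stream.write(bytes))
  } finally {
    stream.close()
  }
}
{code}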







[jira] [Commented] (SPARK-12399) Display correct error message when accessing REST API with an unknown app Id

2015-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061679#comment-15061679
 ] 

Apache Spark commented on SPARK-12399:
--

User 'carsonwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10352

> Display correct error message when accessing REST API with an unknown app Id
> 
>
> Key: SPARK-12399
> URL: https://issues.apache.org/jira/browse/SPARK-12399
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.2
>Reporter: Carson Wang
>Priority: Minor
>
> I got an exception when accessing the REST API below with an unknown 
> application Id.
> /api/v1/applications/xxx/jobs
> Instead of an exception, I expect an error message "no such app: xxx", similar 
> to the error message returned when I access "/api/v1/applications/xxx".
> {code}
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: no app with key xxx
>   at 
> org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
>   at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at 
> org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:116)
>   at 
> org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:226)
>   at 
> org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:46)
>   at 
> org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
> {code}






[jira] [Created] (SPARK-12361) Should set PYSPARK_DRIVER_PYTHON before python test

2015-12-16 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-12361:
--

 Summary: Should set PYSPARK_DRIVER_PYTHON before python test
 Key: SPARK-12361
 URL: https://issues.apache.org/jira/browse/SPARK-12361
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Tests
Affects Versions: 1.6.0
Reporter: Jeff Zhang
Priority: Minor


If PYSPARK_DRIVER_PYTHON is not set, a Python version mismatch exception may 
happen.
{code}
 File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 64, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.6 than that in driver 2.7, 
PySpark cannot run with different minor versions

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

{code}






[jira] [Commented] (SPARK-12353) wrong output for countByValue and countByValueAndWindow

2015-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059675#comment-15059675
 ] 

Sean Owen commented on SPARK-12353:
---

Yeah, since the implementation is

{code}
return self.map(lambda x: (x, None)).reduceByKey(lambda x, y: None).count()
{code}

it does seem like this is actually an implementation of countDistinct or 
something similar.
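
For comparison, the documented behaviour (one (value, count) pair per distinct 
value) is, in plain RDD terms, a word-count over the values; a sketch of those 
semantics rather than of the Streaming internals:

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// countByValue as documented: a (value, frequency) pair per distinct value,
// rather than a single count of how many distinct values there are.
def countByValuePairs[T: ClassTag](rdd: RDD[T]): RDD[(T, Long)] =
  rdd.map(x => (x, 1L)).reduceByKey(_ + _)
{code}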

> wrong output for countByValue and countByValueAndWindow
> ---
>
> Key: SPARK-12353
> URL: https://issues.apache.org/jira/browse/SPARK-12353
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Input/Output, PySpark, Streaming
>Affects Versions: 1.5.2
> Environment: Ubuntu 14.04, Python 2.7.6
>Reporter: Bo Jin
>  Labels: easyfix
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> http://stackoverflow.com/q/34114585/4698425
> In PySpark Streaming, function countByValue and countByValueAndWindow return 
> one single number which is the count of distinct elements, instead of a list 
> of (k,v) pairs.
> It's inconsistent with the documentation: 
> countByValue: When called on a DStream of elements of type K, return a new 
> DStream of (K, Long) pairs where the value of each key is its frequency in 
> each RDD of the source DStream.
> countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a 
> new DStream of (K, Long) pairs where the value of each key is its frequency 
> within a sliding window. Like in reduceByKeyAndWindow, the number of reduce 
> tasks is configurable through an optional argument.






[jira] [Updated] (SPARK-12361) Should set PYSPARK_DRIVER_PYTHON before python test

2015-12-16 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-12361:
---
Description: 
If PYSPARK_DRIVER_PYTHON is not set, a Python version mismatch exception may 
happen. And the weird thing is that this exception won't cause the unit test to 
fail. The return code is still 0, which hides the unit test failure. And if I 
invoke the test command directly, I can see the return code is not 0. This is 
very weird. 

* invoke unit test command directly
{code}
export SPARK_TESTING = 1
export PYSPARK_DRIVER_PYTHON=python2.6
bin/pyspark pyspark.ml.clustering  
{code}
* return code from python unit test
{code}
retcode = subprocess.Popen(
[os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
stderr=per_test_output, stdout=per_test_output, env=env).wait()
{code}
* exception of python version mismatch
{code}
 File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 64, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.6 than that in driver 2.7, 
PySpark cannot run with different minor versions

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

{code}

  was:
If PYSPARK_DRIVER_PYTHON is not set, a Python version mismatch exception may 
happen.
{code}
 File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 64, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.6 than that in driver 2.7, 
PySpark cannot run with different minor versions

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheck

[jira] [Updated] (SPARK-12361) Should set PYSPARK_DRIVER_PYTHON before python test

2015-12-16 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-12361:
---
Description: 
If PYSPARK_DRIVER_PYTHON is not set, a Python version mismatch exception may 
happen (when I set PYSPARK_DRIVER_PYTHON in .profile). And the weird thing is 
that this exception won't cause the unit test to fail. The return code is still 
0, which hides the unit test failure. And if I invoke the test command directly, 
I can see the return code is not 0. This is very weird. 

* invoke unit test command directly
{code}
export SPARK_TESTING = 1
export PYSPARK_DRIVER_PYTHON=python2.6
bin/pyspark pyspark.ml.clustering  
{code}
* return code from python unit test
{code}
retcode = subprocess.Popen(
[os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
stderr=per_test_output, stdout=per_test_output, env=env).wait()
{code}
* exception of python version mismatch
{code}
 File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 64, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.6 than that in driver 2.7, 
PySpark cannot run with different minor versions

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

{code}

  was:
If PYSPARK_DRIVER_PYTHON is not set, a Python version mismatch exception may 
happen. And the weird thing is that this exception won't cause the unit test to 
fail. The return code is still 0, which hides the unit test failure. And if I 
invoke the test command directly, I can see the return code is not 0. This is 
very weird. 

* invoke unit test command directly
{code}
export SPARK_TESTING = 1
export PYSPARK_DRIVER_PYTHON=python2.6
bin/pyspark pyspark.ml.clustering  
{code}
* return code from python unit test
{code}
retcode = subprocess.Popen(
[os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
stderr=per_test_output, stdout=per_test_output, env=env).wait()
{code}
* exception of python version mismatch
{code}
 File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 64, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.6 than that in driver 2.7, 
PySpark cannot run with different minor versions

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scal

[jira] [Updated] (SPARK-12361) Should set PYSPARK_DRIVER_PYTHON before python test

2015-12-16 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-12361:
---
Description: 
If PYSPARK_DRIVER_PYTHON is not set, a Python version mismatch exception may 
happen (when I set PYSPARK_DRIVER_PYTHON in .profile). And the weird thing is 
that this exception won't cause the unit test to fail. The return code is still 
0, which hides the unit test failure. And if I invoke the test command directly, 
I can see the return code is not 0. This is very weird. 

* invoke unit test command directly
{code}
export SPARK_TESTING = 1
export PYSPARK_PYTHON=python2.6
bin/pyspark pyspark.ml.clustering  
{code}
* return code from python unit test
{code}
retcode = subprocess.Popen(
[os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
stderr=per_test_output, stdout=per_test_output, env=env).wait()
{code}
* exception of python version mismatch
{code}
 File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 64, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.6 than that in driver 2.7, 
PySpark cannot run with different minor versions

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

{code}

  was:
If PYSPARK_DRIVER_PYTHON is not set, a Python version mismatch exception may 
happen (when I set PYSPARK_DRIVER_PYTHON in .profile). And the weird thing is 
that this exception won't cause the unit test to fail. The return code is still 
0, which hides the unit test failure. And if I invoke the test command directly, 
I can see the return code is not 0. This is very weird. 

* invoke unit test command directly
{code}
export SPARK_TESTING = 1
export PYSPARK_DRIVER_PYTHON=python2.6
bin/pyspark pyspark.ml.clustering  
{code}
* return code from python unit test
{code}
retcode = subprocess.Popen(
[os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
stderr=per_test_output, stdout=per_test_output, env=env).wait()
{code}
* exception of python version mismatch
{code}
 File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 64, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.6 than that in driver 2.7, 
PySpark cannot run with different minor versions

at 
org.apache.spark.api.python.

[jira] [Commented] (SPARK-12361) Should set PYSPARK_DRIVER_PYTHON before python test

2015-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059686#comment-15059686
 ] 

Apache Spark commented on SPARK-12361:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10322

> Should set PYSPARK_DRIVER_PYTHON before python test
> ---
>
> Key: SPARK-12361
> URL: https://issues.apache.org/jira/browse/SPARK-12361
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> If PYSPARK_DRIVER_PYTHON is not set, a Python version mismatch exception may 
> happen (when I set PYSPARK_DRIVER_PYTHON in .profile). And the weird thing is 
> that this exception won't cause the unit test to fail. The return code is 
> still 0, which hides the unit test failure. And if I invoke the test command 
> directly, I can see the return code is not 0. This is very weird. 
> * invoke unit test command directly
> {code}
> export SPARK_TESTING = 1
> export PYSPARK_PYTHON=python2.6
> bin/pyspark pyspark.ml.clustering  
> {code}
> * return code from python unit test
> {code}
> retcode = subprocess.Popen(
> [os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
> stderr=per_test_output, stdout=per_test_output, env=env).wait()
> {code}
> * exception of python version mismatch
> {code}
>  File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.6 than that in driver 
> 2.7, PySpark cannot run with different minor versions
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
> at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Assigned] (SPARK-12361) Should set PYSPARK_DRIVER_PYTHON before python test

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12361:


Assignee: Apache Spark

> Should set PYSPARK_DRIVER_PYTHON before python test
> ---
>
> Key: SPARK-12361
> URL: https://issues.apache.org/jira/browse/SPARK-12361
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> If PYSPARK_DRIVER_PYTHON is not set, a Python version mismatch exception may 
> happen (when I set PYSPARK_DRIVER_PYTHON in .profile). And the weird thing is 
> that this exception won't cause the unit test to fail. The return code is 
> still 0, which hides the unit test failure. And if I invoke the test command 
> directly, I can see the return code is not 0. This is very weird. 
> * invoke unit test command directly
> {code}
> export SPARK_TESTING = 1
> export PYSPARK_PYTHON=python2.6
> bin/pyspark pyspark.ml.clustering  
> {code}
> * return code from python unit test
> {code}
> retcode = subprocess.Popen(
> [os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
> stderr=per_test_output, stdout=per_test_output, env=env).wait()
> {code}
> * exception of python version mismatch
> {code}
>  File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.6 than that in driver 
> 2.7, PySpark cannot run with different minor versions
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
> at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Assigned] (SPARK-12361) Should set PYSPARK_DRIVER_PYTHON before python test

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12361:


Assignee: (was: Apache Spark)

> Should set PYSPARK_DRIVER_PYTHON before python test
> ---
>
> Key: SPARK-12361
> URL: https://issues.apache.org/jira/browse/SPARK-12361
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> If PYSPARK_DRIVER_PYTHON is not set, a Python version mismatch exception may 
> happen (when I set PYSPARK_DRIVER_PYTHON in .profile). And the weird thing is 
> that this exception won't cause the unit test to fail. The return code is 
> still 0, which hides the unit test failure. And if I invoke the test command 
> directly, I can see the return code is not 0. This is very weird. 
> * invoke unit test command directly
> {code}
> export SPARK_TESTING = 1
> export PYSPARK_PYTHON=python2.6
> bin/pyspark pyspark.ml.clustering  
> {code}
> * return code from python unit test
> {code}
> retcode = subprocess.Popen(
> [os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
> stderr=per_test_output, stdout=per_test_output, env=env).wait()
> {code}
> * exception of python version mismatch
> {code}
>  File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.6 than that in driver 
> 2.7, PySpark cannot run with different minor versions
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
> at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Created] (SPARK-12362) Inline a full-fledged SQL parser

2015-12-16 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12362:
---

 Summary: Inline a full-fledged SQL parser
 Key: SPARK-12362
 URL: https://issues.apache.org/jira/browse/SPARK-12362
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


Spark currently has two SQL parsers it is using: a simple one based on Scala 
parser combinator, and another one based on Hive.

Neither is a good long term solution. The parser combinator one has bad error 
messages for users and does not warn when there are conflicts in the defined 
grammar. The Hive one depends directly on Hive itself, and as a result, it is 
very difficult to introduce new grammar.

The goal of the ticket is to create a single SQL query parser that is powerful 
enough to replace the existing ones. The requirements for the new parser are:

1. Can support almost all of HiveQL
2. Can support everything the existing SQL parser built with Scala parser combinators supports
3. Can be used for expression parsing in addition to SQL query parsing
4. Can provide good error messages for incorrect syntax

Rather than building one from scratch, we should investigate whether we can 
leverage existing open source projects such as Hive (by inlining the parser 
part) or Calcite.








[jira] [Resolved] (SPARK-8585) Support LATERAL VIEW in Spark SQL parser

2015-12-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8585.

Resolution: Duplicate

> Support LATERAL VIEW in Spark SQL parser
> 
>
> Key: SPARK-8585
> URL: https://issues.apache.org/jira/browse/SPARK-8585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Konstantin Shaposhnikov
>Priority: Minor
>
> It would be good to support LATERAL VIEW SQL syntax without need to create 
> HiveContext.
> Docs: 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView






[jira] [Updated] (SPARK-12362) Create a full-fledged built-in SQL parser

2015-12-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12362:

Summary: Create a full-fledged built-in SQL parser  (was: Inline a 
full-fledged SQL parser)

> Create a full-fledged built-in SQL parser
> -
>
> Key: SPARK-12362
> URL: https://issues.apache.org/jira/browse/SPARK-12362
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Spark currently has two SQL parsers it is using: a simple one based on Scala 
> parser combinator, and another one based on Hive.
> Neither is a good long term solution. The parser combinator one has bad error 
> messages for users and does not warn when there are conflicts in the defined 
> grammar. The Hive one depends directly on Hive itself, and as a result, it is 
> very difficult to introduce new grammar.
> The goal of the ticket is to create a single SQL query parser that is 
> powerful enough to replace the existing ones. The requirements for the new 
> parser are:
> 1. Can support almost all of HiveQL
> 2. Can support everything the existing SQL parser built with Scala parser combinators supports
> 3. Can be used for expression parsing in addition to SQL query parsing
> 4. Can provide good error messages for incorrect syntax
> Rather than building one from scratch, we should investigate whether we can 
> leverage existing open source projects such as Hive (by inlining the parser 
> part) or Calcite.






[jira] [Commented] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env

2015-12-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059703#comment-15059703
 ] 

Reynold Xin commented on SPARK-6936:


Is this still a problem on the latest Spark? I think we have already fixed it, 
right?


> SQLContext.sql() caused deadlock in multi-thread env
> 
>
> Key: SPARK-6936
> URL: https://issues.apache.org/jira/browse/SPARK-6936
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: JDK 1.8.x, RedHat
> Linux version 2.6.32-431.23.3.el6.x86_64 
> (mockbu...@x86-027.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
> Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014
>Reporter: Paul Wu
>  Labels: deadlock, sql, threading
>
> Doing the same query in more than one thread with SQLContext.sql may lead 
> to deadlock. Here is a way to reproduce it (since this is a multi-threading 
> issue, the reproduction may or may not be so easy).
> 1. Register a relatively big table.
> 2. Create two different classes and in each class, do the same query in a 
> method, put the results in a set, and print out the set size.
> 3. Create two threads, each using an object of one of the classes in its run 
> method. Start the threads. For my tests, it can deadlock in just a few runs. 
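
A minimal Scala version of that repro, condensed into two threads running the 
same query (the table name and query are placeholders; the original report used 
two separate classes):

{code}
import org.apache.spark.sql.SQLContext

// Run the same query concurrently from two threads and print the result sizes.
// `sqlContext` is an existing SQLContext with "big_table" already registered.
def runConcurrently(sqlContext: SQLContext): Unit = {
  val query = "SELECT * FROM big_table"
  val threads = (1 to 2).map { i =>
    new Thread(new Runnable {
      def run(): Unit = {
        val rows = sqlContext.sql(query).collect().toSet
        println(s"thread $i saw ${rows.size} distinct rows")
      }
    })
  }
  threads.foreach(_.start())
  threads.foreach(_.join())  // per the report, this occasionally never returns
}
{code}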






[jira] [Closed] (SPARK-9379) org.apache.spark.sql.catalyst.SqlParser should be extensible easily

2015-12-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-9379.
--
Resolution: Invalid

I'm closing this ticket because we are going to remove the existing parser soon 
and replace it with a full-fledged one.

https://issues.apache.org/jira/browse/SPARK-12362

> org.apache.spark.sql.catalyst.SqlParser should be extensible easily 
> 
>
> Key: SPARK-9379
> URL: https://issues.apache.org/jira/browse/SPARK-9379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Rishi
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When extending org.apache.spark.sql.catalyst.SqlParser we are usually stuck 
> with the Scala limitation of not being able to override lazy vals and refer 
> to super values from them (see http://www.scala-lang.org/old/node/11315.html). 
> This can be avoided with a simple alteration to SqlParser.scala. See the patch 
> below.
> -  protected lazy val start: Parser[LogicalPlan] =
> -    start1 | insert | cte
> +  protected lazy val start: Parser[LogicalPlan] = allParsers
> +
> +  protected def allParsers = start1 | insert | cte
> This will allow subclasses of SqlParser to override only "allParsers". This 
> will also ease upgrading across Spark revisions.
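
The pattern the patch describes, stripped down to toy parsers (select, insert 
and explain below stand in for start1, insert and cte):

{code}
import scala.util.parsing.combinator.RegexParsers

// Expose the top-level alternatives through an overridable def instead of
// baking them into a lazy val, so subclasses can extend the grammar.
class BaseParser extends RegexParsers {
  protected lazy val select: Parser[String] = "select"
  protected lazy val insert: Parser[String] = "insert"

  protected def allParsers: Parser[String] = select | insert
  protected lazy val start: Parser[String] = allParsers

  def parseQuery(input: String): ParseResult[String] = parseAll(start, input)
}

// A subclass only overrides allParsers to add new top-level grammar.
class ExtendedParser extends BaseParser {
  protected lazy val explain: Parser[String] = "explain"
  override protected def allParsers: Parser[String] = super.allParsers | explain
}
{code}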






[jira] [Commented] (SPARK-12346) GLM summary crashes with NoSuchElementException if attributes are missing names

2015-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059710#comment-15059710
 ] 

Apache Spark commented on SPARK-12346:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/10323

> GLM summary crashes with NoSuchElementException if attributes are missing 
> names
> ---
>
> Key: SPARK-12346
> URL: https://issues.apache.org/jira/browse/SPARK-12346
> Project: Spark
>  Issue Type: Bug
>Reporter: Eric Liang
>
> In getModelFeatures() of SparkRWrappers.scala, we call _.name.get on all the 
> feature column attributes. This fails when the attribute name is not defined.
> One way of reproducing this is to perform glm() in R with a vector-type input 
> feature that lacks ML attrs, and then call summary() on it, for example:
> {code}
> df <- sql(sqlContext, "SELECT * FROM testData")
> df2 <- withColumnRenamed(df, "f1", "f2")  # This drops the ML attrs from f1
> lrModel <- glm(hours_per_week ~ f2, data = df2, family = "gaussian")
> summary(lrModel)  # NoSuchElementException
> {code}
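
The fix is presumably a defensive fallback when an attribute has no name; 
something along these lines (the positional fallback name is a guess, not 
necessarily what the actual patch does):

{code}
import org.apache.spark.ml.attribute.Attribute

// Derive display names for feature columns, falling back to a positional name
// when an attribute carries no name (e.g. after the ML attrs were dropped).
def featureNames(attrs: Array[Attribute]): Array[String] =
  attrs.zipWithIndex.map { case (attr, i) =>
    attr.name.getOrElse(s"f$i")
  }
{code}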






[jira] [Commented] (SPARK-12345) Mesos cluster mode is broken

2015-12-16 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059717#comment-15059717
 ] 

Iulian Dragos commented on SPARK-12345:
---

There isn't any {{SPARK_HOME}} set on any of the Mesos slaves.

Here's what I think happens: the {{SPARK_HOME}} variable is exported by 
{{spark-submit}}, and copied 
[here|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L369-L372]
 to the driver environment.

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}






[jira] [Assigned] (SPARK-12361) Should set PYSPARK_DRIVER_PYTHON before python test

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12361:


Assignee: (was: Apache Spark)

> Should set PYSPARK_DRIVER_PYTHON before python test
> ---
>
> Key: SPARK-12361
> URL: https://issues.apache.org/jira/browse/SPARK-12361
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> If PYSPARK_DRIVER_PYTHON is not set, a Python version mismatch exception may 
> happen (when I set PYSPARK_DRIVER_PYTHON in .profile). And the weird thing is 
> that this exception won't cause the unit test to fail. The return code is 
> still 0, which hides the unit test failure. And if I invoke the test command 
> directly, I can see the return code is not 0. This is very weird. 
> * invoke unit test command directly
> {code}
> export SPARK_TESTING = 1
> export PYSPARK_PYTHON=python2.6
> bin/pyspark pyspark.ml.clustering  
> {code}
> * return code from python unit test
> {code}
> retcode = subprocess.Popen(
> [os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
> stderr=per_test_output, stdout=per_test_output, env=env).wait()
> {code}
> * exception of python version mismatch
> {code}
>  File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.6 than that in driver 
> 2.7, PySpark cannot run with different minor versions
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
> at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12361) Should set PYSPARK_DRIVER_PYTHON before python test

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12361:


Assignee: Apache Spark

> Should set PYSPARK_DRIVER_PYTHON before python test
> ---
>
> Key: SPARK-12361
> URL: https://issues.apache.org/jira/browse/SPARK-12361
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> If PYSPARK_DRIVER_PYTHON is not set before running the Python tests, a Python 
> version mismatch exception may happen (in my case PYSPARK_DRIVER_PYTHON is set 
> in .profile). The weird thing is that this exception does not cause the unit 
> test to fail: the return code is still 0, which hides the failure. Yet if I 
> invoke the test command directly, I can see that the return code is non-zero. 
> * invoke unit test command directly
> {code}
> export SPARK_TESTING=1
> export PYSPARK_PYTHON=python2.6
> bin/pyspark pyspark.ml.clustering  
> {code}
> * return code from python unit test
> {code}
> retcode = subprocess.Popen(
> [os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
> stderr=per_test_output, stdout=per_test_output, env=env).wait()
> {code}
> * exception of python version mismatch
> {code}
>  File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.6 than that in driver 
> 2.7, PySpark cannot run with different minor versions
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
> at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2015-12-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10620:

Assignee: (was: Andrew Or)

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be implemented using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is whether we can use a single internal code path and 
> perhaps make this easier to extend in the future (a rough sketch of the 
> accumulator-based direction follows the questions below).
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see the discussion in SPARK-10042)?
> - Are there metrics that do not fit well into the accumulator model, or that 
> would be difficult to update as an accumulator?
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to ensure 
> compatibility?
> - Are there any other considerations?
> - Is it worth doing this, or is the consolidation too complicated to justify?
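> For illustration only, here is a minimal sketch of the accumulator-based 
> direction, assuming the existing 1.x API ({{sc.accumulator(initialValue, 
> name)}}) and a spark-shell {{sc}}; the metric name is made up, and real 
> internal metrics would need richer retry/merge semantics than this:
> {code}
> // Sketch: a "task metric" modelled as a named accumulator instead of a
> // dedicated TaskMetrics field. Each task adds its contribution; the driver
> // reads the aggregated value.
> val bytesRead = sc.accumulator(0L, "internal.metrics.input.bytesRead")
> sc.textFile("README.md").foreach { line =>
>   bytesRead += line.length.toLong
> }
> println(bytesRead.value)
> {code}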



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12346) GLM summary crashes with NoSuchElementException if attributes are missing names

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12346:


Assignee: Apache Spark

> GLM summary crashes with NoSuchElementException if attributes are missing 
> names
> ---
>
> Key: SPARK-12346
> URL: https://issues.apache.org/jira/browse/SPARK-12346
> Project: Spark
>  Issue Type: Bug
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> In getModelFeatures() of SparkRWrappers.scala, we call _.name.get on all the 
> feature column attributes. This fails when the attribute name is not defined.
> One way of reproducing this is to perform glm() in R with a vector-type input 
> feature that lacks ML attrs, then trying to call summary() on it, for example:
> {code}
> df <- sql(sqlContext, "SELECT * FROM testData")
> df2 <- withColumnRenamed(df, "f1", "f2") // This drops the ML attrs from f1
> lrModel <- glm(hours_per_week ~ f2, data = df2, family = "gaussian")
> summary(lrModel) // NoSuchElementException
> {code}
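> A hedged sketch of the defensive direction (the helper name and the fallback 
> naming scheme are made up; this is not the actual SparkRWrappers code):
> {code}
> import org.apache.spark.ml.attribute.AttributeGroup
> // Never call .name.get directly; fall back to a positional name when an
> // attribute has no name or the group carries no attribute metadata
> // (group.size can be -1 when unknown, hence the max with 0).
> def featureNames(group: AttributeGroup): Array[String] = group.attributes match {
>   case Some(attrs) => attrs.zipWithIndex.map { case (a, i) => a.name.getOrElse(s"V$i") }
>   case None        => Array.tabulate(math.max(group.size, 0))(i => s"V$i")
> }
> {code}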



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12346) GLM summary crashes with NoSuchElementException if attributes are missing names

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12346:


Assignee: (was: Apache Spark)

> GLM summary crashes with NoSuchElementException if attributes are missing 
> names
> ---
>
> Key: SPARK-12346
> URL: https://issues.apache.org/jira/browse/SPARK-12346
> Project: Spark
>  Issue Type: Bug
>Reporter: Eric Liang
>
> In getModelFeatures() of SparkRWrappers.scala, we call _.name.get on all the 
> feature column attributes. This fails when the attribute name is not defined.
> One way of reproducing this is to perform glm() in R with a vector-type input 
> feature that lacks ML attrs, then trying to call summary() on it, for example:
> {code}
> df <- sql(sqlContext, "SELECT * FROM testData")
> df2 <- withColumnRenamed(df, "f1", "f2") // This drops the ML attrs from f1
> lrModel <- glm(hours_per_week ~ f2, data = df2, family = "gaussian")
> summary(lrModel) // NoSuchElementException
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns

2015-12-16 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-12363:
---

 Summary: PowerIterationClustering test case failed if we 
deprecated KMeans.setRuns
 Key: SPARK-12363
 URL: https://issues.apache.org/jira/browse/SPARK-12363
 Project: Spark
  Issue Type: Bug
  Components: GraphX, MLlib
Reporter: Yanbo Liang


We plan to deprecate the `runs` parameter of KMeans; PowerIterationClustering 
leverages KMeans to train its model.
I removed the `setRuns` call used in PowerIterationClustering, but one of the 
test cases then failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns

2015-12-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059753#comment-15059753
 ] 

Yanbo Liang commented on SPARK-12363:
-

After I removed [this 
line|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala#L388],
 
[this|https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringSuite.scala#L71]
 test case failed.
It's very strange that the following test cases use the same dataset, but one 
succeeds and the other fails.
{code}
test("power iteration clustering") {
/*
 We use the following graph to test PIC. All edges are assigned similarity 
1.0 except 0.1 for
 edge (3, 4).
 15-14 -13 -12
 |   |
 4 . 3 - 2  11
 |   | x |   |
 5   0 - 1  10
 |   |
 6 - 7 - 8 - 9
 */

val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0, 
3, 1.0), (1, 2, 1.0),
  (1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
  (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9, 
1.0), (9, 10, 1.0),
  (10, 11, 1.0), (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0))
val model = new PowerIterationClustering()
  .setK(2)
  .run(sc.parallelize(similarities, 2))
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
  predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))

val model2 = new PowerIterationClustering()
  .setK(2)
  .setInitializationMode("degree")
  .run(sc.parallelize(similarities, 2))
val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
model2.assignments.collect().foreach { a =>
  predictions2(a.cluster) += a.id
}
assert(predictions2.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
  }

  test("power iteration clustering on graph") {
/*
 We use the following graph to test PIC. All edges are assigned similarity 
1.0 except 0.1 for
 edge (3, 4).
 15-14 -13 -12
 |   |
 4 . 3 - 2  11
 |   | x |   |
 5   0 - 1  10
 |   |
 6 - 7 - 8 - 9
 */

val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0, 
3, 1.0), (1, 2, 1.0),
  (1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
  (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9, 
1.0), (9, 10, 1.0),
  (10, 11, 1.0), (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0))

val edges = similarities.flatMap { case (i, j, s) =>
  if (i != j) {
Seq(Edge(i, j, s), Edge(j, i, s))
  } else {
None
  }
}
val graph = Graph.fromEdges(sc.parallelize(edges, 2), 0.0)

val model = new PowerIterationClustering()
  .setK(2)
  .run(graph)
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
  predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))

val model2 = new PowerIterationClustering()
  .setK(2)
  .setInitializationMode("degree")
  .run(sc.parallelize(similarities, 2))
val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
model2.assignments.collect().foreach { a =>
  predictions2(a.cluster) += a.id
}
assert(predictions2.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
  }
{code}
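
For debugging, one possible way to make the flaky "random" path reproducible is 
to pin the seed of the underlying k-means step. This is only a sketch of the 
idea against the public KMeans API (whether PIC should expose such a seed is a 
separate question):
{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Sketch only: a seeded k-means run, so a failing initialization can be replayed.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0), Vectors.dense(1.0), Vectors.dense(10.0), Vectors.dense(11.0)))
val model = new KMeans()
  .setK(2)
  .setSeed(42L)            // fixed seed instead of relying on multiple runs
  .setMaxIterations(10)
  .run(points)
println(model.clusterCenters.mkString(", "))
{code}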

> PowerIterationClustering test case failed if we deprecated KMeans.setRuns
> -
>
> Key: SPARK-12363
> URL: https://issues.apache.org/jira/browse/SPARK-12363
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, MLlib
>Reporter: Yanbo Liang
>
> We plan to deprecate the `runs` parameter of KMeans; PowerIterationClustering 
> leverages KMeans to train its model.
> I removed the `setRuns` call used in PowerIterationClustering, but one of the 
> test cases then failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns

2015-12-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059753#comment-15059753
 ] 

Yanbo Liang edited comment on SPARK-12363 at 12/16/15 9:38 AM:
---

After I removed [this 
line|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala#L388],
 
[this|https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringSuite.scala#L71]
 test case failed.
I pasted the test cases here. It's very strange that the following test cases 
are based on the same dataset, but one succeeds and the other fails.
Another clue is that when we use `setInitializationMode("degree")` to train the 
PIC model, both of the following cases pass, but if we use 
`setInitializationMode("random")`, the second test case fails.
{code}
test("power iteration clustering") {
/*
 We use the following graph to test PIC. All edges are assigned similarity 
1.0 except 0.1 for
 edge (3, 4).
 15-14 -13 -12
 |   |
 4 . 3 - 2  11
 |   | x |   |
 5   0 - 1  10
 |   |
 6 - 7 - 8 - 9
 */

val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0, 
3, 1.0), (1, 2, 1.0),
  (1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
  (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9, 
1.0), (9, 10, 1.0),
  (10, 11, 1.0), (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0))
val model = new PowerIterationClustering()
  .setK(2)
  .run(sc.parallelize(similarities, 2))
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
  predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))

val model2 = new PowerIterationClustering()
  .setK(2)
  .setInitializationMode("degree")
  .run(sc.parallelize(similarities, 2))
val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
model2.assignments.collect().foreach { a =>
  predictions2(a.cluster) += a.id
}
assert(predictions2.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
  }

  test("power iteration clustering on graph") {
/*
 We use the following graph to test PIC. All edges are assigned similarity 
1.0 except 0.1 for
 edge (3, 4).
 15-14 -13 -12
 |   |
 4 . 3 - 2  11
 |   | x |   |
 5   0 - 1  10
 |   |
 6 - 7 - 8 - 9
 */

val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0, 
3, 1.0), (1, 2, 1.0),
  (1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
  (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9, 
1.0), (9, 10, 1.0),
  (10, 11, 1.0), (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0))

val edges = similarities.flatMap { case (i, j, s) =>
  if (i != j) {
Seq(Edge(i, j, s), Edge(j, i, s))
  } else {
None
  }
}
val graph = Graph.fromEdges(sc.parallelize(edges, 2), 0.0)

val model = new PowerIterationClustering()
  .setK(2)
  .run(graph)
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
  predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))

val model2 = new PowerIterationClustering()
  .setK(2)
  .setInitializationMode("degree")
  .run(sc.parallelize(similarities, 2))
val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
model2.assignments.collect().foreach { a =>
  predictions2(a.cluster) += a.id
}
assert(predictions2.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
  }
{code}


was (Author: yanboliang):
After I removed [this 
line|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala#L388],
 
[this|https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringSuite.scala#L71]
 test case failed.
It's very strange that the following test cases use the same dataset, but one 
succeeds and the other fails.
{code}
test("power iteration clustering") {
/*
 We use the following graph to test PIC. All edges are assigned similarity 
1.0 except 0.1 for
 edge (3, 4).
 15-14 -13 -12
 |   |
 4 . 3 - 2  11
 |   | x |   |
 5   0 - 1  10
 |   |
 6 - 7 - 8 - 9
 */

val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0, 
3, 1.0), (1, 2, 1.0),
  (1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
  (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9, 
1.0), (9, 10, 1.0),
  

[jira] [Updated] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns

2015-12-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12363:

Component/s: (was: GraphX)

> PowerIterationClustering test case failed if we deprecated KMeans.setRuns
> -
>
> Key: SPARK-12363
> URL: https://issues.apache.org/jira/browse/SPARK-12363
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Yanbo Liang
>
> We plan to deprecate the `runs` parameter of KMeans; PowerIterationClustering 
> leverages KMeans to train its model.
> I removed the `setRuns` call used in PowerIterationClustering, but one of the 
> test cases then failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns

2015-12-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059753#comment-15059753
 ] 

Yanbo Liang edited comment on SPARK-12363 at 12/16/15 9:40 AM:
---

After I removed [this 
line|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala#L388],
 one of the test cases failed.
I pasted the test cases here. It's very strange that the following two cases 
are based on the same dataset, but one succeeds and the other fails.
Another clue is that when we use `setInitializationMode("degree")` to train the 
PIC model, both of the following cases pass, but if we use 
`setInitializationMode("random")`, the second test case fails.
{code}
test("power iteration clustering") {
/*
 We use the following graph to test PIC. All edges are assigned similarity 
1.0 except 0.1 for
 edge (3, 4).
 15-14 -13 -12
 |   |
 4 . 3 - 2  11
 |   | x |   |
 5   0 - 1  10
 |   |
 6 - 7 - 8 - 9
 */

val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0, 
3, 1.0), (1, 2, 1.0),
  (1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
  (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9, 
1.0), (9, 10, 1.0),
  (10, 11, 1.0), (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0))
val model = new PowerIterationClustering()
  .setK(2)
  .run(sc.parallelize(similarities, 2))
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
  predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))

val model2 = new PowerIterationClustering()
  .setK(2)
  .setInitializationMode("degree")
  .run(sc.parallelize(similarities, 2))
val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
model2.assignments.collect().foreach { a =>
  predictions2(a.cluster) += a.id
}
assert(predictions2.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
  }

  test("power iteration clustering on graph") {
/*
 We use the following graph to test PIC. All edges are assigned similarity 
1.0 except 0.1 for
 edge (3, 4).
 15-14 -13 -12
 |   |
 4 . 3 - 2  11
 |   | x |   |
 5   0 - 1  10
 |   |
 6 - 7 - 8 - 9
 */

val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0, 
3, 1.0), (1, 2, 1.0),
  (1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
  (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9, 
1.0), (9, 10, 1.0),
  (10, 11, 1.0), (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0))

val edges = similarities.flatMap { case (i, j, s) =>
  if (i != j) {
Seq(Edge(i, j, s), Edge(j, i, s))
  } else {
None
  }
}
val graph = Graph.fromEdges(sc.parallelize(edges, 2), 0.0)

val model = new PowerIterationClustering()
  .setK(2)
  .run(graph)
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
  predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))

val model2 = new PowerIterationClustering()
  .setK(2)
  .setInitializationMode("degree")
  .run(sc.parallelize(similarities, 2))
val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
model2.assignments.collect().foreach { a =>
  predictions2(a.cluster) += a.id
}
assert(predictions2.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
  }
{code}


was (Author: yanboliang):
After I removed [this 
line|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala#L388],
 
[this|https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringSuite.scala#L71]
 test case failed.
I pasted the test cases here. It's very strange that the following test cases 
are based on the same dataset, but one succeeds and the other fails.
Another clue is that when we use `setInitializationMode("degree")` to train the 
PIC model, both of the following cases pass, but if we use 
`setInitializationMode("random")`, the second test case fails.
{code}
test("power iteration clustering") {
/*
 We use the following graph to test PIC. All edges are assigned similarity 
1.0 except 0.1 for
 edge (3, 4).
 15-14 -13 -12
 |   |
 4 . 3 - 2  11
 |   | x |   |
 5   0 - 1  10
 |   |
 6 - 7 - 8 - 9
 */

val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0, 
3, 1.0), (1, 2, 1.0),
  (1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a wea

[jira] [Comment Edited] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns

2015-12-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059753#comment-15059753
 ] 

Yanbo Liang edited comment on SPARK-12363 at 12/16/15 9:41 AM:
---

This bug is very easy to reproduce. After I removed [this 
line|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala#L388],
 one of the test cases failed.
I pasted the test cases here. It's very strange that the following two cases 
are based on the same dataset, but one succeeds and the other fails.
Another clue is that when we use `setInitializationMode("degree")` to train the 
PIC model, both of the following cases pass, but if we use 
`setInitializationMode("random")`, the second test case fails.
{code}
test("power iteration clustering") {
/*
 We use the following graph to test PIC. All edges are assigned similarity 
1.0 except 0.1 for
 edge (3, 4).
 15-14 -13 -12
 |   |
 4 . 3 - 2  11
 |   | x |   |
 5   0 - 1  10
 |   |
 6 - 7 - 8 - 9
 */

val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0, 
3, 1.0), (1, 2, 1.0),
  (1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
  (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9, 
1.0), (9, 10, 1.0),
  (10, 11, 1.0), (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0))
val model = new PowerIterationClustering()
  .setK(2)
  .run(sc.parallelize(similarities, 2))
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
  predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))

val model2 = new PowerIterationClustering()
  .setK(2)
  .setInitializationMode("degree")
  .run(sc.parallelize(similarities, 2))
val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
model2.assignments.collect().foreach { a =>
  predictions2(a.cluster) += a.id
}
assert(predictions2.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
  }

  test("power iteration clustering on graph") {
/*
 We use the following graph to test PIC. All edges are assigned similarity 
1.0 except 0.1 for
 edge (3, 4).
 15-14 -13 -12
 |   |
 4 . 3 - 2  11
 |   | x |   |
 5   0 - 1  10
 |   |
 6 - 7 - 8 - 9
 */

val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0, 
3, 1.0), (1, 2, 1.0),
  (1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
  (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9, 
1.0), (9, 10, 1.0),
  (10, 11, 1.0), (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0))

val edges = similarities.flatMap { case (i, j, s) =>
  if (i != j) {
Seq(Edge(i, j, s), Edge(j, i, s))
  } else {
None
  }
}
val graph = Graph.fromEdges(sc.parallelize(edges, 2), 0.0)

val model = new PowerIterationClustering()
  .setK(2)
  .run(graph)
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
  predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))

val model2 = new PowerIterationClustering()
  .setK(2)
  .setInitializationMode("degree")
  .run(sc.parallelize(similarities, 2))
val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
model2.assignments.collect().foreach { a =>
  predictions2(a.cluster) += a.id
}
assert(predictions2.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
  }
{code}


was (Author: yanboliang):
After I removed [this 
line|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala#L388],
 one of the test cases failed.
I pasted the test cases here. It's very strange that the following two cases 
are based on the same dataset, but one succeeds and the other fails.
Another clue is that when we use `setInitializationMode("degree")` to train the 
PIC model, both of the following cases pass, but if we use 
`setInitializationMode("random")`, the second test case fails.
{code}
test("power iteration clustering") {
/*
 We use the following graph to test PIC. All edges are assigned similarity 
1.0 except 0.1 for
 edge (3, 4).
 15-14 -13 -12
 |   |
 4 . 3 - 2  11
 |   | x |   |
 5   0 - 1  10
 |   |
 6 - 7 - 8 - 9
 */

val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0, 
3, 1.0), (1, 2, 1.0),
  (1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
  (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9, 
1.0), (9, 10, 1

[jira] [Commented] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns

2015-12-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059763#comment-15059763
 ] 

Yanbo Liang commented on SPARK-12363:
-

cc [~mengxr] [~josephkb] [~viirya] Would you mind taking a look at this issue?

> PowerIterationClustering test case failed if we deprecated KMeans.setRuns
> -
>
> Key: SPARK-12363
> URL: https://issues.apache.org/jira/browse/SPARK-12363
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Yanbo Liang
>
> We plan to deprecate the `runs` parameter of KMeans; PowerIterationClustering 
> leverages KMeans to train its model.
> I removed the `setRuns` call used in PowerIterationClustering, but one of the 
> test cases then failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12345) Mesos cluster mode is broken

2015-12-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059769#comment-15059769
 ] 

Stavros Kontopoulos commented on SPARK-12345:
-

Adding to [~dragos]'s analysis (MesosClusterScheduler.scala):

The Mesos dispatcher passes that env variable to the driver at the line
builder.setEnvironment(envBuilder.build())
and, since the executor URI is set, it takes this path:

else if (executorUri.isDefined) {

In 1.6 the spark-submit script has changed:
https://github.com/apache/spark/blob/master/bin/spark-submit
The lines below no longer reset SPARK_HOME to the local path on the Mesos slave 
(the path from which spark-submit was invoked), because it has already been set:
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
And the driver is started with the spark-submit command anyway... 


> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12345) Mesos cluster mode is broken

2015-12-16 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059778#comment-15059778
 ] 

Saisai Shao commented on SPARK-12345:
-

A simple solution is to change the scripts so that they do not expose 
`SPARK_HOME`. From my understanding, a better solution for Mesos is to do the 
same as YARN and invoke the Java program directly, rather than relying on a 
script to start it.

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception

2015-12-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059788#comment-15059788
 ] 

Yanbo Liang commented on SPARK-12350:
-

I can reproduce this issue, but it is not caused by ML, because the transformed 
dataframe is still output at the end of the error log. Also, if we do not use 
spark-shell to run this program, it works well.

> VectorAssembler#transform() initially throws an exception
> -
>
> Key: SPARK-12350
> URL: https://issues.apache.org/jira/browse/SPARK-12350
> Project: Spark
>  Issue Type: Bug
>  Components: ML
> Environment: sparkShell command from sbt
>Reporter: Jakob Odersky
>
> Calling VectorAssembler.transform() initially throws an exception, subsequent 
> calls work.
> h3. Steps to reproduce
> In spark-shell,
> 1. Create a dummy dataframe and define an assembler
> {code}
> import org.apache.spark.ml.feature.VectorAssembler
> val df = sc.parallelize(List((1,2), (3,4))).toDF
> val assembler = new VectorAssembler().setInputCols(Array("_1", 
> "_2")).setOutputCol("features")
> {code}
> 2. Run
> {code}
> assembler.transform(df).show
> {code}
> Initially the following exception is thrown:
> {code}
> 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream 
> /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request 
> from /9.72.139.102:60610
> java.lang.IllegalArgumentException: requirement failed: File not found: 
> /classes/org/apache/spark/sql/catalyst/expressions/Object.class
>   at scala.Predef$.require(Predef.scala:233)
>   at 
> org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Subsequent calls work:
> {code}
> +---+---+-+
> | _1| _2| features|
> +---+---+-+
> |  1|  2|[1.0,2.0]|
> |  3|  4|[3.0,4.0]|
> +---+---+-+
> {code}
> It seems as though there is some internal state that is not initialized.
> [~iyounus] originally found this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12350) VectorAssembler#transform() initially throws an exception

2015-12-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059788#comment-15059788
 ] 

Yanbo Liang edited comment on SPARK-12350 at 12/16/15 10:15 AM:


I can reproduce this issue, but it is not caused by ML, because the transformed 
dataframe is still output at the end of the error log. Also, if we do not run 
this program in spark-shell, it works well.


was (Author: yanboliang):
I can reproduce this issue, but it is not caused by ML, because the transformed 
dataframe is still output at the end of the error log. Also, if we do not use 
spark-shell to run this program, it works well.

> VectorAssembler#transform() initially throws an exception
> -
>
> Key: SPARK-12350
> URL: https://issues.apache.org/jira/browse/SPARK-12350
> Project: Spark
>  Issue Type: Bug
>  Components: ML
> Environment: sparkShell command from sbt
>Reporter: Jakob Odersky
>
> Calling VectorAssembler.transform() initially throws an exception, subsequent 
> calls work.
> h3. Steps to reproduce
> In spark-shell,
> 1. Create a dummy dataframe and define an assembler
> {code}
> import org.apache.spark.ml.feature.VectorAssembler
> val df = sc.parallelize(List((1,2), (3,4))).toDF
> val assembler = new VectorAssembler().setInputCols(Array("_1", 
> "_2")).setOutputCol("features")
> {code}
> 2. Run
> {code}
> assembler.transform(df).show
> {code}
> Initially the following exception is thrown:
> {code}
> 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream 
> /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request 
> from /9.72.139.102:60610
> java.lang.IllegalArgumentException: requirement failed: File not found: 
> /classes/org/apache/spark/sql/catalyst/expressions/Object.class
>   at scala.Predef$.require(Predef.scala:233)
>   at 
> org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Subsequent calls work:
> {code}
> +---+---+-+
> | _1| _2| features|
> +---+---+-+
> |  1|  2|[1.0,2.0]|
> |  3|  4|[3.0,4.0]|
> +---+---+-+
> {code}
> It seems as though there is some internal state that is not initialized.
> [~iyounus] originally found this issue.



--
Thi

[jira] [Commented] (SPARK-12345) Mesos cluster mode is broken

2015-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059812#comment-15059812
 ] 

Sean Owen commented on SPARK-12345:
---

Yeah, but why is {{SPARK_HOME}} copied across machines to begin with? That 
seems like the more fundamental issue.

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12180) DataFrame.join() in PySpark gives misleading exception when column name exists on both side

2015-12-16 Thread Daniel Thomas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059816#comment-15059816
 ] 

Daniel Thomas commented on SPARK-12180:
---

Here is the code. Without renaming the columns it was throwing the exception.
{code}
sel_starts = starts.select('uuid', 'at').withColumnRenamed('uuid', 
'uuid_x').withColumnRenamed('at', 'at_x')
sel_closes = closes.select('uuid', 'at', 'session_uuid', 'total_session_sec')
start_close = sel_starts.join(sel_closes, sel_starts['uuid_x'] == 
sel_closes['session_uuid'])
start_close.cache()
start_close.take(1)
{code}

> DataFrame.join() in PySpark gives misleading exception when column name 
> exists on both side
> ---
>
> Key: SPARK-12180
> URL: https://issues.apache.org/jira/browse/SPARK-12180
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Daniel Thomas
>
> When joining two DataFrames on a column 'session_uuid' I got the following 
> exception, because both DataFrames had a column called 'at'. The exception is 
> misleading about the cause and about the column causing the problem. Renaming 
> the column fixed the exception.
> ---
> Py4JJavaError Traceback (most recent call last)
> /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /Applications/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling o484.join.
> : org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> session_uuid#3278 missing from 
> uuid_x#9078,total_session_sec#9115L,at#3248,session_uuid#9114,uuid#9117,at#9084
>  in operator !Join Inner, Some((uuid_x#9078 = session_uuid#3278));
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> AnalysisException Traceback (most recent call last)
>  in ()
>   1 sel_starts = starts.select('uuid', 'at').withColumnRenamed('uuid', 
> 'uuid_x')#.withColumnRenamed('at', 'at_x')
>   2 sel_closes = closes.select('uuid', 'at', 'session_uuid', 
> 'total_session_sec')
> > 3 start_close = sel_starts.join(sel_closes, sel_starts['uuid_x'] == 
> sel_closes['session_uuid'])
>   4 start_close.cache()
>   5 start_close.take(1)
> /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/dataframe.py in 
> join(self, other, on, how)
> 579 on = on[0]
> 580 if how is 

[jira] [Resolved] (SPARK-5269) BlockManager.dataDeserialize always creates a new serializer instance

2015-12-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5269.
--
Resolution: Won't Fix
  Assignee: (was: Matt Cheah)

> BlockManager.dataDeserialize always creates a new serializer instance
> -
>
> Key: SPARK-5269
> URL: https://issues.apache.org/jira/browse/SPARK-5269
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ivan Vergiliev
>  Labels: performance, serializers
>
> BlockManager.dataDeserialize always creates a new instance of the serializer, 
> which is pretty slow in some cases. I'm using Kryo serialization and have a 
> custom registrator, and its register method is showing up as taking about 15% 
> of the execution time in my profiles. This started happening after I 
> increased the number of keys in a job with a shuffle phase by a factor of 40.
> One solution I can think of is to create a ThreadLocal SerializerInstance for 
> the defaultSerializer, and only create a new one if a custom serializer is 
> passed in. AFAICT a custom serializer is passed only from 
> DiskStore.getValues, and that, on the other hand, depends on the serializer 
> passed to ExternalSorter. I don't know how often this is used, but I think 
> this can still be a good solution for the standard use case.
> Oh, and also - ExternalSorter already has a SerializerInstance, so if the 
> getValues method is called from a single thread, maybe we can pass that 
> directly?
> I'd be happy to try a patch but would probably need a confirmation from 
> someone that this approach would indeed work (or an idea for another).
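> For illustration, a minimal sketch of the thread-local idea (class and method 
> names are made up; this is not the actual BlockManager code):
> {code}
> import org.apache.spark.serializer.{Serializer, SerializerInstance}
> // Cache one SerializerInstance per thread for the default serializer and only
> // create a fresh instance when a custom serializer is passed in.
> class CachedSerializerInstances(defaultSerializer: Serializer) {
>   private val cached = new ThreadLocal[SerializerInstance] {
>     override def initialValue(): SerializerInstance = defaultSerializer.newInstance()
>   }
>   def instanceFor(custom: Option[Serializer]): SerializerInstance =
>     custom.map(_.newInstance()).getOrElse(cached.get())
> }
> {code}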



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12364) Add ML example for SparkR

2015-12-16 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-12364:
---

 Summary: Add ML example for SparkR
 Key: SPARK-12364
 URL: https://issues.apache.org/jira/browse/SPARK-12364
 Project: Spark
  Issue Type: Improvement
  Components: ML, SparkR
Reporter: Yanbo Liang


Add ML example for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12364) Add ML example for SparkR

2015-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059835#comment-15059835
 ] 

Apache Spark commented on SPARK-12364:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10324

> Add ML example for SparkR
> -
>
> Key: SPARK-12364
> URL: https://issues.apache.org/jira/browse/SPARK-12364
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> Add ML example for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12364) Add ML example for SparkR

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12364:


Assignee: (was: Apache Spark)

> Add ML example for SparkR
> -
>
> Key: SPARK-12364
> URL: https://issues.apache.org/jira/browse/SPARK-12364
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> Add ML example for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2015-12-16 Thread Nikita Tarasenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059840#comment-15059840
 ] 

Nikita Tarasenko commented on SPARK-12177:
--

The current implementation doesn't include SSL support.
I could add SSL support once my current changes are accepted. I think for the 
Spark integration we could use the spark.ssl.* properties as the global 
configuration and add new spark.ssl.kafka.* properties for overriding the 
global ones.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one, so I added support for the new consumer 
> API. I made separate classes in the package 
> org.apache.spark.streaming.kafka.v09 with the changed API. I didn't remove 
> the old classes, for better backward compatibility: users will not need to 
> change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12364) Add ML example for SparkR

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12364:


Assignee: Apache Spark

> Add ML example for SparkR
> -
>
> Key: SPARK-12364
> URL: https://issues.apache.org/jira/browse/SPARK-12364
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Add ML example for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10114) Add optional getters to the spark.sql.Row

2015-12-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10114.
---
Resolution: Duplicate

> Add optional getters to the spark.sql.Row
> -
>
> Key: SPARK-10114
> URL: https://issues.apache.org/jira/browse/SPARK-10114
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: etienne
>Priority: Trivial
>
> Add the functions
> {code}
> def getAsOpt[T](i: Int): Option[T]
> def getAsOption[T](fieldName: String): Option[T]
> {code}
> [PR here|https://github.com/apache/spark/pull/8312#issuecomment-132607361]
> This would clarify code that extracts an optional value from a Row.
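> A hedged sketch of what these could look like as an enrichment class outside 
> of Spark (names and details here are illustrative, not the linked PR's 
> implementation):
> {code}
> import org.apache.spark.sql.Row
> import scala.util.Try
> object RowSyntax {
>   implicit class RowOptionalGetters(row: Row) {
>     // None for out-of-range indices and null values; note that getAs[T] is an
>     // unchecked cast, so a wrong T still fails at the call site.
>     def getAsOpt[T](i: Int): Option[T] =
>       if (i < 0 || i >= row.length || row.isNullAt(i)) None
>       else Some(row.getAs[T](i))
>     // None for unknown field names as well (fieldIndex throws for those).
>     def getAsOption[T](fieldName: String): Option[T] =
>       Try(row.fieldIndex(fieldName)).toOption.flatMap(i => getAsOpt[T](i))
>   }
> }
> {code}
> Usage would then read, e.g., {{row.getAsOption[String]("name").getOrElse("unknown")}} 
> after {{import RowSyntax._}}.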



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12365) Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called

2015-12-16 Thread Ted Yu (JIRA)
Ted Yu created SPARK-12365:
--

 Summary: Use ShutdownHookManager where 
Runtime.getRuntime.addShutdownHook() is called
 Key: SPARK-12365
 URL: https://issues.apache.org/jira/browse/SPARK-12365
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Ted Yu
Priority: Minor


SPARK-9886 fixed a call to Runtime.getRuntime.addShutdownHook() in 
ExternalBlockStore.scala.

This issue intends to address the remaining usages of 
Runtime.getRuntime.addShutdownHook().
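
A rough before/after sketch of the intended change inside Spark's own code 
(ShutdownHookManager is a private[spark] utility, and the signature below is an 
assumption from memory, not a guaranteed public API):

{code}
def cleanup(): Unit = ()   // placeholder for the actual resource-release logic

// Before: a raw JVM hook that Spark can neither order nor remove.
Runtime.getRuntime.addShutdownHook(new Thread {
  override def run(): Unit = cleanup()
})

// After: register through Spark's ShutdownHookManager so the hook takes part
// in Spark's priority ordering and can be removed when no longer needed.
import org.apache.spark.util.ShutdownHookManager
val hookRef = ShutdownHookManager.addShutdownHook { () => cleanup() }
{code}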



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12365) Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12365:


Assignee: Apache Spark

> Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called
> 
>
> Key: SPARK-12365
> URL: https://issues.apache.org/jira/browse/SPARK-12365
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-9886 fixed call to Runtime.getRuntime.addShutdownHook() in 
> ExternalBlockStore.scala
> This issue intends to address remaining usage of 
> Runtime.getRuntime.addShutdownHook()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12365) Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called

2015-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059861#comment-15059861
 ] 

Apache Spark commented on SPARK-12365:
--

User 'ted-yu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10325

> Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called
> 
>
> Key: SPARK-12365
> URL: https://issues.apache.org/jira/browse/SPARK-12365
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Minor
>
> SPARK-9886 fixed call to Runtime.getRuntime.addShutdownHook() in 
> ExternalBlockStore.scala
> This issue intends to address remaining usage of 
> Runtime.getRuntime.addShutdownHook()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12366) IllegalArgumentException: requirement failed: File not found: ...sql/catalyst/expressions/GeneratedClass.class when df.show

2015-12-16 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-12366:
---

 Summary: IllegalArgumentException: requirement failed: File not 
found: ...sql/catalyst/expressions/GeneratedClass.class when df.show
 Key: SPARK-12366
 URL: https://issues.apache.org/jira/browse/SPARK-12366
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Jacek Laskowski


When executing {{df.show}} with a DataFrame read from JSON, {{spark-shell}} printed 
out *loads* of {{java.lang.IllegalArgumentException}}s. It was fine the second 
time I called {{df.show}}. See below.

{code}
➜  spark git:(master) ✗ ./bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
WARN NativeCodeLoader: Unable to load native-hadoop library for your 
platform... using builtin-java classes where applicable
Spark context available as sc (master = local[*], app id = local-1450265501090).
WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
SQL context available as sqlContext.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
      /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = sqlContext.read.json("examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show
ERROR TransportRequestHandler: Error opening stream 
/classes/org/apache/spark/sql/catalyst/expressions/GeneratedClass.class for 
request from /172.20.4.141:55136
java.lang.IllegalArgumentException: requirement failed: File not found: 
/classes/org/apache/spark/sql/catalyst/expressions/GeneratedClass.class
at scala.Predef$.require(Predef.scala:219)
at 
org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60)
at 
org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
ERROR NettyRpcEnv: Error downloading stream 
/classes/org/apache/spark/sql/catalyst/expressions/GeneratedClass.class.
java.lang.RuntimeException: java.lang.IllegalArgumentException: requirement 
failed: File not found: 
/classes/org/apache/spark/sql/catalyst/

[jira] [Created] (SPARK-12367) NoSuchElementException during prediction with Random Forest Regressor

2015-12-16 Thread Eugene Morozov (JIRA)
Eugene Morozov created SPARK-12367:
--

 Summary: NoSuchElementException during prediction with Random 
Forest Regressor
 Key: SPARK-12367
 URL: https://issues.apache.org/jira/browse/SPARK-12367
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.5.2
Reporter: Eugene Morozov


I'm consistently getting "java.util.NoSuchElementException: key not found: 
1.0". Code, input data and stack trace attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12367) NoSuchElementException during prediction with Random Forest Regressor

2015-12-16 Thread Eugene Morozov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Morozov updated SPARK-12367:
---
Attachment: complete-stack-trace.log
input.gz
CodeThatGivesANoSuchElementException.java

> NoSuchElementException during prediction with Random Forest Regressor
> -
>
> Key: SPARK-12367
> URL: https://issues.apache.org/jira/browse/SPARK-12367
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.2
>Reporter: Eugene Morozov
> Attachments: CodeThatGivesANoSuchElementException.java, 
> complete-stack-trace.log, input.gz
>
>
> I'm consistently getting "java.util.NoSuchElementException: key not found: 
> 1.0". Code, input data and stack trace attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12367) NoSuchElementException during prediction with Random Forest Regressor

2015-12-16 Thread Eugene Morozov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Morozov updated SPARK-12367:
---
Description: 
I'm consistently getting "java.util.NoSuchElementException: key not found: 1.0" 
while trying to do a prediction on a trained model.
I use ml package - Pipeline API. The model is successfully trained, I see some 
stats in the output: total, findSplitsBins, findBestSplits, chooseSplits. I can 
even serialize it into a file and use afterwards, but the prediction is broken 
somehow.

Code, input data and stack trace attached.

  was:I'm consistently getting "java.util.NoSuchElementException: key not 
found: 1.0". Code, input data and stack trace attached.


> NoSuchElementException during prediction with Random Forest Regressor
> -
>
> Key: SPARK-12367
> URL: https://issues.apache.org/jira/browse/SPARK-12367
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.2
>Reporter: Eugene Morozov
> Attachments: CodeThatGivesANoSuchElementException.java, 
> complete-stack-trace.log, input.gz
>
>
> I'm consistently getting "java.util.NoSuchElementException: key not found: 
> 1.0" while trying to do a prediction on a trained model.
> I use ml package - Pipeline API. The model is successfully trained, I see 
> some stats in the output: total, findSplitsBins, findBestSplits, 
> chooseSplits. I can even serialize it into a file and use afterwards, but the 
> prediction is broken somehow.
> Code, input data and stack trace attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10618) Refactoring and adding test for Mesos coarse-grained Scheduler

2015-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059905#comment-15059905
 ] 

Apache Spark commented on SPARK-10618:
--

User 'SleepyThread' has created a pull request for this issue:
https://github.com/apache/spark/pull/10326

> Refactoring and adding test for Mesos coarse-grained Scheduler
> --
>
> Key: SPARK-10618
> URL: https://issues.apache.org/jira/browse/SPARK-10618
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Akash Mishra
>Priority: Trivial
>
> The various conditions for checking whether a Mesos offer is valid for scheduling 
> are cluttered in the resourceOffer method and have no unit test. This is a 
> refactoring JIRA to extract the logic into a method and test that method.
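
A small sketch of the kind of extraction being proposed (names and fields are 
illustrative, not the actual Spark code):

{code}
// A pure predicate that a unit test can call directly, instead of the checks
// being inlined in resourceOffers().
case class OfferResources(cpus: Double, memMb: Int)

def meetsConstraints(offer: OfferResources, requiredCpus: Double, requiredMemMb: Int): Boolean =
  offer.cpus >= requiredCpus && offer.memMb >= requiredMemMb

// e.g. in a test:
assert(meetsConstraints(OfferResources(cpus = 4.0, memMb = 8192), requiredCpus = 1.0, requiredMemMb = 1024))
{code}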



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12345) Mesos cluster mode is broken

2015-12-16 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059937#comment-15059937
 ] 

Saisai Shao commented on SPARK-12345:
-

I think by default the Spark Mesos implementation ships all environment 
variables to the remote nodes, which includes {{SPARK_HOME}}. Mesos itself 
invokes the Spark application through scripts, and inside the scripts we 
honor {{SPARK_HOME}} if it is already set, so that's the problem.

Basically, I think there are two sides we could fix:

1. We should not expose {{SPARK_HOME}} to the environment if it is not set 
explicitly. Otherwise cases like this one can potentially run into problems.
2. Spark on Mesos should not blindly ship all environment variables to the 
remote side (see the sketch below). The best way for Spark on Mesos is to 
invoke the Java program directly, like YARN currently does, rather than 
relying on scripts.
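
A minimal sketch of the filtering idea in option 2 (an illustrative helper, not 
the change in the linked branch):

{code}
// Drop known host-specific variables before building the environment that is
// shipped with the Mesos task.
val hostOnlyVars = Set("SPARK_HOME", "JAVA_HOME")

def shippableEnv(env: Map[String, String]): Map[String, String] =
  env.filter { case (key, _) => !hostOnlyVars.contains(key) }

// shippableEnv(sys.env) keeps the rest of the environment but leaves the
// driver's local SPARK_HOME out of what the remote driver/executor sees on Mesos.
{code}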

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12345) Mesos cluster mode is broken

2015-12-16 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059954#comment-15059954
 ] 

Saisai Shao commented on SPARK-12345:
-

After a quick test without exporting {{SPARK_HOME}}, the application fails 
to start: the code in {{SparkLauncher}} needs {{SPARK_HOME}}. So solution 2, 
filtering out {{SPARK_HOME}} where necessary, is the only choice.

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-12-16 Thread Chandra Sekhar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059959#comment-15059959
 ] 

Chandra Sekhar commented on SPARK-10625:


Hi, I built Spark with the pull request and now it's able to handle JDBC drivers 
that add unserializable objects to connection properties.

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by 
> adding new objects into the connection properties, which are then reused by 
> Spark and deployed to workers. When some of these new objects are not 
> serializable, this triggers an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrates the problem 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12349) Make spark.ml PCAModel load backwards compatible

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12349:


Assignee: (was: Apache Spark)

> Make spark.ml PCAModel load backwards compatible
> 
>
> Key: SPARK-12349
> URL: https://issues.apache.org/jira/browse/SPARK-12349
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>
> [SPARK-11530] introduced a new data member {{explainedVariance}} to spark.ml 
> PCAModel and modified the save/load method.  This method needs to be updated 
> in order to load models saved with Spark 1.6.
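
For illustration only, a hedged sketch of the kind of branching a 
backwards-compatible load could do on the Spark version recorded in the saved 
metadata (the column names are assumptions, not the real PCAModel schema):

{code}
// The saved metadata carries the Spark version, so the loader can pick the
// on-disk layout accordingly.
def columnsToRead(sparkVersionInMetadata: String): Seq[String] =
  if (sparkVersionInMetadata.startsWith("1.6")) {
    Seq("pc")                        // old layout: only the principal components
  } else {
    Seq("pc", "explainedVariance")   // new layout adds the explained variance
  }
{code}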



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11759) Spark task on mesos with docker fails with sh: 1: /opt/spark/bin/spark-class: not found

2015-12-16 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060020#comment-15060020
 ] 

Iulian Dragos commented on SPARK-11759:
---

Are you using 1.6.0 release candidates, or a snapshot version?

> Spark task on mesos with docker fails with sh: 1: /opt/spark/bin/spark-class: 
> not found
> ---
>
> Key: SPARK-11759
> URL: https://issues.apache.org/jira/browse/SPARK-11759
> Project: Spark
>  Issue Type: Question
>  Components: Deploy, Mesos
>Reporter: Luis Alves
>
> I'm using Spark 1.5.1 and Mesos 0.25 in cluster mode. I have the 
> spark-dispatcher running, and run spark-submit. The driver is launched, but 
> it fails because the task it launches apparently fails.
> In the logs of the launched task I can see the following error: 
> sh: 1: /opt/spark/bin/spark-class: not found
> I checked my docker image and /opt/spark/bin/spark-class exists. I then 
> noticed that it's being run with sh, so I tried to run (in the docker image) 
> the following:
> sh /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
> It fails with the following error:
> spark-class: 73: spark-class: Syntax error: "(" unexpected
> Is this an error in Spark?
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12349) Make spark.ml PCAModel load backwards compatible

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12349:


Assignee: Apache Spark

> Make spark.ml PCAModel load backwards compatible
> 
>
> Key: SPARK-12349
> URL: https://issues.apache.org/jira/browse/SPARK-12349
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> [SPARK-11530] introduced a new data member {{explainedVariance}} to spark.ml 
> PCAModel and modified the save/load method.  This method needs to be updated 
> in order to load models saved with Spark 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12345) Mesos cluster mode is broken

2015-12-16 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060035#comment-15060035
 ] 

Saisai Shao commented on SPARK-12345:
-

Here is one solution 
(https://github.com/apache/spark/compare/master...jerryshao:SPARK-12345), would 
you mind trying it in your cluster? Thanks a lot.

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12349) Make spark.ml PCAModel load backwards compatible

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12349:


Assignee: Apache Spark

> Make spark.ml PCAModel load backwards compatible
> 
>
> Key: SPARK-12349
> URL: https://issues.apache.org/jira/browse/SPARK-12349
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> [SPARK-11530] introduced a new data member {{explainedVariance}} to spark.ml 
> PCAModel and modified the save/load method.  This method needs to be updated 
> in order to load models saved with Spark 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12349) Make spark.ml PCAModel load backwards compatible

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12349:


Assignee: (was: Apache Spark)

> Make spark.ml PCAModel load backwards compatible
> 
>
> Key: SPARK-12349
> URL: https://issues.apache.org/jira/browse/SPARK-12349
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>
> [SPARK-11530] introduced a new data member {{explainedVariance}} to spark.ml 
> PCAModel and modified the save/load method.  This method needs to be updated 
> in order to load models saved with Spark 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12349) Make spark.ml PCAModel load backwards compatible

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12349:


Assignee: (was: Apache Spark)

> Make spark.ml PCAModel load backwards compatible
> 
>
> Key: SPARK-12349
> URL: https://issues.apache.org/jira/browse/SPARK-12349
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>
> [SPARK-11530] introduced a new data member {{explainedVariance}} to spark.ml 
> PCAModel and modified the save/load method.  This method needs to be updated 
> in order to load models saved with Spark 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12349) Make spark.ml PCAModel load backwards compatible

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12349:


Assignee: Apache Spark

> Make spark.ml PCAModel load backwards compatible
> 
>
> Key: SPARK-12349
> URL: https://issues.apache.org/jira/browse/SPARK-12349
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> [SPARK-11530] introduced a new data member {{explainedVariance}} to spark.ml 
> PCAModel and modified the save/load method.  This method needs to be updated 
> in order to load models saved with Spark 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12345) Mesos cluster mode is broken

2015-12-16 Thread Luc Bourlier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060153#comment-15060153
 ] 

Luc Bourlier commented on SPARK-12345:
--

I have almost the same fix, with the same logic: do not carry `SPARK_HOME` 
information across systems. But I changed it on the SparkSubmit side:

https://github.com/skyluc/spark/commit/5b6eaa5bf936ef42d46b53564816d62b2aa44e86

I'm running tests to check that Mesos is working fine with those changes.

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12345) Mesos cluster mode is broken

2015-12-16 Thread Luc Bourlier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060153#comment-15060153
 ] 

Luc Bourlier edited comment on SPARK-12345 at 12/16/15 3:33 PM:


I have almost the same fix, with the same logic: do not carry 
{{SPARK_HOME}} information across systems. But I changed it on the SparkSubmit side:

https://github.com/skyluc/spark/commit/5b6eaa5bf936ef42d46b53564816d62b2aa44e86

I'm running tests to check that Mesos is working fine with those changes.


was (Author: skyluc):
I have almost the same fix, which is the same logic: do not carry `SPARK_HOME` 
information across systems. But I changed it in SparkSubmit side:

https://github.com/skyluc/spark/commit/5b6eaa5bf936ef42d46b53564816d62b2aa44e86

I'm running tests to check that Mesos is working fine with those changes.

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12368) Better doc for the binary classification evaluator setMetricName method

2015-12-16 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-12368:
---

 Summary: Better doc for the binary classification evaluator 
setMetricName method
 Key: SPARK-12368
 URL: https://issues.apache.org/jira/browse/SPARK-12368
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, ML
Reporter: Benjamin Fradet
Priority: Minor


For the BinaryClassificationEvaluator, the scaladoc doesn't mention that 
"areaUnderPR" is supported, only that the default is "areaUnderROC".

Also, in the documentation, it is said that:
"The default metric used to choose the best ParamMap can be overriden by the 
setMetric method in each of these evaluators."
However, the method is called setMetricName.
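
For reference, a short usage example of the method as it actually exists:

{code}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderPR")   // note: setMetricName, not setMetric

// "areaUnderROC" (the default) and "areaUnderPR" are both accepted.
{code}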



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12368) Better doc for the binary classification evaluator setMetricName method

2015-12-16 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060167#comment-15060167
 ] 

Benjamin Fradet commented on SPARK-12368:
-

I've started working on this.

> Better doc for the binary classification evaluator setMetricName method
> ---
>
> Key: SPARK-12368
> URL: https://issues.apache.org/jira/browse/SPARK-12368
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Benjamin Fradet
>Priority: Minor
>
> For the BinaryClassificationEvaluator, the scaladoc doesn't mention that 
> "areaUnderPR" is supported, only that the default is "areaUnderROC".
> Also, in the documentation, it is said that:
> "The default metric used to choose the best ParamMap can be overriden by the 
> setMetric method in each of these evaluators."
> However, the method is called setMetricName.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9042) Spark SQL incompatibility if security is enforced on the Hive warehouse

2015-12-16 Thread Charmee Patel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060177#comment-15060177
 ] 

Charmee Patel commented on SPARK-9042:
--

I don't agree with closing this issue.

We have an actual production environment where Cloudera Support has configured 
Sentry. Read/insert on a specific table works fine. But as soon as we have 
queries that create new partitions, we get the exact same permissions issue. Any 
queries that alter information in the Hive metastore are blocked. "Create table 
as" is also blocked.

Look at the comment above by Vijay Singh - he nails down that the problem is how 
HiveContext works directly with the Hive metastore, and Sentry blocks that.

> Spark SQL incompatibility if security is enforced on the Hive warehouse
> ---
>
> Key: SPARK-9042
> URL: https://issues.apache.org/jira/browse/SPARK-9042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Nitin Kak
>
> Hive queries executed from Spark using HiveContext use the CLI to create the 
> query plan and then access the Hive table directories (under 
> /user/hive/warehouse/) directly. This gives an AccessControlException if Apache 
> Sentry is installed:
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kakn, access=READ_EXECUTE, 
> inode="/user/hive/warehouse/mastering.db/sample_table":hive:hive:drwxrwx--t 
> With Apache Sentry, only the "hive" user (created only for Sentry) has 
> permission to access the hive warehouse directory. After Sentry is 
> installed, all queries are directed to HiveServer2, which changes the 
> invoking user to "hive" and then accesses the hive warehouse 
> directory. However, HiveContext does not execute the query through 
> HiveServer2, which leads to the issue. Here is an example of executing a 
> hive query through HiveContext.
> val hqlContext = new HiveContext(sc) // Create context to run Hive queries 
> val pairRDD = hqlContext.sql(hql) // where hql is the string with hive query 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12368) Better doc for the binary classification evaluator' metricName

2015-12-16 Thread Benjamin Fradet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Fradet updated SPARK-12368:

Summary: Better doc for the binary classification evaluator' metricName  
(was: Better doc for the binary classification evaluator setMetricName method)

> Better doc for the binary classification evaluator' metricName
> --
>
> Key: SPARK-12368
> URL: https://issues.apache.org/jira/browse/SPARK-12368
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Benjamin Fradet
>Priority: Minor
>
> For the BinaryClassificationEvaluator, the scaladoc doesn't mention that 
> "areaUnderPR" is supported, only that the default is "areaUnderROC".
> Also, in the documentation, it is said that:
> "The default metric used to choose the best ParamMap can be overriden by the 
> setMetric method in each of these evaluators."
> However, the method is called setMetricName.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9042) Spark SQL incompatibility if security is enforced on the Hive warehouse

2015-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060185#comment-15060185
 ] 

Sean Owen commented on SPARK-9042:
--

As I say -- AFAIK this problem is resolved by enabling the Sentry plugin for 
HDFS. Did you do that?

> Spark SQL incompatibility if security is enforced on the Hive warehouse
> ---
>
> Key: SPARK-9042
> URL: https://issues.apache.org/jira/browse/SPARK-9042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Nitin Kak
>
> Hive queries executed from Spark using HiveContext use the CLI to create the 
> query plan and then access the Hive table directories (under 
> /user/hive/warehouse/) directly. This gives an AccessControlException if Apache 
> Sentry is installed:
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kakn, access=READ_EXECUTE, 
> inode="/user/hive/warehouse/mastering.db/sample_table":hive:hive:drwxrwx--t 
> With Apache Sentry, only the "hive" user (created only for Sentry) has 
> permission to access the hive warehouse directory. After Sentry is 
> installed, all queries are directed to HiveServer2, which changes the 
> invoking user to "hive" and then accesses the hive warehouse 
> directory. However, HiveContext does not execute the query through 
> HiveServer2, which leads to the issue. Here is an example of executing a 
> hive query through HiveContext.
> val hqlContext = new HiveContext(sc) // Create context to run Hive queries 
> val pairRDD = hqlContext.sql(hql) // where hql is the string with hive query 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9042) Spark SQL incompatibility if security is enforced on the Hive warehouse

2015-12-16 Thread Charmee Patel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060196#comment-15060196
 ] 

Charmee Patel commented on SPARK-9042:
--

The Sentry plugin is only managing access to HDFS. We have no issues reading
data from tables based on the appropriate permissions. The production cluster
where we encounter this issue was configured for Sentry by the Cloudera team.
But we can follow up one more time. Vijay Singh, who commented on this
issue and was on the Cloudera team, helped us narrow down the issue to the Hive
metastore + Sentry being the culprit.





> Spark SQL incompatibility if security is enforced on the Hive warehouse
> ---
>
> Key: SPARK-9042
> URL: https://issues.apache.org/jira/browse/SPARK-9042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Nitin Kak
>
> Hive queries executed from Spark using HiveContext use the CLI to create the 
> query plan and then access the Hive table directories (under 
> /user/hive/warehouse/) directly. This gives an AccessControlException if Apache 
> Sentry is installed:
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kakn, access=READ_EXECUTE, 
> inode="/user/hive/warehouse/mastering.db/sample_table":hive:hive:drwxrwx--t 
> With Apache Sentry, only the "hive" user (created only for Sentry) has 
> permission to access the hive warehouse directory. After Sentry is 
> installed, all queries are directed to HiveServer2, which changes the 
> invoking user to "hive" and then accesses the hive warehouse 
> directory. However, HiveContext does not execute the query through 
> HiveServer2, which leads to the issue. Here is an example of executing a 
> hive query through HiveContext.
> val hqlContext = new HiveContext(sc) // Create context to run Hive queries 
> val pairRDD = hqlContext.sql(hql) // where hql is the string with hive query 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9042) Spark SQL incompatibility if security is enforced on the Hive warehouse

2015-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060201#comment-15060201
 ] 

Sean Owen commented on SPARK-9042:
--

Yea, this is coming from our support. (Can we take this to Cloudera land until 
it's clear?) I was told this was resolved and it was a Sentry plugin config 
issue.

> Spark SQL incompatibility if security is enforced on the Hive warehouse
> ---
>
> Key: SPARK-9042
> URL: https://issues.apache.org/jira/browse/SPARK-9042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Nitin Kak
>
> Hive queries executed from Spark using HiveContext use the CLI to create the 
> query plan and then access the Hive table directories (under 
> /user/hive/warehouse/) directly. This gives an AccessControlException if Apache 
> Sentry is installed:
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kakn, access=READ_EXECUTE, 
> inode="/user/hive/warehouse/mastering.db/sample_table":hive:hive:drwxrwx--t 
> With Apache Sentry, only the "hive" user (created only for Sentry) has 
> permission to access the hive warehouse directory. After Sentry is 
> installed, all queries are directed to HiveServer2, which changes the 
> invoking user to "hive" and then accesses the hive warehouse 
> directory. However, HiveContext does not execute the query through 
> HiveServer2, which leads to the issue. Here is an example of executing a 
> hive query through HiveContext.
> val hqlContext = new HiveContext(sc) // Create context to run Hive queries 
> val pairRDD = hqlContext.sql(hql) // where hql is the string with hive query 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12288) Support UnsafeRow in Coalesce/Except/Intersect

2015-12-16 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12288:
---
Assignee: Xiao Li

> Support UnsafeRow in Coalesce/Except/Intersect
> --
>
> Key: SPARK-12288
> URL: https://issues.apache.org/jira/browse/SPARK-12288
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12293) Support UnsafeRow in LocalTableScan

2015-12-16 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12293:
---
Assignee: Liang-Chi Hsieh  (was: Apache Spark)

> Support UnsafeRow in LocalTableScan
> ---
>
> Key: SPARK-12293
> URL: https://issues.apache.org/jira/browse/SPARK-12293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Liang-Chi Hsieh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9042) Spark SQL incompatibility if security is enforced on the Hive warehouse

2015-12-16 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060208#comment-15060208
 ] 

Andrew Ray commented on SPARK-9042:
---

Sean, I think there are a couple of issues going on here. In my experience with 
the Sentry HDFS plugin, you can read tables just fine from Spark (which was the 
stated issue here). However, there are other similar issues that are real: you 
can't create or modify any tables. There are two problems there. The first is HDFS 
permissions: the Sentry HDFS plugin only gives you read access. The second is Hive 
metastore permissions: even if you create the table in some other HDFS location 
that you have write access to, you will still fail because you can't make 
modifications to the Hive metastore, which has a whitelist of users that is by 
default set to just hive and impala.

> Spark SQL incompatibility if security is enforced on the Hive warehouse
> ---
>
> Key: SPARK-9042
> URL: https://issues.apache.org/jira/browse/SPARK-9042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Nitin Kak
>
> Hive queries executed from Spark using HiveContext use the CLI to create the 
> query plan and then access the Hive table directories (under 
> /user/hive/warehouse/) directly. This gives an AccessControlException if Apache 
> Sentry is installed:
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kakn, access=READ_EXECUTE, 
> inode="/user/hive/warehouse/mastering.db/sample_table":hive:hive:drwxrwx--t 
> With Apache Sentry, only the "hive" user (created only for Sentry) has 
> permission to access the hive warehouse directory. After Sentry is 
> installed, all queries are directed to HiveServer2, which changes the 
> invoking user to "hive" and then accesses the hive warehouse 
> directory. However, HiveContext does not execute the query through 
> HiveServer2, which leads to the issue. Here is an example of executing a 
> hive query through HiveContext.
> val hqlContext = new HiveContext(sc) // Create context to run Hive queries 
> val pairRDD = hqlContext.sql(hql) // where hql is the string with hive query 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12369) ataFrameReader fails on globbing parquet paths

2015-12-16 Thread Yana Kadiyska (JIRA)
Yana Kadiyska created SPARK-12369:
-

 Summary: ataFrameReader fails on globbing parquet paths
 Key: SPARK-12369
 URL: https://issues.apache.org/jira/browse/SPARK-12369
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: Yana Kadiyska


Start with a list of parquet paths where some or all do not exist:

{noformat}
val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")

 sqlContext.read.parquet(paths:_*)
java.lang.NullPointerException
at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
at 
org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
at 
org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
{noformat}

It would be better to produce a DataFrame from the paths that do exist and log 
a warning that a path was missing. Not sure about the "all paths are missing" case -- 
it could return an empty DataFrame with no schema or a nicer exception... But I would 
prefer not to have to pre-validate paths.
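
A workaround sketch of exactly the pre-validation the reporter would rather 
avoid, assuming the sc/sqlContext and paths from the snippet above:

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Expand each glob first and keep only the patterns that matched something.
val fs = FileSystem.get(sc.hadoopConfiguration)
val existing = paths.filter { p =>
  val matched = fs.globStatus(new Path(p))
  matched != null && matched.nonEmpty
}

val df = if (existing.nonEmpty) Some(sqlContext.read.parquet(existing: _*)) else None
{code}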





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12369) DataFrameReader fails on globbing parquet paths

2015-12-16 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-12369:
--
Summary: DataFrameReader fails on globbing parquet paths  (was: 
ataFrameReader fails on globbing parquet paths)

> DataFrameReader fails on globbing parquet paths
> ---
>
> Key: SPARK-12369
> URL: https://issues.apache.org/jira/browse/SPARK-12369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yana Kadiyska
>
> Start with a list of parquet paths where some or all do not exist:
> {noformat}
> val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")
>  sqlContext.read.parquet(paths:_*)
> java.lang.NullPointerException
> at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
> {noformat}
> It would be better to produce a dataframe from the paths that do exist and 
> log a warning that a path was missing. Not sure for "all paths are missing 
> case" -- could return an emptyDF with no schema or a nicer exception...But I 
> would prefer not to have to pre-validate paths



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12179) Spark SQL get different result with the same code

2015-12-16 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060213#comment-15060213
 ] 

Davies Liu commented on SPARK-12179:


I think this UDF is not thread-safe; rowNum and comparedColumn will be 
updated by multiple threads.

> Spark SQL get different result with the same code
> -
>
> Key: SPARK-12179
> URL: https://issues.apache.org/jira/browse/SPARK-12179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 
> 1.5.2, 1.5.3
> Environment: hadoop version: 2.5.0-cdh5.3.2
> spark version: 1.5.3
> run mode: yarn-client
>Reporter: Tao Li
>Priority: Critical
>
> I run the SQL in yarn-client mode, but get a different result each time.
> As you can see in the example, I get different shuffle write with the same 
> shuffle read in two jobs with the same code.
> Some of my Spark apps run well, but some always hit this problem. And I met 
> this problem on Spark 1.3, 1.4 and 1.5.
> Can you give me some suggestions about the possible causes, or how I can figure 
> out the problem?
> 1. First Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.8 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54934
> 2. Second Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.6 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54905



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12369) DataFrameReader fails on globbing parquet paths

2015-12-16 Thread Yana Kadiyska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yana Kadiyska updated SPARK-12369:
--
Description: 
Start with a list of parquet paths where some or all do not exist:

{noformat}
val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")

 sqlContext.read.parquet(paths:_*)
java.lang.NullPointerException
at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
at 
org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
at 
org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
{noformat}

It would be better to produce a DataFrame from the paths that do exist and log 
a warning that a path was missing. Not sure about the "all paths are missing" case -- 
probably return an empty DataFrame with no schema, since that method already does so on 
an empty path list. But I would prefer not to have to pre-validate paths.



  was:
Start with a list of parquet paths where some or all do not exist:

{noformat}
val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")

 sqlContext.read.parquet(paths:_*)
java.lang.NullPointerException
at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
at 
org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
at 
org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
{noformat}

It would be better to produce a dataframe from the paths that do exist and log 
a warning that a path was missing. Not sure for "all paths are missing case" -- 
could return an emptyDF with no schema or a nicer exception...But I would 
prefer not to have to pre-validate paths




> DataFrameReader fails on globbing parquet paths
> ---
>
> Key: SPARK-12369
> URL: https://issues.apache.org/jira/browse/SPARK-12369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yana Kadiyska
>
> Start with a list of parquet paths where some or all do not exist:
> {noformat}
> val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")
>  sqlContext.read.parquet(paths:_*)
> java.lang.NullPointerException
> at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
> {noformat}
> It would be better to produce a dataframe from the paths that do exist and log 
> a warning that a path was missing.

[jira] [Commented] (SPARK-12179) Spark SQL get different result with the same code

2015-12-16 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060217#comment-15060217
 ] 

Davies Liu commented on SPARK-12179:


Which version of Spark are you using?
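
For reference, one quick way for the reporter to confirm the exact running version from the shell (assuming an active SparkContext named {{sc}}, as in spark-shell):

{code}
// Prints the version string of the running Spark build, e.g. "1.5.3".
println(sc.version)
{code}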

> Spark SQL get different result with the same code
> -
>
> Key: SPARK-12179
> URL: https://issues.apache.org/jira/browse/SPARK-12179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 
> 1.5.2, 1.5.3
> Environment: hadoop version: 2.5.0-cdh5.3.2
> spark version: 1.5.3
> run mode: yarn-client
>Reporter: Tao Li
>Priority: Critical
>
> I run the SQL in yarn-client mode, but get a different result each time.
> As you can see in the example, I get a different shuffle write with the same 
> shuffle read in two jobs running the same code.
> Some of my Spark apps run well, but some always hit this problem, and I have 
> seen it on Spark 1.3, 1.4 and 1.5.
> Can you give me some suggestions about the possible causes, or about how I can 
> narrow down the problem?
> 1. First Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.8 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54934
> 2. Second Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.6 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54905



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12179) Spark SQL get different result with the same code

2015-12-16 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060217#comment-15060217
 ] 

Davies Liu edited comment on SPARK-12179 at 12/16/15 4:04 PM:
--

Which version of Spark are you using? Can you try the latest 1.5 branch or the 1.6 RC?


was (Author: davies):
Which version of Spark are you using?

> Spark SQL get different result with the same code
> -
>
> Key: SPARK-12179
> URL: https://issues.apache.org/jira/browse/SPARK-12179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 
> 1.5.2, 1.5.3
> Environment: hadoop version: 2.5.0-cdh5.3.2
> spark version: 1.5.3
> run mode: yarn-client
>Reporter: Tao Li
>Priority: Critical
>
> I run the SQL in yarn-client mode, but get a different result each time.
> As you can see in the example, I get a different shuffle write with the same 
> shuffle read in two jobs running the same code.
> Some of my Spark apps run well, but some always hit this problem, and I have 
> seen it on Spark 1.3, 1.4 and 1.5.
> Can you give me some suggestions about the possible causes, or about how I can 
> narrow down the problem?
> 1. First Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.8 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54934
> 2. Second Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.6 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54905



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11478) ML StringIndexer return inconsistent schema

2015-12-16 Thread Wojciech Jurczyk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060210#comment-15060210
 ] 

Wojciech Jurczyk commented on SPARK-11478:
--

Any progress on this, [~yanboliang]? I faced the same issue and I'm wondering 
if you're still working on this.

> ML StringIndexer return inconsistent schema
> ---
>
> Key: SPARK-11478
> URL: https://issues.apache.org/jira/browse/SPARK-11478
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> ML StringIndexer's transform and transformSchema return inconsistent schemas.
> {code}
> val data = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, 
> "a"), (5, "c")), 2)
> val df = sqlContext.createDataFrame(data).toDF("id", "label")
> val indexer = new StringIndexer()
>   .setInputCol("label")
>   .setOutputCol("labelIndex")
>   .fit(df)
> val transformed = indexer.transform(df)
> println(transformed.schema.toString())
> println(indexer.transformSchema(df.schema))
> The nullable flag of "labelIndex" has inconsistent values:
> StructType(StructField(id,IntegerType,false), 
> StructField(label,StringType,true), StructField(labelIndex,DoubleType,true))
> StructType(StructField(id,IntegerType,false), 
> StructField(label,StringType,true), StructField(labelIndex,DoubleType,false))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9042) Spark SQL incompatibility if security is enforced on the Hive warehouse

2015-12-16 Thread Charmee Patel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060219#comment-15060219
 ] 

Charmee Patel commented on SPARK-9042:
--

Yes, we can take that back to Cloudera; I will follow up on it separately.

However, I do not believe we had an HDFS access issue. We could read the
data fine, and we could see our queries write into temp directories under the
table as well. Only after all the data was written to the temp location did our
process fail, because the Hive Metastore rejected the changes to the partitions.




> Spark SQL incompatibility if security is enforced on the Hive warehouse
> ---
>
> Key: SPARK-9042
> URL: https://issues.apache.org/jira/browse/SPARK-9042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Nitin Kak
>
> Hive queries executed from Spark using HiveContext use the CLI to create the 
> query plan and then access the Hive table directories (under 
> /user/hive/warehouse/) directly. This causes an AccessControlException if Apache 
> Sentry is installed:
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kakn, access=READ_EXECUTE, 
> inode="/user/hive/warehouse/mastering.db/sample_table":hive:hive:drwxrwx--t 
> With Apache Sentry, only the "hive" user (created only for Sentry) has 
> permission to access the Hive warehouse directory. After Sentry is installed, 
> all queries are directed to HiveServer2, which changes the invoking user to 
> "hive" and then accesses the warehouse directory. However, HiveContext does 
> not execute the query through HiveServer2, which leads to the issue. Here is 
> an example of executing a Hive query through HiveContext:
> val hqlContext = new HiveContext(sc) // Create context to run Hive queries 
> val pairRDD = hqlContext.sql(hql) // where hql is the string with the Hive query 
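
For deployments like this, one commonly suggested workaround (sketched here only; it is not part of any proposed fix) is to send the query through HiveServer2 over JDBC so that Sentry authorizes it, instead of letting HiveContext read warehouse files directly. The host, port, credentials, and table name below are illustrative, and the Hive JDBC driver is assumed to be on the classpath:

{code}
import java.sql.DriverManager

// Register the HiveServer2 JDBC driver (hive-jdbc must be on the classpath).
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default", "kakn", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT COUNT(*) FROM mastering.sample_table")
while (rs.next()) println(rs.getLong(1))
rs.close(); stmt.close(); conn.close()
{code}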



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2015-12-16 Thread Tom Waterhouse (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060218#comment-15060218
 ] 

Tom Waterhouse commented on SPARK-12177:


SSL is very important for our deployment; +1 for adding it to the integration.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released, and it introduces a new consumer API that 
> is not compatible with the old one, so I added support for the new consumer 
> API. I made separate classes in the package 
> org.apache.spark.streaming.kafka.v09 with the changed API. I did not remove 
> the old classes, for backward compatibility: users will not need to change 
> their existing Spark applications when they upgrade to the new Spark version.
> Please review my changes.
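
For context, a minimal sketch of the new Kafka 0.9 consumer API that the proposed org.apache.spark.streaming.kafka.v09 classes build on (plain kafka-clients usage, not the Spark integration itself; broker address, group id, and topic name are illustrative):

{code}
import java.util.{Collections, Properties}
import scala.collection.JavaConversions._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "example-group")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

// The 0.9 consumer subscribes to topics and polls for batches of records.
val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))
val records = consumer.poll(1000)
for (record <- records) println(s"${record.key} -> ${record.value}")
consumer.close()
{code}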



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9042) Spark SQL incompatibility if security is enforced on the Hive warehouse

2015-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060223#comment-15060223
 ] 

Sean Owen commented on SPARK-9042:
--

OK, could be. I'm updating this at the request of more expert people internally, 
who might comment here. If it's a slightly different issue, we can reopen and 
alter the description if needed to narrow it down.

That said, isn't this a Sentry question rather than a Spark one? It seems like 
Sentry either blocks metastore access on purpose, or it can allow the necessary 
access, in which case there is some configuration somewhere that needs to be set 
to allow it.

> Spark SQL incompatibility if security is enforced on the Hive warehouse
> ---
>
> Key: SPARK-9042
> URL: https://issues.apache.org/jira/browse/SPARK-9042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Nitin Kak
>
> Hive queries executed from Spark using HiveContext use the CLI to create the 
> query plan and then access the Hive table directories (under 
> /user/hive/warehouse/) directly. This causes an AccessControlException if Apache 
> Sentry is installed:
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kakn, access=READ_EXECUTE, 
> inode="/user/hive/warehouse/mastering.db/sample_table":hive:hive:drwxrwx--t 
> With Apache Sentry, only the "hive" user (created only for Sentry) has 
> permission to access the Hive warehouse directory. After Sentry is installed, 
> all queries are directed to HiveServer2, which changes the invoking user to 
> "hive" and then accesses the warehouse directory. However, HiveContext does 
> not execute the query through HiveServer2, which leads to the issue. Here is 
> an example of executing a Hive query through HiveContext:
> val hqlContext = new HiveContext(sc) // Create context to run Hive queries 
> val pairRDD = hqlContext.sql(hql) // where hql is the string with the Hive query 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12370) Documentation should link to examples from its own release version

2015-12-16 Thread Brian London (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian London updated SPARK-12370:
-
Issue Type: Improvement  (was: Bug)

> Documentation should link to examples from its own release version
> --
>
> Key: SPARK-12370
> URL: https://issues.apache.org/jira/browse/SPARK-12370
> Project: Spark
>  Issue Type: Improvement
>Reporter: Brian London
>
> When documentation is built, it should reference examples from the same build. 
> There are times when the docs have links that point to files at the GitHub 
> head, which may not be valid for the current release.
> As an example, the Spark Streaming page for 1.5.2 (currently at 
> http://spark.apache.org/docs/latest/streaming-programming-guide.html) links 
> to the stateful network word count example (at 
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala).
> That example file uses a number of 1.6 features that are not available 
> in 1.5.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12370) Documentation should link to examples from its own release version

2015-12-16 Thread Brian London (JIRA)
Brian London created SPARK-12370:


 Summary: Documentation should link to examples from its own 
release version
 Key: SPARK-12370
 URL: https://issues.apache.org/jira/browse/SPARK-12370
 Project: Spark
  Issue Type: Bug
Reporter: Brian London


When documentation is built, it should reference examples from the same build. 
There are times when the docs have links that point to files at the GitHub head, 
which may not be valid for the current release.

As an example, the Spark Streaming page for 1.5.2 (currently at 
http://spark.apache.org/docs/latest/streaming-programming-guide.html) links to 
the stateful network word count example (at 
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala).
That example file uses a number of 1.6 features that are not available in 
1.5.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12345) Mesos cluster mode is broken

2015-12-16 Thread Luc Bourlier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060239#comment-15060239
 ] 

Luc Bourlier commented on SPARK-12345:
--

I tested our usual test cases with my change, and it is working well.

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12345) Mesos cluster mode is broken

2015-12-16 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060236#comment-15060236
 ] 

Iulian Dragos commented on SPARK-12345:
---

I'd prefer filtering it out on the submit side, if everything else works. Doing 
it in the scheduler would be confusing for users: the driver environment would 
still show SPARK_HOME (in the dispatcher UI), but in fact it would be filtered 
out in practice.
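
A minimal sketch of the submit-side filtering idea (purely illustrative; the value name is an assumption and this is not the actual spark-submit code):

{code}
// Illustrative only: the environment forwarded with the submitted driver simply
// omits SPARK_HOME, since that path is only meaningful on the submitting machine.
val forwardedEnv: Map[String, String] = sys.env - "SPARK_HOME"
{code}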

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env

2015-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6936.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

> SQLContext.sql() caused deadlock in multi-thread env
> 
>
> Key: SPARK-6936
> URL: https://issues.apache.org/jira/browse/SPARK-6936
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: JDK 1.8.x, RedHat
> Linux version 2.6.32-431.23.3.el6.x86_64 
> (mockbu...@x86-027.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
> Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014
>Reporter: Paul Wu
>  Labels: deadlock, sql, threading
> Fix For: 1.5.0
>
>
> Running the same query in more than one thread with SQLContext.sql may lead 
> to a deadlock. Here is a way to reproduce it (since this is a multi-threading 
> issue, the reproduction may or may not be easy):
> 1. Register a relatively big table.
> 2. Create two different classes and, in each class, run the same query in a 
> method, put the results in a set, and print out the set size.
> 3. Create two threads that each use an object of one of the classes in their 
> run method, then start the threads. In my tests, a deadlock can occur within 
> just a few runs (see the sketch below).
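
A condensed sketch of that reproduction, assuming a registered table named big_table and a shared sqlContext (both names are illustrative):

{code}
// Two threads run the same query against one shared SQLContext, mirroring steps 1-3.
def runQuery(label: String): Unit = {
  val results = sqlContext.sql("SELECT * FROM big_table").collect().toSet
  println(s"$label collected ${results.size} rows")
}

val t1 = new Thread(new Runnable { def run(): Unit = runQuery("thread-1") })
val t2 = new Thread(new Runnable { def run(): Unit = runQuery("thread-2") })
t1.start(); t2.start()
t1.join(); t2.join()
{code}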



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12368) Better doc for the binary classification evaluator' metricName

2015-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060279#comment-15060279
 ] 

Apache Spark commented on SPARK-12368:
--

User 'BenFradet' has created a pull request for this issue:
https://github.com/apache/spark/pull/10328

> Better doc for the binary classification evaluator' metricName
> --
>
> Key: SPARK-12368
> URL: https://issues.apache.org/jira/browse/SPARK-12368
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Benjamin Fradet
>Priority: Minor
>
> For the BinaryClassificationEvaluator, the scaladoc doesn't mention that 
> "areaUnderPR" is supported, only that the default is "areaUnderROC".
> Also, the documentation says:
> "The default metric used to choose the best ParamMap can be overriden by the 
> setMetric method in each of these evaluators."
> However, the method is actually called setMetricName.
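
For reference, a small usage sketch with the actual method name:

{code}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// The setter is setMetricName; "areaUnderROC" is the default and "areaUnderPR"
// is the other supported value.
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderPR")
{code}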



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12289) Support UnsafeRow in TakeOrderedAndProject/Limit

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12289:


Assignee: Apache Spark

> Support UnsafeRow in TakeOrderedAndProject/Limit
> 
>
> Key: SPARK-12289
> URL: https://issues.apache.org/jira/browse/SPARK-12289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12289) Support UnsafeRow in TakeOrderedAndProject/Limit

2015-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060287#comment-15060287
 ] 

Apache Spark commented on SPARK-12289:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/10330

> Support UnsafeRow in TakeOrderedAndProject/Limit
> 
>
> Key: SPARK-12289
> URL: https://issues.apache.org/jira/browse/SPARK-12289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12371) Make sure Dataset nullability conforms to its underlying logical plan

2015-12-16 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12371:
--

 Summary: Make sure Dataset nullability conforms to its underlying 
logical plan
 Key: SPARK-12371
 URL: https://issues.apache.org/jira/browse/SPARK-12371
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.6.0, 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian


Currently it's possible to construct a Dataset with different nullability from 
its underlying logical plan, which should be caught during the analysis phase:

{code}
val rowRDD = sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row(null)))
val schema = StructType(Seq(StructField("_1", StringType, nullable = false)))
val df = sqlContext.createDataFrame(rowRDD, schema)
df.as[Tuple1[String]].collect().foreach(println)

// Output:
//
//   (hello)
//   (null)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12371) Make sure Dataset nullability conforms to its underlying logical plan

2015-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12371:


Assignee: Apache Spark  (was: Cheng Lian)

> Make sure Dataset nullability conforms to its underlying logical plan
> -
>
> Key: SPARK-12371
> URL: https://issues.apache.org/jira/browse/SPARK-12371
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> Currently it's possible to construct a Dataset with different nullability 
> from its underlying logical plan, which should be caught during the analysis 
> phase:
> {code}
> val rowRDD = sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row(null)))
> val schema = StructType(Seq(StructField("_1", StringType, nullable = false)))
> val df = sqlContext.createDataFrame(rowRDD, schema)
> df.as[Tuple1[String]].collect().foreach(println)
> // Output:
> //
> //   (hello)
> //   (null)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12360) Support using 64-bit long type in SparkR

2015-12-16 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060381#comment-15060381
 ] 

Shivaram Venkataraman commented on SPARK-12360:
---

The lack of 64-bit integers is a limitation of R, but I'd like to understand the 
use cases where this comes up before trying a complex fix. My understanding is 
that long values from JSON / HDFS / Parquet etc. will be read correctly because 
they go through the Scala layers, and the problem only comes up when somebody 
does a collect / UDF? If so, I think the problem may not be that important, as R 
users probably wouldn't expect long types to work in the R shell.

It might also lead to another solution where we don't add a hard dependency on 
bit64, but check whether bit64 is available and, if so, avoid the truncation to 
double.

> Support using 64-bit long type in SparkR
> 
>
> Key: SPARK-12360
> URL: https://issues.apache.org/jira/browse/SPARK-12360
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> R has no support for 64-bit integers, while in the Scala/Java API some methods 
> have one or more arguments of long type. Currently we only support passing an 
> integer, cast from a numeric, to the Scala/Java side for long-type parameters 
> of such methods. This can be a problem for large data sets.
> Storing a 64-bit integer in a double obviously does not work, as some 64-bit 
> integers cannot be exactly represented in double format, so x and x+1 can't 
> be distinguished.
> There is a bit64 package 
> (https://cran.r-project.org/web/packages/bit64/index.html) on CRAN which 
> supports vectors of 64-bit integers. We can investigate whether it can be used 
> for this purpose.
> Two questions are:
> 1. Is the license acceptable?
> 2. This would make SparkR depend on a non-base third-party package, which 
> may complicate deployment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


