[jira] [Updated] (SPARK-9550) Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
[ https://issues.apache.org/jira/browse/SPARK-9550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-9550:
------------------------------
Description:

This ticket tracks configurations which need to be renamed, deprecated, or have their defaults changed for Spark 1.5.0. Note that subtasks / comments here do not necessarily reflect changes that must be performed; rather, tasks should be added here to make sure that the relevant configurations are at least checked before we cut releases. This ticket will also help us to track configuration changes which must make it into the release notes.

*Configuration renaming*
- Consider renaming {{spark.shuffle.memoryFraction}} to {{spark.execution.memoryFraction}} ([discussion|https://github.com/apache/spark/pull/7770#discussion-diff-36019144]).
- Rename all public-facing uses of {{unsafe}} to something less scary, such as {{tungsten}}.

*Defaults changes*
- Codegen is now enabled by default.
- Tungsten is now enabled by default.
- Parquet schema merging is now disabled by default.
- In-memory relation partition pruning should be enabled by default (SPARK-9554).

*Deprecation*
- Local execution has been removed.

was: the same description, with the SPARK-9554 reference on the Parquet schema merging bullet instead of the in-memory partition pruning bullet.

Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
-----------------------------------------------------------------------------------
Key: SPARK-9550
URL: https://issues.apache.org/jira/browse/SPARK-9550
Project: Spark
Issue Type: Task
Components: Spark Core, SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Priority: Blocker
[jira] [Updated] (SPARK-9550) Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
[ https://issues.apache.org/jira/browse/SPARK-9550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-9550:
------------------------------
Description: the same description as in the update above, with the SPARK-9554 reference on the Parquet schema merging bullet rather than the in-memory partition pruning bullet.

was: the same description with no SPARK-9554 reference.

Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
-----------------------------------------------------------------------------------
Key: SPARK-9550
URL: https://issues.apache.org/jira/browse/SPARK-9550
Project: Spark
Issue Type: Task
Components: Spark Core, SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Priority: Blocker
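For the rename proposed above, application code can bridge a deprecation window by reading either key. A minimal, hedged PySpark sketch (the helper {{get_with_fallback}} is illustrative, not a Spark API; the new key is the one the ticket proposes):

{code}
from pyspark import SparkConf

def get_with_fallback(conf, new_key, old_key, default):
    """Prefer new_key; fall back to the deprecated old_key with a warning."""
    if conf.contains(new_key):
        return conf.get(new_key)
    if conf.contains(old_key):
        print("WARN: '%s' is deprecated; use '%s' instead." % (old_key, new_key))
        return conf.get(old_key)
    return default

conf = SparkConf()
fraction = get_with_fallback(conf,
                             "spark.execution.memoryFraction",  # proposed name
                             "spark.shuffle.memoryFraction",    # legacy name
                             "0.2")
{code}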
[jira] [Commented] (SPARK-9550) Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
[ https://issues.apache.org/jira/browse/SPARK-9550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652001#comment-14652001 ]

Josh Rosen commented on SPARK-9550:
-----------------------------------

Memory defaults changed (will find JIRA links later): https://github.com/apache/spark/pull/7896

Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
-----------------------------------------------------------------------------------
Key: SPARK-9550
URL: https://issues.apache.org/jira/browse/SPARK-9550
Project: Spark
Issue Type: Task
Components: Spark Core, SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Priority: Blocker
[jira] [Commented] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651984#comment-14651984 ]

partha bishnu commented on SPARK-9559:
--------------------------------------

Thanks. If I understand correctly, {{--num-executors}} is for deploying on a YARN cluster and {{--total-executor-cores}} is for a Spark standalone cluster. I am using a Spark standalone cluster.

Worker redundancy/failover in spark stand-alone mode
----------------------------------------------------
Key: SPARK-9559
URL: https://issues.apache.org/jira/browse/SPARK-9559
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: partha bishnu
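For reference, the two spark-submit flags discussed above correspond to standard configuration keys that can also be set programmatically. A hedged PySpark sketch (the keys are the documented ones; the app name and values are illustrative):

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("standalone-sizing-example")
        # Standalone mode: cap the total cores across all executors
        # (what --total-executor-cores controls).
        .set("spark.cores.max", "4")
        # YARN mode equivalent of --num-executors:
        .set("spark.executor.instances", "2"))

sc = SparkContext(conf=conf)
{code}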
[jira] [Comment Edited] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651907#comment-14651907 ]

partha bishnu edited comment on SPARK-9559 at 8/3/15 2:24 PM:
--------------------------------------------------------------

The expected behavior should be that the spark master on n-1 should restart the jobs with one new executor under the running worker JVM on the other worker node n-3, which is up and running after n-2 went down. Isn't that the expected behavior? But that does not happen. Thanks for your comments.

was (Author: pa1975): the same comment with the node names swapped (worker node n-2 up and running after n-3 went down).

Worker redundancy/failover in spark stand-alone mode
----------------------------------------------------
Key: SPARK-9559
URL: https://issues.apache.org/jira/browse/SPARK-9559
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: partha bishnu
[jira] [Commented] (SPARK-9484) Word2Vec import/export for original binary format
[ https://issues.apache.org/jira/browse/SPARK-9484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651899#comment-14651899 ]

Manoj Kumar commented on SPARK-9484:
------------------------------------

I just went through the C code that does the .bin reading. What would be the best way to go about this? The code paths should be almost completely different depending on whether {{path.endsWith(".bin")}} or not, right? Also, should this use the {{SaveLoadV1_0}} object, or should we have a different object (say {{SaveLoadBinary}}) which would keep the code paths independent and make maintenance easier?

Word2Vec import/export for original binary format
-------------------------------------------------
Key: SPARK-9484
URL: https://issues.apache.org/jira/browse/SPARK-9484
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

It would be nice to add model import/export for Word2Vec which handles the original binary format used by [https://code.google.com/p/word2vec/].
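For context, the original word2vec .bin layout is a text header ("<vocabSize> <vectorSize>") followed, per entry, by a space-terminated word and then vectorSize little-endian 4-byte floats. A minimal, hedged Python sketch of a reader (not Spark's SaveLoadV1_0 code; {{load_word2vec_bin}} is an illustrative name):

{code}
import struct

def load_word2vec_bin(path):
    """Read the original word2vec binary format into {word: [float, ...]}."""
    vectors = {}
    with open(path, "rb") as f:
        # Header line: "<vocabSize> <vectorSize>\n"
        vocab_size, vector_size = map(int, f.readline().split())
        for _ in range(vocab_size):
            # Word: bytes up to the first space; skip newlines left over
            # from the previous entry.
            ch = f.read(1)
            while ch == b"\n":
                ch = f.read(1)
            chars = []
            while ch not in (b" ", b""):
                chars.append(ch)
                ch = f.read(1)
            word = b"".join(chars).decode("utf-8", errors="replace")
            # Vector: vector_size little-endian 4-byte floats.
            raw = f.read(4 * vector_size)
            vectors[word] = list(struct.unpack("<%df" % vector_size, raw))
    return vectors
{code}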
[jira] [Commented] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651907#comment-14651907 ]

partha bishnu commented on SPARK-9559:
--------------------------------------

The expected behavior should be that the spark master on n-1 should restart the jobs with one new executor under the running worker JVM on the other worker node n-2, which is up and running after n-3 went down. Isn't that the expected behavior? But that does not happen. Thanks for your comments.

Worker redundancy/failover in spark stand-alone mode
----------------------------------------------------
Key: SPARK-9559
URL: https://issues.apache.org/jira/browse/SPARK-9559
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: partha bishnu
[jira] [Commented] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651923#comment-14651923 ]

Sean Owen commented on SPARK-9559:
----------------------------------

OK, so you have requested 1 total executor. Did the job fail then? Or are you talking about the state after it completed?

Worker redundancy/failover in spark stand-alone mode
----------------------------------------------------
Key: SPARK-9559
URL: https://issues.apache.org/jira/browse/SPARK-9559
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: partha bishnu
[jira] [Commented] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651924#comment-14651924 ]

Sean Owen commented on SPARK-9559:
----------------------------------

PS: you should try reproducing this on master rather than 1.3, which is relatively old at this stage.

Worker redundancy/failover in spark stand-alone mode
----------------------------------------------------
Key: SPARK-9559
URL: https://issues.apache.org/jira/browse/SPARK-9559
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: partha bishnu
[jira] [Commented] (SPARK-9499) Possible file handle leak in spilling/sort code
[ https://issues.apache.org/jira/browse/SPARK-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651932#comment-14651932 ]

Herman van Hovell commented on SPARK-9499:
------------------------------------------

I have also tried {noformat}spark.shuffle.sort.bypassMergeThreshold=0{noformat}. It does improve on the current situation, but now crashes a bit further down the line. I'll attach another {noformat}lsof{noformat} dump.

Possible file handle leak in spilling/sort code
-----------------------------------------------
Key: SPARK-9499
URL: https://issues.apache.org/jira/browse/SPARK-9499
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Reynold Xin
Assignee: Josh Rosen
Priority: Blocker
Attachments: perf_test4.scala

As reported by [~hvanhovell]. See SPARK-8850.

Hi, I am getting a "Too many open files" error since the unsafe mode is on. The same thing popped up when playing with unsafe before. The error is below:

{noformat}
15/07/30 23:37:29 WARN TaskSetManager: Lost task 2.0 in stage 33.0 (TID 2423, localhost): java.io.FileNotFoundException: /tmp/blockmgr-b3d3e14a-f313-4075-8082-7d97f012e35a/14/temp_shuffle_1cab42fa-dcb1-4114-ae53-1674446f9dac (Too many open files)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:111)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

I am currently working in local mode (which is probably the cause of the problem) using the following command line:

{noformat}
$SPARK_HOME/bin/spark-shell --master local[*] --driver-memory 14G --driver-library-path $HADOOP_NATIVE_LIB
{noformat}

The maximum number of files I can open is 1024 ({{ulimit -n}}). I have tried to run the same code with an increased limit, but this didn't work out.

Dump of all open files after a "Too many open files" error. The command used to make the dump:

{code}
lsof -c java open
{code}

The job starts crashing as soon as I start sorting 1000 rows for the 9th time (doing benchmarking). I guess files are left open after every benchmark? Is there a way to trigger the closing of files?
[jira] [Comment Edited] (SPARK-9499) Possible file handle leak in spilling/sort code
[ https://issues.apache.org/jira/browse/SPARK-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651932#comment-14651932 ]

Herman van Hovell edited comment on SPARK-9499 at 8/3/15 2:46 PM:
------------------------------------------------------------------

I have also tried {{spark.shuffle.sort.bypassMergeThreshold=0}}. It does improve on the current situation, but now crashes a bit further down the line. I'll attach another {{lsof}} dump.

was (Author: hvanhovell): the same comment, using {noformat} markup instead of {{ }}.

Possible file handle leak in spilling/sort code
-----------------------------------------------
Key: SPARK-9499
URL: https://issues.apache.org/jira/browse/SPARK-9499
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Reynold Xin
Assignee: Josh Rosen
Priority: Blocker
Attachments: perf_test4.scala
[jira] [Updated] (SPARK-9499) Possible file handle leak in spilling/sort code
[ https://issues.apache.org/jira/browse/SPARK-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell updated SPARK-9499:
-------------------------------------
Attachment: open.files.II.txt

{{lsof}} dump with the {{spark.shuffle.sort.bypassMergeThreshold=0}} setting.

Possible file handle leak in spilling/sort code
-----------------------------------------------
Key: SPARK-9499
URL: https://issues.apache.org/jira/browse/SPARK-9499
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Reynold Xin
Assignee: Josh Rosen
Priority: Blocker
Attachments: open.files.II.txt, perf_test4.scala
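For anyone reproducing this, the setting tested in the comments above can be applied when constructing the context. A minimal PySpark sketch (the key and value are exactly those from the comment; everything else is illustrative):

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[*]")
        # Force the sort-based shuffle path instead of bypass-merge,
        # as tried in the comment above.
        .set("spark.shuffle.sort.bypassMergeThreshold", "0"))

sc = SparkContext(conf=conf)
{code}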
[jira] [Created] (SPARK-9560) Add LDA data generator
yuhao yang created SPARK-9560:
------------------------------
Summary: Add LDA data generator
Key: SPARK-9560
URL: https://issues.apache.org/jira/browse/SPARK-9560
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: yuhao yang

Add a data generator for LDA. Hopefully it can help with performance improvements.
[jira] [Assigned] (SPARK-9560) Add LDA data generator
[ https://issues.apache.org/jira/browse/SPARK-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9560:
-----------------------------------
Assignee: Apache Spark

Add LDA data generator
----------------------
Key: SPARK-9560
URL: https://issues.apache.org/jira/browse/SPARK-9560
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: yuhao yang
Assignee: Apache Spark

Add a data generator for LDA. Hopefully it can help with performance improvements.
[jira] [Assigned] (SPARK-9560) Add LDA data generator
[ https://issues.apache.org/jira/browse/SPARK-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9560:
-----------------------------------
Assignee: (was: Apache Spark)

Add LDA data generator
----------------------
Key: SPARK-9560
URL: https://issues.apache.org/jira/browse/SPARK-9560
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: yuhao yang

Add a data generator for LDA. Hopefully it can help with performance improvements.
[jira] [Commented] (SPARK-9560) Add LDA data generator
[ https://issues.apache.org/jira/browse/SPARK-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652061#comment-14652061 ]

Apache Spark commented on SPARK-9560:
-------------------------------------

User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/7898

Add LDA data generator
----------------------
Key: SPARK-9560
URL: https://issues.apache.org/jira/browse/SPARK-9560
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: yuhao yang

Add a data generator for LDA. Hopefully it can help with performance improvements.
[jira] [Commented] (SPARK-9512) RemoveEvaluationFromSort reorders sort order
[ https://issues.apache.org/jira/browse/SPARK-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652620#comment-14652620 ]

Apache Spark commented on SPARK-9512:
-------------------------------------

User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/7906

RemoveEvaluationFromSort reorders sort order
--------------------------------------------
Key: SPARK-9512
URL: https://issues.apache.org/jira/browse/SPARK-9512
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Blocker

Please refer to the comment in https://github.com/apache/spark/pull/7593 for details.
[jira] [Assigned] (SPARK-925) Allow ec2 scripts to load default options from a json file
[ https://issues.apache.org/jira/browse/SPARK-925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-925:
----------------------------------
Assignee: Apache Spark

Allow ec2 scripts to load default options from a json file
-----------------------------------------------------------
Key: SPARK-925
URL: https://issues.apache.org/jira/browse/SPARK-925
Project: Spark
Issue Type: Improvement
Components: EC2
Affects Versions: 0.8.0
Reporter: Shay Seng
Assignee: Apache Spark
Priority: Minor

The option list for the ec2 script can be a little irritating to type in, especially things like the path to the identity file, region, zone, AMI, etc. It would be nice if the ec2 script looked for an options.json file in the following order: (1) CWD, (2) ~/spark-ec2, (3) same dir as spark_ec2.py.

Something like:

{code}
import json
import os
import stat
import sys

def get_defaults_from_options():
    # Check to see if an options.json file exists; if so, load it.
    # However, values in the options.json file can only override values in opts
    # if the opt values are None or '' (i.e. command-line options take precedence).
    defaults = {'aws-access-key-id': '', 'aws-secret-access-key': '', 'key-pair': '',
                'identity-file': '', 'region': 'ap-southeast-1', 'zone': '',
                'ami': '', 'slaves': 1, 'instance-type': 'm1.large'}

    # Look for options.json in the directory the cluster was called from.
    # Had to modify the spark_ec2 wrapper script since it mangles the pwd.
    startwd = os.environ['STARTWD']
    if os.path.exists(os.path.join(startwd, "options.json")):
        optionspath = os.path.join(startwd, "options.json")
    else:
        optionspath = os.path.join(os.getcwd(), "options.json")

    try:
        print "Loading options file:", optionspath
        with open(optionspath) as json_data:
            jdata = json.load(json_data)
        for k in jdata:
            defaults[k] = jdata[k]
    except IOError:
        print "Warning: options.json file not loaded"

    # Check permissions on identity-file if defined; otherwise the launch will
    # fail late, which is irritating.
    if defaults['identity-file'] != '':
        st = os.stat(defaults['identity-file'])
        user_can_read = bool(st.st_mode & stat.S_IRUSR)
        grp_perms = bool(st.st_mode & stat.S_IRWXG)
        others_perm = bool(st.st_mode & stat.S_IRWXO)
        if not user_can_read:
            print "No read permission to read", defaults['identity-file']
            sys.exit(1)
        if grp_perms or others_perm:
            print "Permissions are too open, please chmod 600 file", defaults['identity-file']
            sys.exit(1)

    # If defaults contain an AWS access id or secret key, set them in the
    # environment; required for use with boto to access the AWS console.
    if defaults['aws-access-key-id'] != '':
        os.environ['AWS_ACCESS_KEY_ID'] = defaults['aws-access-key-id']
    if defaults['aws-secret-access-key'] != '':
        os.environ['AWS_SECRET_ACCESS_KEY'] = defaults['aws-secret-access-key']
    return defaults
{code}
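A matching options.json for the sketch above might look like this (all values illustrative; the keys mirror the defaults dict):

{code}
{
  "key-pair": "my-keypair",
  "identity-file": "/home/me/.ssh/my-keypair.pem",
  "region": "ap-southeast-1",
  "slaves": 2,
  "instance-type": "m1.large"
}
{code}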
[jira] [Commented] (SPARK-925) Allow ec2 scripts to load default options from a json file
[ https://issues.apache.org/jira/browse/SPARK-925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652622#comment-14652622 ]

Apache Spark commented on SPARK-925:
------------------------------------

User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/7906

Allow ec2 scripts to load default options from a json file
-----------------------------------------------------------
Key: SPARK-925
URL: https://issues.apache.org/jira/browse/SPARK-925
Project: Spark
Issue Type: Improvement
Components: EC2
Affects Versions: 0.8.0
Reporter: Shay Seng
Priority: Minor
[jira] [Updated] (SPARK-7165) Sort Merge Join for outer joins
[ https://issues.apache.org/jira/browse/SPARK-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7165:
-------------------------------
Sprint: Week 32

Sort Merge Join for outer joins
-------------------------------
Key: SPARK-7165
URL: https://issues.apache.org/jira/browse/SPARK-7165
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Adrian Wang
Assignee: Reynold Xin
Priority: Blocker
[jira] [Updated] (SPARK-7165) Sort Merge Join for outer joins
[ https://issues.apache.org/jira/browse/SPARK-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7165:
-------------------------------
Assignee: Josh Rosen (was: Reynold Xin)

Sort Merge Join for outer joins
-------------------------------
Key: SPARK-7165
URL: https://issues.apache.org/jira/browse/SPARK-7165
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Adrian Wang
Assignee: Josh Rosen
Priority: Blocker
[jira] [Updated] (SPARK-7799) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-7799:
---------------------------------
Target Version/s: 1.6.0 (was: 1.5.0)

Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext
---------------------------------------------------------------------------------------------
Key: SPARK-7799
URL: https://issues.apache.org/jira/browse/SPARK-7799
Project: Spark
Issue Type: Sub-task
Components: Streaming
Reporter: Shixiong Zhu

Move {{StreamingContext.actorStream}} to a separate project and deprecate it in {{StreamingContext}}.
[jira] [Updated] (SPARK-4246) Add testsuite with end-to-end testing of driver failure
[ https://issues.apache.org/jira/browse/SPARK-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-4246:
---------------------------------
Target Version/s: (was: 1.5.0)

Add testsuite with end-to-end testing of driver failure
--------------------------------------------------------
Key: SPARK-4246
URL: https://issues.apache.org/jira/browse/SPARK-4246
Project: Spark
Issue Type: Sub-task
Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical
[jira] [Commented] (SPARK-9131) Python UDFs change data values
[ https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652694#comment-14652694 ]

Davies Liu commented on SPARK-9131:
-----------------------------------

I think this may be fixed by https://github.com/apache/spark/pull/7131. [~luispeguerra] Could you help confirm whether it's fixed in master or not?

Python UDFs change data values
------------------------------
Key: SPARK-9131
URL: https://issues.apache.org/jira/browse/SPARK-9131
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 1.4.0, 1.4.1
Environment: Pyspark 1.4 and 1.4.1
Reporter: Luis Guerra
Assignee: Davies Liu
Priority: Blocker
Attachments: testjson_jira9131.z01, testjson_jira9131.z02, testjson_jira9131.z03, testjson_jira9131.z04, testjson_jira9131.z05, testjson_jira9131.z06, testjson_jira9131.zip

I am having some trouble when using a custom udf in dataframes with pyspark 1.4. I have rewritten the udf to simplify the problem, and it gets even weirder. The udfs I am using do absolutely nothing: they just receive some value and output the same value with the same format. I show you my code below:

{code}
c = a.join(b, a['ID'] == b['ID_new'], 'inner')
c.filter(c['ID'] == '62698917').show()

udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())

d = c.select(c['ID'], c['t1'].alias('ta'),
             udf_A(c['t2']).alias('tb'),
             udf_B(c['t1']).alias('tc'),
             udf_C(c['t2']).alias('td'))
d.filter(d['ID'] == '62698917').show()
{code}

I am showing here the results from the outputs:

{code}
+--------+--------+----------+----------+
|      ID|  ID_new|        t1|        t2|
+--------+--------+----------+----------+
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
+--------+--------+----------+----------+

+--------+----------+----------+----------+----------+
|      ID|        ta|        tb|        tc|        td|
+--------+----------+----------+----------+----------+
|62698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
|62698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
|62698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
|62698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
|62698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
+--------+----------+----------+----------+----------+
{code}

The problem here is that values in columns 'tb', 'tc' and 'td' in dataframe 'd' are completely different from values 't1' and 't2' in dataframe 'c', even though my udfs do nothing. It seems as if values were somehow taken from other registers (or just invented). Results are different between executions (apparently random).

Thanks in advance
[jira] [Updated] (SPARK-7441) Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches
[ https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-7441:
---------------------------------
Target Version/s: 1.6.0 (was: 1.5.0)

Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches
------------------------------------------------------------------------------------------------------------------------------------------------
Key: SPARK-7441
URL: https://issues.apache.org/jira/browse/SPARK-7441
Project: Spark
Issue Type: Improvement
Components: Streaming
Reporter: Emre Sevinç
Labels: performance

Implement microbatch functionality so that Spark Streaming can process a huge backlog of existing files, discovered in batch, in smaller batches.

Spark Streaming can process already existing files in a directory, and depending on the value of {{spark.streaming.minRememberDuration}} (60 seconds by default; see SPARK-3276 for more details), this might mean that a Spark Streaming application can receive thousands, or hundreds of thousands, of files within the first batch interval. This, in turn, leads to something like a 'flooding' effect for the streaming application, which tries to deal with a huge number of existing files in a single batch interval.

We will propose a very simple change to {{org.apache.spark.streaming.dstream.FileInputDStream}} so that, based on a configuration property such as {{spark.streaming.microbatch.size}}, it will either keep its default behavior when {{spark.streaming.microbatch.size}} has the default value of {{0}} (meaning: as many files as have been discovered as new in the current batch interval), or process new files in groups of {{spark.streaming.microbatch.size}} (e.g. in groups of 100s).

We have tested this patch at one of our customers, and it has been running successfully for weeks (e.g. there were cases where our Spark Streaming application was stopped, and in the meantime tens of thousands of files were created in a directory, and our Spark Streaming application had to process those existing files after it was started).
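The core of the proposed selection logic fits in a few lines; a hedged, standalone Python illustration (not the actual FileInputDStream patch; the config key is the one the ticket proposes):

{code}
def files_for_this_batch(new_files, microbatch_size=0):
    """Cap how many newly discovered files enter a single batch interval.

    microbatch_size == 0 keeps today's behavior: every discovered file is
    processed in the current interval. A positive value processes the backlog
    in groups, leaving the remainder for later intervals.
    """
    if microbatch_size <= 0:
        return new_files, []
    return new_files[:microbatch_size], new_files[microbatch_size:]

# Example: a backlog of 250 files with spark.streaming.microbatch.size=100
batch, remaining = files_for_this_batch(["f%d" % i for i in range(250)], 100)
assert len(batch) == 100 and len(remaining) == 150
{code}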
[jira] [Updated] (SPARK-6116) DataFrame API improvement umbrella ticket (Spark 1.5)
[ https://issues.apache.org/jira/browse/SPARK-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6116:
-------------------------------
Target Version/s: 1.5.0 (was: 1.6.0)

DataFrame API improvement umbrella ticket (Spark 1.5)
-----------------------------------------------------
Key: SPARK-6116
URL: https://issues.apache.org/jira/browse/SPARK-6116
Project: Spark
Issue Type: Umbrella
Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
Labels: DataFrame

An umbrella ticket to track improvements and changes needed to make the DataFrame API non-experimental.
[jira] [Updated] (SPARK-6116) DataFrame API improvement umbrella ticket (Spark 1.5)
[ https://issues.apache.org/jira/browse/SPARK-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6116:
-------------------------------
Priority: Critical (was: Blocker)

DataFrame API improvement umbrella ticket (Spark 1.5)
-----------------------------------------------------
Key: SPARK-6116
URL: https://issues.apache.org/jira/browse/SPARK-6116
Project: Spark
Issue Type: Umbrella
Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
Labels: DataFrame

An umbrella ticket to track improvements and changes needed to make the DataFrame API non-experimental.
[jira] [Updated] (SPARK-9572) Add StreamingContext.getActiveOrCreate() to python API
[ https://issues.apache.org/jira/browse/SPARK-9572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-9572:
---------------------------------
Target Version/s: 1.4.2, 1.5.0 (was: 1.5.0)

Add StreamingContext.getActiveOrCreate() to python API
------------------------------------------------------
Key: SPARK-9572
URL: https://issues.apache.org/jira/browse/SPARK-9572
Project: Spark
Issue Type: Improvement
Components: PySpark, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
[jira] [Created] (SPARK-9579) Improve Word2Vec unit tests
Joseph K. Bradley created SPARK-9579:
-------------------------------------
Summary: Improve Word2Vec unit tests
Key: SPARK-9579
URL: https://issues.apache.org/jira/browse/SPARK-9579
Project: Spark
Issue Type: Test
Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

Word2Vec unit tests should be improved in a few ways:
* Test individual components of the algorithm. This may mean breaking the code into smaller methods which can be tested individually.
* Test vs. another library, if possible. Following the example of the unit tests for LogisticRegression, create robust unit tests making sure the two implementations produce similar results. This may be too hard to do robustly (and deterministically); in that case, the first improvement will suffice.
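On the "test vs. another library" point: comparisons across Word2Vec implementations are usually made on derived quantities rather than raw vectors, since random initialization makes raw coordinates incomparable. A small, hedged Python helper of the kind such a test might use ({{top_k_by}} in the comment is a hypothetical helper, not an existing API):

{code}
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# e.g., check that each implementation's nearest neighbors for a probe word
# overlap, rather than asserting vector equality:
# assert "queen" in top_k_by(cosine, spark_vecs, "king", k=10)
{code}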
[jira] [Updated] (SPARK-9323) DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns
[ https://issues.apache.org/jira/browse/SPARK-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9323:
-------------------------------
Target Version/s: 1.6.0 (was: 1.5.0)

DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns
----------------------------------------------------------------------------------------
Key: SPARK-9323
URL: https://issues.apache.org/jira/browse/SPARK-9323
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Josh Rosen

The following two queries should be equivalent, but the second crashes:

{code}
sqlContext.read.json(sqlContext.sparkContext.makeRDD(
    """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil))
  .registerTempTable("nestedOrder")
checkAnswer(sql("SELECT a.b FROM nestedOrder ORDER BY a.b"), Row(1))
checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("a.b"), Row(1))
{code}

Here's the stacktrace:

{code}
Cannot resolve column name "a.b" among (b);
org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.b" among (b);
	at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
	at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
	at org.apache.spark.sql.DataFrame.col(DataFrame.scala:651)
	at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:640)
	at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
	at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at org.apache.spark.sql.DataFrame.sort(DataFrame.scala:593)
	at org.apache.spark.sql.DataFrame.orderBy(DataFrame.scala:624)
	at org.apache.spark.sql.SQLQuerySuite$$anonfun$96.apply$mcV$sp(SQLQuerySuite.scala:1389)
{code}

Per [~marmbrus], the problem may be that {{DataFrame.resolve}} calls {{resolveQuoted}}, causing the nested field to be treated as a single field named {{a.b}}.

UPDATE: here's a shorter one-liner reproduction:

{code}
val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
checkAnswer(df.select("a.b").filter("a.b = a.b"), Row(1))
{code}
[jira] [Updated] (SPARK-7659) Sort by attributes that are not present in the SELECT clause when there is a window function gives an analysis error
[ https://issues.apache.org/jira/browse/SPARK-7659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7659:
-------------------------------
Target Version/s: 1.6.0 (was: 1.5.0)

Sort by attributes that are not present in the SELECT clause when there is a window function gives an analysis error
---------------------------------------------------------------------------------------------------------------------
Key: SPARK-7659
URL: https://issues.apache.org/jira/browse/SPARK-7659
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang

The following SQL gets an analysis error: select month, sum(product) over (partition by month) from windowData order by area
[jira] [Assigned] (SPARK-7821) Hide private SQL JDBC classes from Javadoc
[ https://issues.apache.org/jira/browse/SPARK-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin reassigned SPARK-7821:
----------------------------------
Assignee: Reynold Xin

Hide private SQL JDBC classes from Javadoc
------------------------------------------
Key: SPARK-7821
URL: https://issues.apache.org/jira/browse/SPARK-7821
Project: Spark
Issue Type: Improvement
Components: Documentation, SQL
Reporter: Josh Rosen
Assignee: Reynold Xin

We should hide {{private\[sql\]}} JDBC classes from the generated Javadoc, since showing these internal classes can be confusing to users. This is especially important for the SQL {{jdbc}} package because it contains an internal JDBCRDD class which is easily confused with the public JdbcRDD class in Spark Core (see SPARK-7804 for an example of this).
[jira] [Resolved] (SPARK-9263) Add Spark Submit flag to exclude dependencies when using --packages
[ https://issues.apache.org/jira/browse/SPARK-9263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-9263.
-----------------------------------
Resolution: Fixed
Assignee: Burak Yavuz
Fix Version/s: 1.5.0

Add Spark Submit flag to exclude dependencies when using --packages
-------------------------------------------------------------------
Key: SPARK-9263
URL: https://issues.apache.org/jira/browse/SPARK-9263
Project: Spark
Issue Type: New Feature
Components: Spark Submit
Reporter: Burak Yavuz
Assignee: Burak Yavuz
Fix For: 1.5.0

While the functionality is there to exclude packages, there are no flags that allow users to exclude dependencies in case of dependency conflicts. We should provide users with a flag to add dependency exclusions in case the packages are not resolved properly (or not available due to licensing).
[jira] [Assigned] (SPARK-9583) build/mvn script should not print debug messages to stdout
[ https://issues.apache.org/jira/browse/SPARK-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9583:
-----------------------------------
Assignee: Apache Spark

build/mvn script should not print debug messages to stdout
-----------------------------------------------------------
Key: SPARK-9583
URL: https://issues.apache.org/jira/browse/SPARK-9583
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin
Assignee: Apache Spark
Priority: Minor

Doing that means it cannot be used to run {{make-distribution.sh}}, which parses the stdout of maven commands.
[jira] [Assigned] (SPARK-9583) build/mvn script should not print debug messages to stdout
[ https://issues.apache.org/jira/browse/SPARK-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9583:
-----------------------------------
Assignee: (was: Apache Spark)

build/mvn script should not print debug messages to stdout
-----------------------------------------------------------
Key: SPARK-9583
URL: https://issues.apache.org/jira/browse/SPARK-9583
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin
Priority: Minor

Doing that means it cannot be used to run {{make-distribution.sh}}, which parses the stdout of maven commands.
[jira] [Updated] (SPARK-7542) Support off-heap sort buffer in UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7542:
-------------------------------
Issue Type: New Feature (was: Sub-task)
Parent: (was: SPARK-9457)

Support off-heap sort buffer in UnsafeExternalSorter
----------------------------------------------------
Key: SPARK-7542
URL: https://issues.apache.org/jira/browse/SPARK-7542
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen

{{UnsafeExternalSorter}}, introduced in SPARK-7081, uses on-heap {{long[]}} arrays as its sort buffers. When records are small, the sorting array might be as large as the data pages, so it would be useful to be able to allocate this array off-heap (using our unsafe LongArray). Unfortunately, we can't currently do this because TimSort calls {{allocate()}} to create data buffers but doesn't call any corresponding cleanup methods to free them. We should look into extending TimSort with buffer-freeing methods, then consider switching to LongArray in UnsafeShuffleSortDataFormat.
[jira] [Commented] (SPARK-9583) build/mvn script should not print debug messages to stdout
[ https://issues.apache.org/jira/browse/SPARK-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652880#comment-14652880 ]

Apache Spark commented on SPARK-9583:
-------------------------------------

User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/7915

build/mvn script should not print debug messages to stdout
-----------------------------------------------------------
Key: SPARK-9583
URL: https://issues.apache.org/jira/browse/SPARK-9583
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin
Priority: Minor

Doing that means it cannot be used to run {{make-distribution.sh}}, which parses the stdout of maven commands.
[jira] [Resolved] (SPARK-9457) Sorting improvements
[ https://issues.apache.org/jira/browse/SPARK-9457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-9457.
--------------------------------
Resolution: Fixed
Assignee: Reynold Xin
Fix Version/s: 1.5.0

Sorting improvements
--------------------
Key: SPARK-9457
URL: https://issues.apache.org/jira/browse/SPARK-9457
Project: Spark
Issue Type: Umbrella
Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Fix For: 1.5.0

An umbrella ticket to improve sorting in Tungsten.
[jira] [Assigned] (SPARK-9585) HiveHBaseTableInputFormat can't be cached
[ https://issues.apache.org/jira/browse/SPARK-9585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9585:
-----------------------------------
Assignee: Apache Spark

HiveHBaseTableInputFormat can't be cached
-----------------------------------------
Key: SPARK-9585
URL: https://issues.apache.org/jira/browse/SPARK-9585
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: meiyoula
Assignee: Apache Spark

The exception below occurs in the Spark-on-HBase function:

{quote}
java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@11c6577 rejected from java.util.concurrent.ThreadPoolExecutor@3414350b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17451]
{quote}

When an executor has many cores, tasks belonging to the same RDD will use the same InputFormat. But HiveHBaseTableInputFormat is not thread-safe, so I think we should add a config to control whether the InputFormat is cached.
[jira] [Assigned] (SPARK-9585) HiveHBaseTableInputFormat can't be cached
[ https://issues.apache.org/jira/browse/SPARK-9585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9585: --- Assignee: (was: Apache Spark) HiveHBaseTableInputFormat can't be cached --- Key: SPARK-9585 URL: https://issues.apache.org/jira/browse/SPARK-9585 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula The exception below occurs in the Spark on HBase function. {quote} java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@11c6577 rejected from java.util.concurrent.ThreadPoolExecutor@3414350b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17451] {quote} When an executor has many cores, tasks belonging to the same RDD will use the same InputFormat, but HiveHBaseTableInputFormat is not thread-safe. So I think we should add a config that controls whether the InputFormat is cached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
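As a sketch of the suggested config (all names here are assumptions, including the flag itself), the record-reader path could bypass the JVM-wide InputFormat cache when caching is disabled, so each task gets its own instance of a non-thread-safe format:
{code}
import org.apache.hadoop.mapred.{InputFormat, JobConf}
import org.apache.hadoop.util.ReflectionUtils
import scala.collection.mutable

// Hypothetical helper: honor a flag such as "spark.hadoop.cacheInputFormat"
// (assumed name) instead of always reusing one shared InputFormat instance.
object InputFormats {
  private val cache = mutable.Map.empty[Class[_], InputFormat[_, _]]

  def get(conf: JobConf, cls: Class[_ <: InputFormat[_, _]]): InputFormat[_, _] = {
    val useCache = conf.getBoolean("spark.hadoop.cacheInputFormat", true)
    if (useCache) {
      cache.synchronized {
        cache.getOrElseUpdate(cls, ReflectionUtils.newInstance(cls, conf))
      }
    } else {
      // Fresh instance per caller: safe for formats like HiveHBaseTableInputFormat.
      ReflectionUtils.newInstance(cls, conf)
    }
  }
}
{code}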
[jira] [Resolved] (SPARK-8064) Upgrade Hive to 1.2
[ https://issues.apache.org/jira/browse/SPARK-8064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-8064. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7191 [https://github.com/apache/spark/pull/7191] Upgrade Hive to 1.2 --- Key: SPARK-8064 URL: https://issues.apache.org/jira/browse/SPARK-8064 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Steve Loughran Priority: Blocker Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9516) Improve Thread Dump page
[ https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652701#comment-14652701 ] Apache Spark commented on SPARK-9516: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/7910 Improve Thread Dump page Key: SPARK-9516 URL: https://issues.apache.org/jira/browse/SPARK-9516 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Nan Zhu Originally proposed by [~irashid] in https://github.com/apache/spark/pull/7808#issuecomment-126788335: we can enhance the current thread dump page with at least the following two new features: 1) sort threads by thread status, 2) a filter to grep the threads -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2870: --- Parent Issue: SPARK-9576 (was: SPARK-6116) Thorough schema inference directly on RDDs of Python dictionaries - Key: SPARK-2870 URL: https://issues.apache.org/jira/browse/SPARK-2870 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Nicholas Chammas h4. Background I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. They process JSON text directly and infer a schema that covers the entire source data set. This is very important with semi-structured data like JSON since individual elements in the data set are free to have different structures. Matching fields across elements may even have different value types. For example:
{code}
{"a": 5}
{"a": "cow"}
{code}
To get a queryable schema that covers the whole data set, you need to infer a schema by looking at the whole data set. The aforementioned {{SQLContext.json...()}} methods do this very well. h4. Feature Request What we need is for {{SQLContext.inferSchema()}} to do this, too. Alternatively, we need a new {{SQLContext}} method that works on RDDs of Python dictionaries and does something functionally equivalent to this:
{code}
SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
{code}
As of 1.0.2, [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD is variable. h4. Example Use Case * You have some JSON text data that you want to analyze using Spark SQL. * You would use one of the {{SQLContext.json...()}} methods, but you need to do some filtering on the data first to remove bad elements--basically, some minimal schema validation. * You deserialize the JSON objects to Python {{dict}}s and filter out the bad ones. You now have an RDD of dictionaries. * From this RDD, you want a SchemaRDD that captures the schema for the whole data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
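Conceptually the request is a fold over per-element schemas rather than a peek at the first element. Sketched generically in Scala (the helper and its signature are assumptions, not an existing API):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.StructType

// Illustrative: infer a schema for every element, then merge pairwise with a
// caller-supplied widening rule (e.g. IntegerType vs. StringType widens to
// StringType), so the result covers the whole data set.
def inferFullSchema[T](rdd: RDD[T])(inferOne: T => StructType)
                      (merge: (StructType, StructType) => StructType): StructType =
  rdd.map(inferOne).reduce(merge)
{code}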
[jira] [Updated] (SPARK-9392) Dataframe drop should work on unresolved columns
[ https://issues.apache.org/jira/browse/SPARK-9392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9392: --- Parent Issue: SPARK-9576 (was: SPARK-6116) Dataframe drop should work on unresolved columns Key: SPARK-9392 URL: https://issues.apache.org/jira/browse/SPARK-9392 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Critical i.e. I would expect `df.drop($"colName")` to work. Another example is the test case here: https://github.com/apache/spark/pull/6585/files#diff-5d2ebf4e9ca5a990136b276859769289R355 which I would expect to not be a no-op. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8000) SQLContext.read.load() should be able to auto-detect input data
[ https://issues.apache.org/jira/browse/SPARK-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8000: --- Parent Issue: SPARK-9576 (was: SPARK-6116) SQLContext.read.load() should be able to auto-detect input data --- Key: SPARK-8000 URL: https://issues.apache.org/jira/browse/SPARK-8000 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin If it is a Parquet file, use Parquet. If it is a JSON file, use JSON. If it is an ORC file, use ORC. If it is a CSV file, use CSV. Maybe Spark SQL can also write an output metadata file to specify the schema and data source that's used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
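A rough sketch of the idea, assuming a simple extension-based heuristic (the helper below is illustrative, not Spark's implementation):
{code}
// Illustrative only: map a file extension to a data source name.
def detectFormat(path: String): String =
  path.substring(path.lastIndexOf('.') + 1).toLowerCase match {
    case "parquet" => "parquet"
    case "json"    => "json"
    case "orc"     => "orc"
    case "csv"     => "csv"
    case other     => sys.error(s"Cannot auto-detect data source for extension .$other")
  }

// e.g. sqlContext.read.format(detectFormat("events.json")).load("events.json")
{code}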
[jira] [Commented] (SPARK-7160) Support converting DataFrames to typed RDDs.
[ https://issues.apache.org/jira/browse/SPARK-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652851#comment-14652851 ] Michael Armbrust commented on SPARK-7160: - I spent about an hour trying to fix conflicts and get the tests to pass, but unfortunately I think this is going to miss the release as a lot of stuff has changed now that we are using {{InternalRow}}. This would be a really good feature to have, so we should sync up around the beginning of 1.6 if you have time to update it, [~rayortigas], and we can make sure to merge it quickly so conflicts don't accumulate again. Support converting DataFrames to typed RDDs. Key: SPARK-7160 URL: https://issues.apache.org/jira/browse/SPARK-7160 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.1 Reporter: Ray Ortigas Assignee: Ray Ortigas Priority: Critical As a Spark user still working with RDDs, I'd like the ability to convert a DataFrame to a typed RDD. For example, if I've converted RDDs to DataFrames so that I could save them as Parquet or CSV files, I would like to rebuild the RDD from those files automatically rather than writing the row-to-type conversion myself.
{code}
val rdd0 = sc.parallelize(Seq(Food("apple", 1), Food("banana", 2), Food("cherry", 3)))
val df0 = rdd0.toDF()
df0.save("foods.parquet")
val df1 = sqlContext.load("foods.parquet")
val rdd1 = df1.toTypedRDD[Food]()
// rdd0 and rdd1 should have the same elements
{code}
I originally submitted a smaller PR for spark-csv https://github.com/databricks/spark-csv/pull/52, but Reynold Xin suggested that converting a DataFrame to a typed RDD wasn't something specific to spark-csv. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
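For reference, the manual row-to-type conversion that {{toTypedRDD}} would automate looks roughly like this today (a sketch; the {{Food}} field names below are assumed, not given in the example above):
{code}
case class Food(name: String, count: Int)

// Hand-written equivalent of df1.toTypedRDD[Food]():
val rdd1 = df1.rdd.map { row =>
  Food(row.getAs[String]("name"), row.getAs[Int]("count"))
}
{code}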
[jira] [Created] (SPARK-9580) Refactor TestSQLContext to make it non-singleton
Andrew Or created SPARK-9580: Summary: Refactor TestSQLContext to make it non-singleton Key: SPARK-9580 URL: https://issues.apache.org/jira/browse/SPARK-9580 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Because the TestSQLContext is a singleton object, there is literally no way to start a SparkContext in the SQL tests since we disallow multiple SparkContexts in the same JVM. Starting a custom SparkContext is useful when we want to run Spark in local-cluster mode or enable the UI, which is normally disabled. This is a blocker for 1.5 because we currently have tests entirely commented out due to this limitation. https://github.com/apache/spark/blob/7abaaad5b169520fbf7299808b2bafde089a16a2/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
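One possible shape for the refactoring (a sketch under assumed names, not the actual patch): make the test context an ordinary class and wrap custom-SparkContext tests in a loan-pattern helper so the context is always stopped.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical: a per-suite context instead of a singleton object.
class LocalSQLContext(sc: SparkContext) extends SQLContext(sc)

def withSQLContext(conf: SparkConf)(body: SQLContext => Unit): Unit = {
  val sc = new SparkContext(conf)
  try body(new LocalSQLContext(sc)) finally sc.stop()
}

// Usage, e.g. for local-cluster mode:
// withSQLContext(new SparkConf().setMaster("local-cluster[2,1,1024]").setAppName("test")) { ctx => ... }
{code}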
[jira] [Created] (SPARK-9582) Improve clarity of LocalLDAModel log likelihood methods
Joseph K. Bradley created SPARK-9582: Summary: Improve clarity of LocalLDAModel log likelihood methods Key: SPARK-9582 URL: https://issues.apache.org/jira/browse/SPARK-9582 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley LocalLDAModel.logLikelihood resembles that for gensim, but it is not analogous to DistributedLDAModel.likelihood. The former includes the log likelihood of the inferred topics, but the latter does not. This JIRA is for refactoring the former to separate out the log likelihood of the inferred topics. CC: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
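For context on the distinction (this is the standard LDA decomposition, not taken from the ticket): with inferred topics \(\beta\), document-topic prior \(\alpha\), and topic prior \(\eta\), the joint log likelihood splits into a data term and a topic term, and the two methods differ in whether the second term is included.
{code}
\log p(\text{docs}, \beta \mid \alpha, \eta)
  = \underbrace{\log p(\text{docs} \mid \beta, \alpha)}_{\text{data term}}
  + \underbrace{\log p(\beta \mid \eta)}_{\text{log likelihood of the inferred topics}}
{code}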
[jira] [Reopened] (SPARK-9372) For a join operator, rows with null equal join key expression can be filtered out early
[ https://issues.apache.org/jira/browse/SPARK-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-9372: I reverted the merged patch since it had a few problems. For a join operator, rows with null equal join key expression can be filtered out early --- Key: SPARK-9372 URL: https://issues.apache.org/jira/browse/SPARK-9372 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Taking {{select ... from A join B on (A.key = B.key)}} as an example, we can filter out rows that have null values for column A.key/B.key, because those rows cannot contribute to the join output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
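In DataFrame terms the optimization amounts to the following (a sketch of the effect, not the optimizer rule itself; {{dfA}} and {{dfB}} stand in for the two tables):
{code}
// Rows whose equi-join key is NULL can never satisfy A.key = B.key,
// so they can be dropped before the join runs.
import sqlContext.implicits._

val a = dfA.filter($"key".isNotNull)
val b = dfB.filter($"key".isNotNull)
val joined = a.join(b, a("key") === b("key"))  // same result as joining dfA and dfB directly
{code}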
[jira] [Updated] (SPARK-9372) For a join operator, rows with null equal join key expression can be filtered out early
[ https://issues.apache.org/jira/browse/SPARK-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9372: --- Target Version/s: 1.6.0 For a join operator, rows with null equal join key expression can be filtered out early --- Key: SPARK-9372 URL: https://issues.apache.org/jira/browse/SPARK-9372 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Taking {{select ... from A join B on (A.key = B.key)}} as an example, we can filter out rows that have null values for column A.key/B.key, because those rows cannot contribute to the join output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9372) For a join operator, rows with null equal join key expression can be filtered out early
[ https://issues.apache.org/jira/browse/SPARK-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9372: --- Fix Version/s: (was: 1.5.0) For a join operator, rows with null equal join key expression can be filtered out early --- Key: SPARK-9372 URL: https://issues.apache.org/jira/browse/SPARK-9372 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Taking {{select ... from A join B on (A.key = B.key)}} as an example, we can filter out rows that have null values for column A.key/B.key, because those rows cannot contribute to the join output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9575) Add documentation around Mesos shuffle service and dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9575: --- Assignee: Apache Spark Add documentation around Mesos shuffle service and dynamic allocation - Key: SPARK-9575 URL: https://issues.apache.org/jira/browse/SPARK-9575 Project: Spark Issue Type: Documentation Components: Mesos Reporter: Timothy Chen Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9575) Add documentation around Mesos shuffle service and dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652634#comment-14652634 ] Apache Spark commented on SPARK-9575: - User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/7907 Add documentation around Mesos shuffle service and dynamic allocation - Key: SPARK-9575 URL: https://issues.apache.org/jira/browse/SPARK-9575 Project: Spark Issue Type: Documentation Components: Mesos Reporter: Timothy Chen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9575) Add documentation around Mesos shuffle service and dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9575: --- Assignee: (was: Apache Spark) Add documentation around Mesos shuffle service and dynamic allocation - Key: SPARK-9575 URL: https://issues.apache.org/jira/browse/SPARK-9575 Project: Spark Issue Type: Documentation Components: Mesos Reporter: Timothy Chen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9575) Add documentation around Mesos shuffle service and dynamic allocation
Timothy Chen created SPARK-9575: --- Summary: Add documentation around Mesos shuffle service and dynamic allocation Key: SPARK-9575 URL: https://issues.apache.org/jira/browse/SPARK-9575 Project: Spark Issue Type: Documentation Components: Mesos Reporter: Timothy Chen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7791) Set user for executors in standalone-mode
[ https://issues.apache.org/jira/browse/SPARK-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652660#comment-14652660 ] Niels Becker commented on SPARK-7791: - We ended up using your workaround. But since our Spark slaves are running inside Docker containers and GlusterFS is mounted on the host machine, we were able to mount only the corresponding user folders into the Docker container by setting {{spark.mesos.executor.docker.volumes}}. This way Spark is not able to write to other users' folders. Set user for executors in standalone-mode - Key: SPARK-7791 URL: https://issues.apache.org/jira/browse/SPARK-7791 Project: Spark Issue Type: Wish Components: Spark Core Reporter: Tomasz Früboes I'm opening this following a discussion in https://www.mail-archive.com/user@spark.apache.org/msg28633.html Our setup was as follows. Spark (1.3.1, prebuilt for hadoop 2.6, also 2.4) was installed in standalone mode and started manually from the root account. Everything worked properly apart from operations such as {{rdd.saveAsPickleFile(ofile)}}, which ended with the exception:
{code}
py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/_temporary/0/task_201505191540_0009_r_01/part-r-2.parquet; isDirectory=false; length=534; replication=1; blocksize=33554432; modification_time=1432042832000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/part-r-2.parquet
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:346)
{code}
(files created in _temporary were owned by user root). It would be great if Spark could set the user for the executors in standalone mode as well. Setting SPARK_USER has no effect here. BTW it may be a good idea to add some warning (e.g. during Spark startup) that running from the root account is not a very healthy idea. E.g. mapping this function
{code}
def test(x):
    f = open('/etc/testTMF.txt', 'w')
    return 0
{code}
over an RDD creates a file in /etc/ (surprisingly, calls like f.write(text) end with an exception) Thanks, Tomasz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9228: --- Assignee: Michael Armbrust (was: Apache Spark) Combine unsafe and codegen into a single option --- Key: SPARK-9228 URL: https://issues.apache.org/jira/browse/SPARK-9228 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Before QA, let's flip on features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9482) flaky test: org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin
[ https://issues.apache.org/jira/browse/SPARK-9482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652659#comment-14652659 ] Davies Liu commented on SPARK-9482: --- The Physical plan looks very strange, it use unsafe BroadcastHashOuterJoin and BroadcastLeftSemiJoinHash, but use safe Projection (which should be TungstenProjection). I tried locally, it does use TungstenProject. Is it possible that conf.unsafeEnabled is flaky? (changed by some tests) flaky test: org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin --- Key: SPARK-9482 URL: https://issues.apache.org/jira/browse/SPARK-9482 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Yin Huai Priority: Blocker Labels: flaky-test https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39059/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/semijoin/ {code} Regression org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin Failing for the past 1 build (Since Failed#39059 ) Took 7.7 sec. Error Message Results do not match for semijoin: == Parsed Logical Plan == 'Sort ['a.key ASC], false 'Project [unresolvedalias('a.key)] 'Join RightOuter, Some(('a.key = 'c.key))'Join LeftSemi, Some(('a.key = 'b.key)) 'UnresolvedRelation [t3], Some(a) 'UnresolvedRelation [t2], Some(b) 'UnresolvedRelation [t1], Some(c) == Analyzed Logical Plan == key: int Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232))Join LeftSemi, Some((key#176228 = key#176230)) MetastoreRelation default, t3, Some(a) MetastoreRelation default, t2, Some(b)MetastoreRelation default, t1, Some(c) == Optimized Logical Plan == Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232))Project [key#176228] Join LeftSemi, Some((key#176228 = key#176230)) Project [key#176228] MetastoreRelation default, t3, Some(a) Project [key#176230] MetastoreRelation default, t2, Some(b)Project [key#176232] MetastoreRelation default, t1, Some(c) == Physical Plan == ExternalSort [key#176228 ASC], false Project [key#176228] ConvertToSafe BroadcastHashOuterJoin [key#176228], [key#176232], RightOuter, None ConvertToUnsafe Project [key#176228] ConvertToSafe BroadcastLeftSemiJoinHash [key#176228], [key#176230], None ConvertToUnsafe HiveTableScan [key#176228], (MetastoreRelation default, t3, Some(a)) ConvertToUnsafe HiveTableScan [key#176230], (MetastoreRelation default, t2, Some(b)) ConvertToUnsafe HiveTableScan [key#176232], (MetastoreRelation default, t1, Some(c)) Code Generation: true == RDD == key !== HIVE - 31 row(s) == == CATALYST - 30 row(s) == 00 00 0 0 00 00 0 0 00 00 00 00 0 0 00 00 0 0 00 00 0 0 00 10 10 10 10 10 10 10 10 !48 !48 !8 NULL !8NULL NULL NULL NULL NULL NULL NULL NULL NULL !NULL Stacktrace sbt.ForkMain$ForkError: Results do not match for semijoin: == Parsed Logical Plan == 'Sort ['a.key ASC], false 'Project [unresolvedalias('a.key)] 'Join RightOuter, Some(('a.key = 'c.key)) 'Join LeftSemi, Some(('a.key = 'b.key)) 'UnresolvedRelation [t3], Some(a) 'UnresolvedRelation [t2], Some(b) 'UnresolvedRelation [t1], Some(c) == Analyzed Logical Plan == key: int Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232)) Join LeftSemi, Some((key#176228 = key#176230)) MetastoreRelation default, t3, Some(a) MetastoreRelation default, t2, Some(b) MetastoreRelation default, t1, Some(c) == Optimized Logical Plan == Sort [key#176228 ASC], false 
Project [key#176228] Join RightOuter, Some((key#176228 = key#176232)) Project [key#176228] Join LeftSemi, Some((key#176228 =
[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652661#comment-14652661 ] Apache Spark commented on SPARK-9228: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/7908 Combine unsafe and codegen into a single option --- Key: SPARK-9228 URL: https://issues.apache.org/jira/browse/SPARK-9228 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Before QA, let's flip on features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9516) Improve Thread Dump page
[ https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9516: --- Assignee: (was: Apache Spark) Improve Thread Dump page Key: SPARK-9516 URL: https://issues.apache.org/jira/browse/SPARK-9516 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Nan Zhu Originally proposed by [~irashid] in https://github.com/apache/spark/pull/7808#issuecomment-126788335: we can enhance the current thread dump page with at least the following two new features: 1) sort threads by thread status, 2) a filter to grep the threads -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9516) Improve Thread Dump page
[ https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9516: --- Assignee: Apache Spark Improve Thread Dump page Key: SPARK-9516 URL: https://issues.apache.org/jira/browse/SPARK-9516 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Nan Zhu Assignee: Apache Spark Originally proposed by [~irashid] in https://github.com/apache/spark/pull/7808#issuecomment-126788335: we can enhance the current thread dump page with at least the following two new features: 1) sort threads by thread status, 2) a filter to grep the threads -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-9054) Rename RowOrdering to InterpretedOrdering and use newOrdering to build orderings
[ https://issues.apache.org/jira/browse/SPARK-9054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-9054. -- Resolution: Won't Fix [~joshrosen] closing this as won't fix for now. We can reopen later if needed. Rename RowOrdering to InterpretedOrdering and use newOrdering to build orderings Key: SPARK-9054 URL: https://issues.apache.org/jira/browse/SPARK-9054 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen There are a few places where we still manually construct RowOrdering instead of using SparkPlan.newOrdering. We should update these to use newOrdering and should rename RowOrdering to InterpretedOrdering to make its function slightly more obvious. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9482) flaky test: org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin
[ https://issues.apache.org/jira/browse/SPARK-9482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9482: --- Sprint: Week 32 flaky test: org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin --- Key: SPARK-9482 URL: https://issues.apache.org/jira/browse/SPARK-9482 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Yin Huai Priority: Blocker Labels: flaky-test https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39059/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/semijoin/ {code} Regression org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin Failing for the past 1 build (Since Failed#39059 ) Took 7.7 sec. Error Message Results do not match for semijoin: == Parsed Logical Plan == 'Sort ['a.key ASC], false 'Project [unresolvedalias('a.key)] 'Join RightOuter, Some(('a.key = 'c.key))'Join LeftSemi, Some(('a.key = 'b.key)) 'UnresolvedRelation [t3], Some(a) 'UnresolvedRelation [t2], Some(b) 'UnresolvedRelation [t1], Some(c) == Analyzed Logical Plan == key: int Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232))Join LeftSemi, Some((key#176228 = key#176230)) MetastoreRelation default, t3, Some(a) MetastoreRelation default, t2, Some(b)MetastoreRelation default, t1, Some(c) == Optimized Logical Plan == Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232))Project [key#176228] Join LeftSemi, Some((key#176228 = key#176230)) Project [key#176228] MetastoreRelation default, t3, Some(a) Project [key#176230] MetastoreRelation default, t2, Some(b)Project [key#176232] MetastoreRelation default, t1, Some(c) == Physical Plan == ExternalSort [key#176228 ASC], false Project [key#176228] ConvertToSafe BroadcastHashOuterJoin [key#176228], [key#176232], RightOuter, None ConvertToUnsafe Project [key#176228] ConvertToSafe BroadcastLeftSemiJoinHash [key#176228], [key#176230], None ConvertToUnsafe HiveTableScan [key#176228], (MetastoreRelation default, t3, Some(a)) ConvertToUnsafe HiveTableScan [key#176230], (MetastoreRelation default, t2, Some(b)) ConvertToUnsafe HiveTableScan [key#176232], (MetastoreRelation default, t1, Some(c)) Code Generation: true == RDD == key !== HIVE - 31 row(s) == == CATALYST - 30 row(s) == 00 00 0 0 00 00 0 0 00 00 00 00 0 0 00 00 0 0 00 00 0 0 00 10 10 10 10 10 10 10 10 !48 !48 !8 NULL !8NULL NULL NULL NULL NULL NULL NULL NULL NULL !NULL Stacktrace sbt.ForkMain$ForkError: Results do not match for semijoin: == Parsed Logical Plan == 'Sort ['a.key ASC], false 'Project [unresolvedalias('a.key)] 'Join RightOuter, Some(('a.key = 'c.key)) 'Join LeftSemi, Some(('a.key = 'b.key)) 'UnresolvedRelation [t3], Some(a) 'UnresolvedRelation [t2], Some(b) 'UnresolvedRelation [t1], Some(c) == Analyzed Logical Plan == key: int Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232)) Join LeftSemi, Some((key#176228 = key#176230)) MetastoreRelation default, t3, Some(a) MetastoreRelation default, t2, Some(b) MetastoreRelation default, t1, Some(c) == Optimized Logical Plan == Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232)) Project [key#176228] Join LeftSemi, Some((key#176228 = key#176230)) Project [key#176228] MetastoreRelation default, t3, Some(a) Project [key#176230] MetastoreRelation default, t2, Some(b) Project [key#176232] MetastoreRelation default, t1, Some(c) == 
Physical Plan == ExternalSort [key#176228 ASC], false Project [key#176228] ConvertToSafe
[jira] [Commented] (SPARK-8064) Upgrade Hive to 1.2
[ https://issues.apache.org/jira/browse/SPARK-8064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652730#comment-14652730 ] Steve Loughran commented on SPARK-8064: --- Also: we had to produce a custom release of hive-exec 1.2.1 with # the same version of Kryo as that used in Chill (2.21) # protobuf shaded (needed to co-exist with protobuf 2.4 on Hadoop 1.x) The source for this is at https://github.com/pwendell/hive/tree/release-1.2.1-spark Upgrade Hive to 1.2 --- Key: SPARK-8064 URL: https://issues.apache.org/jira/browse/SPARK-8064 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Steve Loughran Priority: Blocker Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9578) Stemmer feature transformer
Joseph K. Bradley created SPARK-9578: Summary: Stemmer feature transformer Key: SPARK-9578 URL: https://issues.apache.org/jira/browse/SPARK-9578 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Priority: Minor Transformer mentioned first in [SPARK-5571] based on a suggestion from [~aloknsingh]. Very standard NLP preprocessing task. From [~aloknsingh]: {quote} We have one Scala stemmer in scalanlp%chalk https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze which can easily be copied (as it is an Apache-licensed project) and is in Scala too. I think this will be a better alternative than Lucene's EnglishAnalyzer or OpenNLP. Note: we already use scalanlp%breeze via the Maven dependency, so I think adding a scalanlp%chalk dependency is also an option. But as you said, we can copy the code as it is small. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652787#comment-14652787 ] Joseph K. Bradley commented on SPARK-5571: -- The stopwords transformer made it for 1.5, but the stemmer will need to be in 1.6. Just linked them. LDA should handle text as well -- Key: SPARK-5571 URL: https://issues.apache.org/jira/browse/SPARK-5571 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently operates only on vectors of word counts. It should also support training and prediction using text (Strings). This plan is sketched in the [original LDA design doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing]. There should be: * a runWithText() method which takes an RDD with a collection of Strings (bags of words). This will also index terms and compute a dictionary. * a dictionary parameter for when LDA is run with word count vectors * prediction/feedback methods returning Strings (such as describeTopicsAsStrings, which is commented out in LDA currently) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8887) Explicitly define which data types can be used as dynamic partition columns
[ https://issues.apache.org/jira/browse/SPARK-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8887: -- Target Version/s: 1.5.0 (was: 1.6.0) Explicitly define which data types can be used as dynamic partition columns --- Key: SPARK-8887 URL: https://issues.apache.org/jira/browse/SPARK-8887 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian {{InsertIntoHadoopFsRelation}} implements Hive-compatible dynamic partitioning insertion, which uses {{String.valueOf}} to encode partition column values into dynamic partition directories. This actually limits the data types that can be used as partition columns. For example, the string representation of {{StructType}} values is not well defined. However, this limitation is not explicitly enforced. There are several things we can improve: # Enforce dynamic partition column data type requirements by adding analysis rules that throw {{AnalysisException}} when a violation occurs. # Abstract away the string representation of various data types, so that we don't need to convert internal representation types (e.g. {{UTF8String}}) to external types (e.g. {{String}}). A set of Hive-compatible implementations should be provided to ensure compatibility with Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8887) Explicitly define which data types can be used as dynamic partition columns
[ https://issues.apache.org/jira/browse/SPARK-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-8887: - Assignee: Cheng Lian Explicitly define which data types can be used as dynamic partition columns --- Key: SPARK-8887 URL: https://issues.apache.org/jira/browse/SPARK-8887 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian {{InsertIntoHadoopFsRelation}} implements Hive-compatible dynamic partitioning insertion, which uses {{String.valueOf}} to encode partition column values into dynamic partition directories. This actually limits the data types that can be used as partition columns. For example, the string representation of {{StructType}} values is not well defined. However, this limitation is not explicitly enforced. There are several things we can improve: # Enforce dynamic partition column data type requirements by adding analysis rules that throw {{AnalysisException}} when a violation occurs. # Abstract away the string representation of various data types, so that we don't need to convert internal representation types (e.g. {{UTF8String}}) to external types (e.g. {{String}}). A set of Hive-compatible implementations should be provided to ensure compatibility with Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
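A sketch of the second improvement above (the helper and its cases are assumptions; only the null convention is Hive's real one): dispatch on the external data type instead of calling {{String.valueOf}} on internal values.
{code}
// Illustrative partition-value encoder. "__HIVE_DEFAULT_PARTITION__" is Hive's
// actual convention for null partition values; everything else here is a sketch.
def partitionValueString(value: Any): String = value match {
  case null                => "__HIVE_DEFAULT_PARTITION__"
  case s: String           => s
  case n: java.lang.Number => n.toString
  case d: java.sql.Date    => d.toString
  case other               => throw new IllegalArgumentException(
    s"Unsupported dynamic partition column value: $other")
}
{code}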
[jira] [Updated] (SPARK-9257) Fix the false negative of Aggregate2Sort and FinalAndCompleteAggregate2Sort's missingInput
[ https://issues.apache.org/jira/browse/SPARK-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9257: --- Assignee: Yin Huai Fix the false negative of Aggregate2Sort and FinalAndCompleteAggregate2Sort's missingInput -- Key: SPARK-9257 URL: https://issues.apache.org/jira/browse/SPARK-9257 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Minor
{code}
sqlContext.sql(
  """
    |SELECT sum(value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin).explain()

== Physical Plan ==
Aggregate2Sort Some(List(key#510)), [key#510], [(sum(CAST(value#511, LongType))2,mode=Final,isDistinct=false)], [sum(CAST(value#511, LongType))#1435L], [sum(CAST(value#511, LongType))#1435L AS _c0#1426L]
 ExternalSort [key#510 ASC], false
  Exchange hashpartitioning(key#510)
   Aggregate2Sort None, [key#510], [(sum(CAST(value#511, LongType))2,mode=Partial,isDistinct=false)], [currentSum#1433L], [key#510,currentSum#1433L]
    ExternalSort [key#510 ASC], false
     PhysicalRDD [key#510,value#511], MapPartitionsRDD[97] at apply at Transformer.scala:22

sqlContext.sql(
  """
    |SELECT sum(distinct value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin).explain()

== Physical Plan ==
!FinalAndCompleteAggregate2Sort [key#510,CAST(value#511, LongType)#1446L], [key#510], [(sum(CAST(value#511, LongType)#1446L)2,mode=Complete,isDistinct=false)], [sum(CAST(value#511, LongType))#1445L], [sum(CAST(value#511, LongType))#1445L AS _c0#1438L]
 Aggregate2Sort Some(List(key#510)), [key#510,CAST(value#511, LongType)#1446L], [key#510,CAST(value#511, LongType)#1446L]
  ExternalSort [key#510 ASC,CAST(value#511, LongType)#1446L ASC], false
   Exchange hashpartitioning(key#510)
    !Aggregate2Sort None, [key#510,CAST(value#511, LongType) AS CAST(value#511, LongType)#1446L], [key#510,CAST(value#511, LongType)#1446L]
     ExternalSort [key#510 ASC,CAST(value#511, LongType) AS CAST(value#511, LongType)#1446L ASC], false
      PhysicalRDD [key#510,value#511], MapPartitionsRDD[102] at apply at Transformer.scala:22
{code}
For the examples shown above, you can see there is a {{!}} at the beginning of an operator's {{simpleString}}, which indicates that its {{missingInput}} is not empty. Actually, it is a false negative and we need to fix it. Also, it would be good to make these two operators' {{simpleString}} more reader-friendly (so people can tell what the grouping expressions are, what the aggregate functions are, and what the mode of an aggregate function is). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9251) do not order by expressions which still need evaluation
[ https://issues.apache.org/jira/browse/SPARK-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652675#comment-14652675 ] Apache Spark commented on SPARK-9251: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/7906 do not order by expressions which still need evaluation --- Key: SPARK-9251 URL: https://issues.apache.org/jira/browse/SPARK-9251 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9513) Create Python API for all SQL functions
[ https://issues.apache.org/jira/browse/SPARK-9513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-9513: - Assignee: Davies Liu Create Python API for all SQL functions --- Key: SPARK-9513 URL: https://issues.apache.org/jira/browse/SPARK-9513 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Check all the SQL functions and make sure they have a Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6116) DataFrame API improvement umbrella ticket (Spark 1.5)
[ https://issues.apache.org/jira/browse/SPARK-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6116: --- Summary: DataFrame API improvement umbrella ticket (Spark 1.5) (was: DataFrame API improvement umbrella ticket) DataFrame API improvement umbrella ticket (Spark 1.5) - Key: SPARK-6116 URL: https://issues.apache.org/jira/browse/SPARK-6116 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Labels: DataFrame An umbrella ticket to track improvements and changes needed to make DataFrame API non-experimental. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8416) Thread dump page should highlight Spark executor threads
[ https://issues.apache.org/jira/browse/SPARK-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-8416. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7808 [https://github.com/apache/spark/pull/7808] Thread dump page should highlight Spark executor threads Key: SPARK-8416 URL: https://issues.apache.org/jira/browse/SPARK-8416 Project: Spark Issue Type: Bug Components: Web UI Reporter: Josh Rosen Fix For: 1.5.0 On the Spark thread dump page, it's hard to pick out executor threads from other system threads. The UI should employ some color coding or highlighting to make this more apparent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8466) Bug in SQL Optimizer: Unresolved Attribute after pushing Filter below Project
[ https://issues.apache.org/jira/browse/SPARK-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8466: Description: Input Data: a parquet file stored in hdfs:///data with two columns (lifeAverageBitrateKbps int, playtimems int) === Scripts used in spark-shell:
{code}
val df = sqlContext.parquetFile("hdfs:///data")
import org.apache.spark.sql.types._
val cols = df.schema.fields.map { f =>
  val dataType = f.dataType match {
    case DoubleType | FloatType => DecimalType.Unlimited
    case t => t
  }
  df.col(f.name).cast(dataType).as(f.name)
}
df.select(cols: _*).registerTempTable("t")
val query = """
  |SELECT avg(cleanedplaytimems),
  |       count(1)
  |FROM
  |  (SELECT 0 key,
  |          avg(lifeAverageBitrateKbps) avgbitrate
  |   FROM anon_sdm2_ss
  |   WHERE lifeAverageBitrateKbps > 0) t1,
  |  (SELECT 0 key,
  |          lifeAverageBitrateKbps,
  |          if(playtimems > 0, playtimems, 0) cleanedplaytimems
  |   FROM anon_sdm2_ss
  |   WHERE lifeAverageBitrateKbps > 0) t2
  |WHERE t1.key=t2.key
  |  AND t2.lifeAverageBitrateKbps > 0.5 * t1.avgbitrate
""".stripMargin
sqlContext.sql(query).explain(true)
{code}
=== Output:
{code}
== Analyzed Logical Plan ==
Aggregate [], [AVG(CAST(cleanedplaytimems#110, LongType)) AS _c0#111,COUNT(1) AS _c1#112L]
 Filter ((key#107 = key#109) && (CAST(lifeAverageBitrateKbps#105, DoubleType) > (0.5 * avgbitrate#108)))
  Join Inner, None
   Subquery t1
    Aggregate [], [0 AS key#107,AVG(CAST(lifeAverageBitrateKbps#105, LongType)) AS avgbitrate#108]
     Filter (lifeAverageBitrateKbps#105 > 0)
      Subquery anon_sdm2_ss
       Project [CAST(lifeaveragebitratekbps#27, IntegerType) AS lifeaveragebitratekbps#105,CAST(playtimems#89, IntegerType) AS playtimems#106]
        Relation[lifeaveragebitratekbps#27,playtimems#89] ParquetRelation2(WrappedArray(hdfs:///data),Map(),None,None)
   Subquery t2
    Project [0 AS key#109,lifeAverageBitrateKbps#105,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf((playtimems#106 > 0),playtimems#106,0) AS cleanedplaytimems#110]
     Filter (lifeAverageBitrateKbps#105 > 0)
      Subquery anon_sdm2_ss
       Project [CAST(lifeaveragebitratekbps#27, IntegerType) AS lifeaveragebitratekbps#105,CAST(playtimems#89, IntegerType) AS playtimems#106]
        Relation[lifeaveragebitratekbps#27,playtimems#89] ParquetRelation2(WrappedArray(hdfs:///data),Map(),None,None)

== Optimized Logical Plan ==
Aggregate [], [AVG(CAST(cleanedplaytimems#110, LongType)) AS _c0#111,COUNT(1) AS _c1#112L]
 Project [cleanedplaytimems#110]
  Join Inner, Some(((key#107 = key#109) && (CAST(lifeAverageBitrateKbps#105, DoubleType) > (0.5 * avgbitrate#108))))
   Aggregate [], [0 AS key#107,AVG(CAST(lifeAverageBitrateKbps#105, LongType)) AS avgbitrate#108]
    Project [lifeaveragebitratekbps#27 AS lifeaveragebitratekbps#105]
     !Filter (lifeAverageBitrateKbps#105 > 0)
      Relation[lifeaveragebitratekbps#27,playtimems#89] ParquetRelation2(WrappedArray(hdfs:///data),Map(),None,None)
   Project [0 AS key#109,lifeaveragebitratekbps#27 AS lifeaveragebitratekbps#105,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf((playtimems#89 AS playtimems#106 > 0),playtimems#89 AS playtimems#106,0) AS cleanedplaytimems#110]
    !Filter (lifeAverageBitrateKbps#105 > 0)
     Relation[lifeaveragebitratekbps#27,playtimems#89] ParquetRelation2(WrappedArray(hdfs:///data),Map(),None,None)
{code}
Note: Filter is unresolved
[jira] [Updated] (SPARK-9582) LDA cleanups
[ https://issues.apache.org/jira/browse/SPARK-9582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9582: - Description: Small cleanups to LDA code and recent additions CC: [~fliang] was: LocalLDAModel.logLikelihood resembles that for gensim, but it is not analogous to DistributedLDAModel.likelihood. The former includes the log likelihood of the inferred topics, but the latter does not. This JIRA is for refactoring the former to separate out the log likelihood of the inferred topics. CC: [~fliang] LDA cleanups Key: SPARK-9582 URL: https://issues.apache.org/jira/browse/SPARK-9582 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Small cleanups to LDA code and recent additions CC: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652933#comment-14652933 ] partha bishnu commented on SPARK-9559: -- We tested on 1.4.1 and got the same results, i.e. a new executor JVM did not get started on the other worker node after the node running the jobs stopped running. So it seems like a major defect. Worker redundancy/failover in spark stand-alone mode Key: SPARK-9559 URL: https://issues.apache.org/jira/browse/SPARK-9559 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: partha bishnu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9582) LDA cleanups
[ https://issues.apache.org/jira/browse/SPARK-9582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9582: - Priority: Minor (was: Major) LDA cleanups Key: SPARK-9582 URL: https://issues.apache.org/jira/browse/SPARK-9582 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor LocalLDAModel.logLikelihood resembles that for gensim, but it is not analogous to DistributedLDAModel.likelihood. The former includes the log likelihood of the inferred topics, but the latter does not. This JIRA is for refactoring the former to separate out the log likelihood of the inferred topics. CC: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9582) LDA cleanups
[ https://issues.apache.org/jira/browse/SPARK-9582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9582: - Summary: LDA cleanups (was: Improve clarity of LocalLDAModel log likelihood methods) LDA cleanups Key: SPARK-9582 URL: https://issues.apache.org/jira/browse/SPARK-9582 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley LocalLDAModel.logLikelihood resembles that for gensim, but it is not analogous to DistributedLDAModel.likelihood. The former includes the log likelihood of the inferred topics, but the latter does not. This JIRA is for refactoring the former to separate out the log likelihood of the inferred topics. CC: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9584) HiveHBaseTableInputFormat can't be cached
meiyoula created SPARK-9584: --- Summary: HiveHBaseTableInputFormat can't be cached Key: SPARK-9584 URL: https://issues.apache.org/jira/browse/SPARK-9584 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula The exception below occurs in the Spark on HBase function. {quote} java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@11c6577 rejected from java.util.concurrent.ThreadPoolExecutor@3414350b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17451] {quote} When an executor has many cores, tasks belonging to the same RDD will use the same InputFormat, but HiveHBaseTableInputFormat is not thread-safe. So I think we should add a config that controls whether the InputFormat is cached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9585) HiveHBaseTableInputFormat can't be cached
[ https://issues.apache.org/jira/browse/SPARK-9585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652941#comment-14652941 ] Apache Spark commented on SPARK-9585: - User 'XuTingjun' has created a pull request for this issue: https://github.com/apache/spark/pull/7918 HiveHBaseTableInputFormat can't be cached --- Key: SPARK-9585 URL: https://issues.apache.org/jira/browse/SPARK-9585 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula The exception below occurs in the Spark on HBase function. {quote} java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@11c6577 rejected from java.util.concurrent.ThreadPoolExecutor@3414350b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17451] {quote} When an executor has many cores, tasks belonging to the same RDD will use the same InputFormat, but HiveHBaseTableInputFormat is not thread-safe. So I think we should add a config that controls whether the InputFormat is cached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9585) HiveHBaseTableInputFormat can't be cached
meiyoula created SPARK-9585: --- Summary: HiveHBaseTableInputFormat can't be cached Key: SPARK-9585 URL: https://issues.apache.org/jira/browse/SPARK-9585 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula The exception below occurs in the Spark on HBase function. {quote} java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@11c6577 rejected from java.util.concurrent.ThreadPoolExecutor@3414350b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17451] {quote} When an executor has many cores, tasks belonging to the same RDD will use the same InputFormat, but HiveHBaseTableInputFormat is not thread-safe. So I think we should add a config that controls whether the InputFormat is cached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9228: --- Assignee: Apache Spark (was: Michael Armbrust) Combine unsafe and codegen into a single option --- Key: SPARK-9228 URL: https://issues.apache.org/jira/browse/SPARK-9228 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Apache Spark Priority: Blocker Before QA, let's flip on features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.
[ https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7505: --- Target Version/s: 1.5.0 (was: 1.6.0) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc. Key: SPARK-7505 URL: https://issues.apache.org/jira/browse/SPARK-7505 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark, SQL Affects Versions: 1.3.1 Reporter: Nicholas Chammas Priority: Minor The PySpark docs for DataFrame need the following fixes and improvements: # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over {{\_\_getattr\_\_}} and change all our examples accordingly. # *We should say clearly that the API is experimental.* (That is currently not the case for the PySpark docs.) # We should provide an example of how to join and select from 2 DataFrames that have identically named columns, because it is not obvious:
{code}
df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
df12 = df1.join(df2, df1['a'] == df2['a'])
df12.select(df1['a'], df2['other']).show()

a other
4 I dunno
{code}
# [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy] and [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort] should be marked as aliases if that's what they are. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7544) pyspark.sql.types.Row should implement __getitem__
[ https://issues.apache.org/jira/browse/SPARK-7544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7544: --- Parent Issue: SPARK-9576 (was: SPARK-6116) pyspark.sql.types.Row should implement __getitem__ -- Key: SPARK-7544 URL: https://issues.apache.org/jira/browse/SPARK-7544 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Nicholas Chammas Priority: Minor Following from the related discussions in [SPARK-7505] and [SPARK-7133], the {{Row}} type should implement {{\_\_getitem\_\_}} so that people can do this {code} row['field'] {code} instead of this: {code} row.field {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5517) Add input types for Java UDFs
[ https://issues.apache.org/jira/browse/SPARK-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5517: --- Parent Issue: SPARK-9576 (was: SPARK-6116) Add input types for Java UDFs - Key: SPARK-5517 URL: https://issues.apache.org/jira/browse/SPARK-5517 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7400) PortableDataStream UDT
[ https://issues.apache.org/jira/browse/SPARK-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7400: --- Parent Issue: SPARK-9576 (was: SPARK-6116) PortableDataStream UDT -- Key: SPARK-7400 URL: https://issues.apache.org/jira/browse/SPARK-7400 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Eron Wright Improve support for PortableDataStream in a DataFrame by implementing PortableDataStreamUDT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8802) Decimal.apply(BigDecimal).toBigDecimal may throw NumberFormatException
[ https://issues.apache.org/jira/browse/SPARK-8802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8802: --- Target Version/s: 1.6.0 (was: 1.5.0) Decimal.apply(BigDecimal).toBigDecimal may throw NumberFormatException -- Key: SPARK-8802 URL: https://issues.apache.org/jira/browse/SPARK-8802 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Minor There exist certain BigDecimals that can be converted into Spark SQL's Decimal class but which produce Decimals that cannot be converted back to BigDecimal without throwing NumberFormatException. For instance: {code} val x = BigDecimal(BigInt("18889465931478580854784"), -2147483648) assert(Decimal(x).toBigDecimal === x) {code} will fail with an exception: {code} java.lang.NumberFormatException at java.math.BigDecimal.<init>(BigDecimal.java:511) at java.math.BigDecimal.<init>(BigDecimal.java:757) at scala.math.BigDecimal$.apply(BigDecimal.scala:119) at scala.math.BigDecimal.apply(BigDecimal.scala:324) at org.apache.spark.sql.types.Decimal.toBigDecimal(Decimal.scala:142) at org.apache.spark.sql.types.decimal.DecimalSuite$$anonfun$2.apply$mcV$sp(DecimalSuite.scala:62) at org.apache.spark.sql.types.decimal.DecimalSuite$$anonfun$2.apply(DecimalSuite.scala:60) at org.apache.spark.sql.types.decimal.DecimalSuite$$anonfun$2.apply(DecimalSuite.scala:60) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9577) Surface concrete iterator types in various sort classes
Reynold Xin created SPARK-9577: -- Summary: Surface concrete iterator types in various sort classes Key: SPARK-9577 URL: https://issues.apache.org/jira/browse/SPARK-9577 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the callsite uses that type and JIT can inline the iterator calls. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
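For context, a minimal sketch of the pattern being proposed; the class names below are illustrative stand-ins, not Spark's actual classes. Declaring the concrete iterator class as the return type lets the JIT bind and inline hasNext/next at the call site instead of dispatching through the abstract Iterator interface:
{code}
// Illustrative stand-in for a concrete iterator such as the one a class
// like UnsafeKVExternalSorter could expose.
final class SortedIterator(data: Array[Int]) extends Iterator[Int] {
  private var i = 0
  override def hasNext: Boolean = i < data.length
  override def next(): Int = { val v = data(i); i += 1; v }
}

class Sorter(data: Array[Int]) {
  // Before: the abstract return type hides the concrete class from callers.
  def iteratorAbstract: Iterator[Int] = new SortedIterator(data.sorted)
  // After: the concrete return type lets call sites devirtualize the calls.
  def iteratorConcrete: SortedIterator = new SortedIterator(data.sorted)
}
{code}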
[jira] [Commented] (SPARK-9577) Surface concrete iterator types in various sort classes
[ https://issues.apache.org/jira/browse/SPARK-9577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652768#comment-14652768 ] Apache Spark commented on SPARK-9577: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7911 Surface concrete iterator types in various sort classes --- Key: SPARK-9577 URL: https://issues.apache.org/jira/browse/SPARK-9577 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the callsite uses that type and JIT can inline the iterator calls. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9577) Surface concrete iterator types in various sort classes
[ https://issues.apache.org/jira/browse/SPARK-9577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9577: --- Assignee: Reynold Xin (was: Apache Spark) Surface concrete iterator types in various sort classes --- Key: SPARK-9577 URL: https://issues.apache.org/jira/browse/SPARK-9577 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the callsite uses that type and JIT can inline the iterator calls. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8874) Add missing methods in Word2Vec ML
[ https://issues.apache.org/jira/browse/SPARK-8874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8874: - Shepherd: Joseph K. Bradley Target Version/s: 1.5.0 Add missing methods in Word2Vec ML -- Key: SPARK-8874 URL: https://issues.apache.org/jira/browse/SPARK-8874 Project: Spark Issue Type: New Feature Components: ML Reporter: Manoj Kumar Assignee: Manoj Kumar Add getVectors and findSynonyms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
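As a rough sketch of how the added methods might be used (the exact ML signatures are what this ticket defines, so treat the calls below as assumed shapes mirroring the existing mllib.feature.Word2VecModel API):
{code}
import org.apache.spark.ml.feature.Word2Vec

// Assumes sqlContext is available; a tiny toy corpus for illustration.
val docs = sqlContext.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" ")
).map(Tuple1.apply)).toDF("text")

val model = new Word2Vec()
  .setInputCol("text")
  .setVectorSize(10)
  .fit(docs)

// Proposed: expose the learned word vectors.
val vectors = model.getVectors
// Proposed: find the 3 words closest to "Spark" in the vector space.
val synonyms = model.findSynonyms("Spark", 3)
{code}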
[jira] [Resolved] (SPARK-9483) UTF8String.getPrefix only works in little-endian order
[ https://issues.apache.org/jira/browse/SPARK-9483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9483. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7902 [https://github.com/apache/spark/pull/7902] UTF8String.getPrefix only works in little-endian order -- Key: SPARK-9483 URL: https://issues.apache.org/jira/browse/SPARK-9483 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Matthew Brandyberry Priority: Critical Fix For: 1.5.0 There are two bit-masking operations and a byte reversal that should probably be handled differently on big-endian platforms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
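To make the endianness concern concrete, a minimal self-contained sketch, not Spark's actual implementation: reading the first eight bytes of a string as a long that compares like a byte-wise prefix requires a byte reversal only when the load itself was little-endian:
{code}
import java.nio.{ByteBuffer, ByteOrder}

// Sketch only: build an 8-byte comparable prefix from a string's bytes.
def comparablePrefix(bytes: Array[Byte]): Long = {
  val padded = java.util.Arrays.copyOf(bytes, 8) // zero-pad short strings
  val raw = ByteBuffer.wrap(padded).order(ByteOrder.nativeOrder()).getLong
  if (ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN)
    java.lang.Long.reverseBytes(raw) // restore the string's byte order
  else
    raw // a big-endian load already preserves the string's byte order
}
{code}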
[jira] [Closed] (SPARK-8891) Calling aggregation expressions on null literals fails at runtime
[ https://issues.apache.org/jira/browse/SPARK-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-8891. -- Resolution: Fixed Assignee: Yin Huai (was: Josh Rosen) Fix Version/s: 1.5.0 Fixed by Yin in new aggregates. Calling aggregation expressions on null literals fails at runtime - Key: SPARK-8891 URL: https://issues.apache.org/jira/browse/SPARK-8891 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.0, 1.4.1, 1.5.0 Reporter: Josh Rosen Assignee: Yin Huai Priority: Blocker Fix For: 1.5.0 Queries that call aggregate expressions with null literals, such as {{select avg(null)}} or {{select sum(null)}} fail with various errors due to mishandling of the internal NullType type. For instance, with codegen disabled on a recent 1.5 master: {code} scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$) at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:407) at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:426) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:426) at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:428) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:196) at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:48) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:268) at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:48) at org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:147) at org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:536) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:132) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:125) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} When codegen is enabled, the resulting code fails to compile. The fix for this issue involves changes to Cast and Sum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9526) Utilize randomized tests to reveal potential bugs in sql expressions
[ https://issues.apache.org/jira/browse/SPARK-9526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9526: --- Shepherd: Josh Rosen Assignee: Yijie Shen Utilize randomized tests to reveal potential bugs in sql expressions Key: SPARK-9526 URL: https://issues.apache.org/jira/browse/SPARK-9526 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yijie Shen Assignee: Yijie Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9403) Implement code generation for In / InSet
[ https://issues.apache.org/jira/browse/SPARK-9403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9403: --- Shepherd: Davies Liu Implement code generation for In / InSet Key: SPARK-9403 URL: https://issues.apache.org/jira/browse/SPARK-9403 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin The In expression doesn't have any code generation; it would be great to generate code for it. Note that we should also optimize the generated code for the all-literal case (InSet). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
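For reference, a small sketch of where these expressions come from in the public API (the setup boilerplate is illustrative only). An isin call with all-literal arguments is the case the optimizer can turn into InSet, which the generated code should handle with a set lookup rather than a linear scan:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("in-demo"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(1 to 5).toDF("key")
// isin builds an In expression; with an all-literal list the optimizer
// may rewrite it to InSet.
df.filter(col("key").isin(1, 2, 3)).show()
{code}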
[jira] [Assigned] (SPARK-9581) Add test for JSON UDTs
[ https://issues.apache.org/jira/browse/SPARK-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9581: --- Assignee: Apache Spark (was: Reynold Xin) Add test for JSON UDTs -- Key: SPARK-9581 URL: https://issues.apache.org/jira/browse/SPARK-9581 Project: Spark Issue Type: Test Components: SQL Reporter: Reynold Xin Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9581) Add test for JSON UDTs
[ https://issues.apache.org/jira/browse/SPARK-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9581: --- Assignee: Reynold Xin (was: Apache Spark) Add test for JSON UDTs -- Key: SPARK-9581 URL: https://issues.apache.org/jira/browse/SPARK-9581 Project: Spark Issue Type: Test Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9581) Add test for JSON UDTs
[ https://issues.apache.org/jira/browse/SPARK-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652899#comment-14652899 ] Apache Spark commented on SPARK-9581: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7917 Add test for JSON UDTs -- Key: SPARK-9581 URL: https://issues.apache.org/jira/browse/SPARK-9581 Project: Spark Issue Type: Test Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7119) ScriptTransform doesn't consider the output data type
[ https://issues.apache.org/jira/browse/SPARK-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7119: Target Version/s: 1.5.0 (was: 1.6.0) ScriptTransform doesn't consider the output data type - Key: SPARK-7119 URL: https://issues.apache.org/jira/browse/SPARK-7119 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Cheng Hao Priority: Critical {code:sql} from (from src select transform(key, value) using 'cat' as (thing1 int, thing2 string)) t select thing1 + 2; {code} {noformat} 15/04/24 00:58:55 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: org.apache.spark.sql.types.UTF8String cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at scala.math.Numeric$IntIsIntegral$.plus(Numeric.scala:57) at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:127) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-7148) Configure Parquet block size (row group size) for ML model import/export
[ https://issues.apache.org/jira/browse/SPARK-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-7148: --- Comment: was deleted (was: [~josephkb] If you are busy with other issues, please don't hesitate to assign it to me.) Configure Parquet block size (row group size) for ML model import/export Key: SPARK-7148 URL: https://issues.apache.org/jira/browse/SPARK-7148 Project: Spark Issue Type: Improvement Components: MLlib, SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor It would be nice if we could configure the Parquet buffer size when using Parquet format for ML model import/export. Currently, for some models (trees and ensembles), the schema has 13+ columns. With a default buffer size of 128MB (I think), that puts the allocated buffer way over the default memory made available by run-example. Because of this problem, users have to use spark-submit and explicitly use a larger amount of memory in order to run some ML examples. Is there a simple way to specify {{parquet.block.size}}? I'm not familiar with this part of SparkSQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
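On the closing question: one way to pass the setting through today, sketched under the assumption that the Parquet output format reads it from the Hadoop configuration (key name taken from the description above; sc is the active SparkContext and the 8 MB value is arbitrary):
{code}
// Shrink the Parquet row-group buffer so model export fits in a small heap.
sc.hadoopConfiguration.setInt("parquet.block.size", 8 * 1024 * 1024)
{code}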