[jira] [Commented] (SPARK-19823) Support Gang Distribution of Task

2017-03-04 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896107#comment-15896107
 ] 

DjvuLee commented on SPARK-19823:
-

[~zsxwing] Could you have a look at this?

> Support Gang Distribution of Task 
> --
>
> Key: SPARK-19823
> URL: https://issues.apache.org/jira/browse/SPARK-19823
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.2
>Reporter: DjvuLee
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19823) Support Gang Distribution of Task

2017-03-04 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896096#comment-15896096
 ] 

DjvuLee edited comment on SPARK-19823 at 3/5/17 7:19 AM:
-

When Spark distributes tasks to Executors, it uses a round-robin strategy, which 
means tasks are spread across as many Executors as possible; this is a good strategy 
in most cases.

However, when we introduce dynamic allocation and concurrent jobs, users may need 
gang distribution for a single job, meaning the tasks of one job are packed onto a 
few Executors so that the unused resources can be released.

With the current approach, Executors may not get released, since a long-running 
stage will place at least one task on each Executor, even when each Executor has 
more cores than needed.

We can offer a configuration option for users, and this will not introduce much 
complexity. 
We only need to modify this:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L340

val shuffledOffers = shuffleOffers(filteredOffers)
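
For illustration, a minimal sketch of what such a change could look like, assuming a 
hypothetical configuration key spark.scheduler.gangDistribution (the name is 
illustrative, not an existing Spark setting): instead of always shuffling the offers, 
sort them so that a job is packed onto as few Executors as possible.

{code}
// Hypothetical sketch around TaskSchedulerImpl.resourceOffers; the configuration key
// "spark.scheduler.gangDistribution" is an assumed name, not an existing Spark setting.
val gangDistribution = conf.getBoolean("spark.scheduler.gangDistribution", defaultValue = false)

val shuffledOffers =
  if (gangDistribution) {
    // Pack tasks: offer the Executors with the most free cores first, so a small job
    // stays on a few Executors and the remaining Executors can be released.
    filteredOffers.sortBy(-_.cores)
  } else {
    // Default behaviour: randomize the offers for round-robin spreading.
    shuffleOffers(filteredOffers)
  }
{code}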





was (Author: djvulee):
When Spark distributes tasks to Executors, it uses a round-robin strategy, which 
means tasks are spread across as many Executors as possible; this is a good strategy 
in most cases.

However, when we introduce dynamic allocation and concurrent jobs, users may need 
gang distribution for a single job, meaning the tasks of one job are packed onto a 
few Executors so that the unused resources can be released.

With the current approach, Executors may not get released, since a long-running 
stage will place at least one task on each Executor, even when each Executor has 
more cores than needed.

We can offer a configuration option for users, and this will not introduce much 
complexity. 




> Support Gang Distribution of Task 
> --
>
> Key: SPARK-19823
> URL: https://issues.apache.org/jira/browse/SPARK-19823
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.2
>Reporter: DjvuLee
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19823) Support Gang Distribution of Task

2017-03-04 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896096#comment-15896096
 ] 

DjvuLee edited comment on SPARK-19823 at 3/5/17 7:10 AM:
-

When Spark distributes tasks to Executors, it uses a round-robin strategy, which 
means tasks are spread across as many Executors as possible; this is a good strategy 
in most cases.

However, when we introduce dynamic allocation and concurrent jobs, users may need 
gang distribution for a single job, meaning the tasks of one job are packed onto a 
few Executors so that the unused resources can be released.

With the current approach, Executors may not get released, since a long-running 
stage will place at least one task on each Executor, even when each Executor has 
more cores than needed.

We can offer a configuration option for users, and this will not introduce much 
complexity. 





was (Author: djvulee):
When Spark distributes tasks to Executors, it uses a round-robin strategy, which 
means tasks are spread across as many Executors as possible; this is a good strategy 
in most cases.

However, when we introduce dynamic allocation and concurrent jobs, users may need 
gang distribution for a single job, meaning the tasks of one job are packed onto a 
few Executors so that the unused resources can be released.

With the current approach, Executors may not get released, since a long-running 
stage will place at least one task on each Executor, even when each Executor has 
more cores than needed.

We can offer a configuration option for users, and this will not introduce much 
complexity. 




> Support Gang Distribution of Task 
> --
>
> Key: SPARK-19823
> URL: https://issues.apache.org/jira/browse/SPARK-19823
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.2
>Reporter: DjvuLee
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19823) Support Gang Distribution of Task

2017-03-04 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896096#comment-15896096
 ] 

DjvuLee edited comment on SPARK-19823 at 3/5/17 7:10 AM:
-

When Spark distributes tasks to Executors, it uses a round-robin strategy, which 
means tasks are spread across as many Executors as possible; this is a good strategy 
in most cases.

However, when we introduce dynamic allocation and concurrent jobs, users may need 
gang distribution for a single job, meaning the tasks of one job are packed onto a 
few Executors so that the unused resources can be released.

With the current approach, Executors may not get released, since a long-running 
stage will place at least one task on each Executor, even when each Executor has 
more cores than needed.

We can offer a configuration option for users, and this will not introduce much 
complexity. 





was (Author: djvulee):
When Spark distributes tasks to Executors, it uses a round-robin strategy, which 
means tasks are spread across as many Executors as possible; this is a good strategy 
in most cases.

However, when we introduce dynamic allocation and concurrent jobs, users may need 
gang distribution for a single job, meaning the tasks of one job are packed onto a 
few Executors so that the unused resources can be released.

With the current approach, Executors may not get released, since a long-running 
stage will place at least one task on each Executor, even when each Executor has 
more cores than needed.

We can offer a configuration option for users, and this will not introduce much 
complexity. 




> Support Gang Distribution of Task 
> --
>
> Key: SPARK-19823
> URL: https://issues.apache.org/jira/browse/SPARK-19823
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.2
>Reporter: DjvuLee
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one

2017-03-04 Thread zhuo bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896100#comment-15896100
 ] 

zhuo bao commented on SPARK-13156:
--

I had the same problem, but I found that the problem is with 

val upperBound = (numPartitions-1).toLong

Since upperBound should be 'EXCLUSIVE', the value here should be 
numPartitions.toLong. With that change, all threads/partitions are populated with 
records. 
Note: please verify your record count at the end.
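
A minimal sketch of that adjustment, reusing the example from the issue description 
below (URL, table, and credentials are placeholders taken from the report):

{code}
import java.util.Properties

// Placeholders copied from the example in the issue description below.
val properties = new Properties()
properties.setProperty("driver", "com.teradata.jdbc.TeraDriver")
val url = "jdbc:teradata://oneview/,TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"

val numPartitions = 5
val dbTableTemp = s"(SELECT id MOD $numPartitions AS modulo, id FROM db.table) AS TEMP_TABLE"
val partitionColumn = "modulo"          // values 0 .. numPartitions - 1
val lowerBound = 0L
val upperBound = numPartitions.toLong   // instead of (numPartitions - 1).toLong

val df = sqlContext.read.jdbc(
  url, dbTableTemp, partitionColumn, lowerBound, upperBound, numPartitions, properties)
df.count()  // verify the record count at the end, as suggested above
{code}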

> JDBC using multiple partitions creates additional tasks but only executes on 
> one
> 
>
> Key: SPARK-13156
> URL: https://issues.apache.org/jira/browse/SPARK-13156
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.0
> Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client
>Reporter: Charles Drotar
>
> I can successfully kick off a query through JDBC to Teradata, and when it 
> runs it creates a task on each executor for every partition. The problem is 
> that all of the tasks except for one complete within a couple seconds and the 
> final task handles the entire dataset.
> Example Code:
> private val properties = new java.util.Properties()
> properties.setProperty("driver","com.teradata.jdbc.TeraDriver")
> properties.setProperty("username","foo")
> properties.setProperty("password","bar")
> val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
> val numPartitions = 5
> val dbTableTemp = "( SELECT  id MOD $numPartitions%d AS modulo, id FROM 
> db.table) AS TEMP_TABLE"
> val partitionColumn = "modulo"
> val lowerBound = 0.toLong
> val upperBound = (numPartitions-1).toLong
> val df = 
> sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties)
> df.write.parquet("/output/path/for/df/")
> When I look at the Spark UI I see the 5 tasks, but only 1 is actually 
> querying.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19823) Support Gang Distribution of Task

2017-03-04 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896096#comment-15896096
 ] 

DjvuLee edited comment on SPARK-19823 at 3/5/17 7:10 AM:
-

When Spark distributes tasks to Executors, it uses a round-robin strategy, which 
means tasks are spread across as many Executors as possible; this is a good strategy 
in most cases.

However, when we introduce dynamic allocation and concurrent jobs, users may need 
gang distribution for a single job, meaning the tasks of one job are packed onto a 
few Executors so that the unused resources can be released.

With the current approach, Executors may not get released, since a long-running 
stage will place at least one task on each Executor, even when each Executor has 
more cores than needed.

We can offer a configuration option for users, and this will not introduce much 
complexity. 





was (Author: djvulee):
When Spark distributes tasks to Executors, it uses a round-robin strategy, which 
means tasks are spread across as many Executors as possible; this is a good strategy 
in most cases.

However, when we introduce dynamic allocation and concurrent jobs, users may need 
gang distribution for a single job, meaning the tasks of one job are packed onto a 
few Executors so that the unused resources can be released.

With the current approach, Executors may not get released, since a long-running 
stage will place at least one task on each Executor, even when each Executor has 
more cores than needed.

We can offer a configuration option for users, and this will not introduce much 
complexity. 




> Support Gang Distribution of Task 
> --
>
> Key: SPARK-19823
> URL: https://issues.apache.org/jira/browse/SPARK-19823
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.2
>Reporter: DjvuLee
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19823) Support Gang Distribution of Task

2017-03-04 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896096#comment-15896096
 ] 

DjvuLee commented on SPARK-19823:
-

When Spark distributes tasks to Executors, it uses a round-robin strategy, which 
means tasks are spread across as many Executors as possible; this is a good strategy 
in most cases.

However, when we introduce dynamic allocation and concurrent jobs, users may need 
gang distribution for a single job, meaning the tasks of one job are packed onto a 
few Executors so that the unused resources can be released.

With the current approach, Executors may not get released, since a long-running 
stage will place at least one task on each Executor, even when each Executor has 
more cores than needed.

We can offer a configuration option for users, and this will not introduce much 
complexity. 




> Support Gang Distribution of Task 
> --
>
> Key: SPARK-19823
> URL: https://issues.apache.org/jira/browse/SPARK-19823
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.2
>Reporter: DjvuLee
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19823) Support Gang Distribution of Task

2017-03-04 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896097#comment-15896097
 ] 

DjvuLee commented on SPARK-19823:
-

If this sounds like good advice, I will open a pull request.

> Support Gang Distribution of Task 
> --
>
> Key: SPARK-19823
> URL: https://issues.apache.org/jira/browse/SPARK-19823
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.2
>Reporter: DjvuLee
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19823) Support Gang Distribution of Task

2017-03-04 Thread DjvuLee (JIRA)
DjvuLee created SPARK-19823:
---

 Summary: Support Gang Distribution of Task 
 Key: SPARK-19823
 URL: https://issues.apache.org/jira/browse/SPARK-19823
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Spark Core
Affects Versions: 2.0.2
Reporter: DjvuLee






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19798) Query returns stale results when tables are modified on other sessions

2017-03-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19798:
-
Component/s: (was: Spark Core)
 SQL

> Query returns stale results when tables are modified on other sessions
> --
>
> Key: SPARK-19798
> URL: https://issues.apache.org/jira/browse/SPARK-19798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Giambattista
>
> I observed the problem on the master branch with the thrift server in multisession 
> mode (the default), but I was able to replicate it with spark-shell as well 
> (see below for the sequence to reproduce it).
> I observed cases where changes made in a session (table insert, table 
> renaming) are not visible to other derived sessions (created with 
> session.newSession).
> The problem seems due to the fact that each session has its own 
> tableRelationCache and it does not get refreshed.
> IMO tableRelationCache should be shared in sharedState, maybe in the 
> cacheManager so that refresh of caches for data that is not session-specific 
> such as temporary tables gets centralized.  
> --- Spark shell script
> val spark2 = spark.newSession
> spark.sql("CREATE TABLE test (a int) using parquet")
> spark2.sql("select * from test").show // OK returns empty
> spark.sql("select * from test").show // OK returns empty
> spark.sql("insert into TABLE test values 1,2,3")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 3,2,1
> spark.sql("create table test2 (a int) using parquet")
> spark.sql("insert into TABLE test2 values 4,5,6")
> spark2.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("alter table test rename to test3")
> spark.sql("alter table test2 rename to test")
> spark.sql("alter table test3 rename to test2")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 6,4,5
> spark2.sql("select * from test2").show // ERROR throws 
> java.io.FileNotFoundException
> spark.sql("select * from test2").show // OK returns 3,1,2



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19821) Throw out the Read-only disk information when create file for Shuffle

2017-03-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19821:
-
Priority: Minor  (was: Major)

> Throw out the Read-only disk information when create file for Shuffle
> -
>
> Key: SPARK-19821
> URL: https://issues.apache.org/jira/browse/SPARK-19821
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.2
>Reporter: DjvuLee
>Priority: Minor
>
> java.io.FileNotFoundException: 
> /data01/yarn/nmdata/usercache/tiger/appcache/application_1486364177723_1047735/blockmgr-23098754-a97a-4673-ba73-3de5e167da87/2c/shuffle_55_47_0.index.0347f74b-a9c1-473e-b81f-40be394cc00f
>  (Input/output error)
>   at java.io.FileOutputStream.open0(Native Method)
>   at java.io.FileOutputStream.open(FileOutputStream.java:270)
>   at java.io.FileOutputStream.(FileOutputStream.java:213)
>   at java.io.FileOutputStream.(FileOutputStream.java:162)
>   at 
> org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:143)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:219)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:314)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19821) Throw out the Read-only disk information when create file for Shuffle

2017-03-04 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896079#comment-15896079
 ] 

Shixiong Zhu commented on SPARK-19821:
--

This is more like a Java issue.

> Throw out the Read-only disk information when create file for Shuffle
> -
>
> Key: SPARK-19821
> URL: https://issues.apache.org/jira/browse/SPARK-19821
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.2
>Reporter: DjvuLee
>Priority: Minor
>
> java.io.FileNotFoundException: 
> /data01/yarn/nmdata/usercache/tiger/appcache/application_1486364177723_1047735/blockmgr-23098754-a97a-4673-ba73-3de5e167da87/2c/shuffle_55_47_0.index.0347f74b-a9c1-473e-b81f-40be394cc00f
>  (Input/output error)
>   at java.io.FileOutputStream.open0(Native Method)
>   at java.io.FileOutputStream.open(FileOutputStream.java:270)
>   at java.io.FileOutputStream.(FileOutputStream.java:213)
>   at java.io.FileOutputStream.(FileOutputStream.java:162)
>   at 
> org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:143)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:219)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:314)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19822) CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.

2017-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19822:


Assignee: Apache Spark

> CheckpointSuite.testCheckpointedOperation: should not check 
> checkpointFilesOfLatestTime by the PATH string.
> ---
>
> Key: SPARK-19822
> URL: https://issues.apache.org/jira/browse/SPARK-19822
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19822) CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.

2017-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19822:


Assignee: (was: Apache Spark)

> CheckpointSuite.testCheckpointedOperation: should not check 
> checkpointFilesOfLatestTime by the PATH string.
> ---
>
> Key: SPARK-19822
> URL: https://issues.apache.org/jira/browse/SPARK-19822
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19822) CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.

2017-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896078#comment-15896078
 ] 

Apache Spark commented on SPARK-19822:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17167

> CheckpointSuite.testCheckpointedOperation: should not check 
> checkpointFilesOfLatestTime by the PATH string.
> ---
>
> Key: SPARK-19822
> URL: https://issues.apache.org/jira/browse/SPARK-19822
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19822) CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.

2017-03-04 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-19822:
-

 Summary: CheckpointSuite.testCheckpointedOperation: should not 
check checkpointFilesOfLatestTime by the PATH string.
 Key: SPARK-19822
 URL: https://issues.apache.org/jira/browse/SPARK-19822
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.1.0, 2.0.2
Reporter: Genmao Yu
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2017-03-04 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896070#comment-15896070
 ] 

DjvuLee commented on SPARK-18085:
-

[~vanzin] Thanks for your reply!

Will your new solution generate a separate jar file? I would like to try it if so.
The History Server problem is an issue in our production environment now.

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19821) Throw out the Read-only disk information when create file for Shuffle

2017-03-04 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896067#comment-15896067
 ] 

DjvuLee commented on SPARK-19821:
-

Currently, when the disk is read-only, we just throw a FileNotFoundException. 
We could do better by reporting that the disk is read-only, and maybe we can 
achieve better fault tolerance for a single task.
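
A minimal sketch of the kind of check this suggests; the helper name 
createWritableShuffleFile is illustrative and not existing Spark code:

{code}
import java.io.{File, IOException}

// Illustrative helper, not existing Spark code: report a read-only disk explicitly
// instead of letting FileOutputStream fail with a bare FileNotFoundException.
def createWritableShuffleFile(path: String): File = {
  val file = new File(path)
  val dir = file.getParentFile
  if (dir != null && dir.exists() && !dir.canWrite) {
    throw new IOException(
      s"Cannot create shuffle file $path: directory ${dir.getAbsolutePath} is read-only")
  }
  file
}
{code}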

> Throw out the Read-only disk information when create file for Shuffle
> -
>
> Key: SPARK-19821
> URL: https://issues.apache.org/jira/browse/SPARK-19821
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.2
>Reporter: DjvuLee
>
> java.io.FileNotFoundException: 
> /data01/yarn/nmdata/usercache/tiger/appcache/application_1486364177723_1047735/blockmgr-23098754-a97a-4673-ba73-3de5e167da87/2c/shuffle_55_47_0.index.0347f74b-a9c1-473e-b81f-40be394cc00f
>  (Input/output error)
>   at java.io.FileOutputStream.open0(Native Method)
>   at java.io.FileOutputStream.open(FileOutputStream.java:270)
>   at java.io.FileOutputStream.(FileOutputStream.java:213)
>   at java.io.FileOutputStream.(FileOutputStream.java:162)
>   at 
> org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:143)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:219)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:314)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19821) Throw out the Read-only disk information when create file for Shuffle

2017-03-04 Thread DjvuLee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DjvuLee updated SPARK-19821:

Description: 
java.io.FileNotFoundException: 
/data01/yarn/nmdata/usercache/tiger/appcache/application_1486364177723_1047735/blockmgr-23098754-a97a-4673-ba73-3de5e167da87/2c/shuffle_55_47_0.index.0347f74b-a9c1-473e-b81f-40be394cc00f
 (Input/output error)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.(FileOutputStream.java:213)
at java.io.FileOutputStream.(FileOutputStream.java:162)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:143)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:219)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:314)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

> Throw out the Read-only disk information when create file for Shuffle
> -
>
> Key: SPARK-19821
> URL: https://issues.apache.org/jira/browse/SPARK-19821
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.2
>Reporter: DjvuLee
>
> java.io.FileNotFoundException: 
> /data01/yarn/nmdata/usercache/tiger/appcache/application_1486364177723_1047735/blockmgr-23098754-a97a-4673-ba73-3de5e167da87/2c/shuffle_55_47_0.index.0347f74b-a9c1-473e-b81f-40be394cc00f
>  (Input/output error)
>   at java.io.FileOutputStream.open0(Native Method)
>   at java.io.FileOutputStream.open(FileOutputStream.java:270)
>   at java.io.FileOutputStream.(FileOutputStream.java:213)
>   at java.io.FileOutputStream.(FileOutputStream.java:162)
>   at 
> org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:143)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:219)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:314)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19821) Throw out the Read-only disk information when create file for Shuffle

2017-03-04 Thread DjvuLee (JIRA)
DjvuLee created SPARK-19821:
---

 Summary: Throw out the Read-only disk information when create file 
for Shuffle
 Key: SPARK-19821
 URL: https://issues.apache.org/jira/browse/SPARK-19821
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 2.0.2
Reporter: DjvuLee






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly

2017-03-04 Thread gagan taneja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896061#comment-15896061
 ] 

gagan taneja commented on SPARK-19145:
--

I am suggesting the following changes: 
introduce a function to test for a perfect cast, similar to 

private def perfectCast(expr: Literal, dataType: DataType): Boolean = {
  expr match {
    case Literal(value, StringType) => scala.util.Try {
      Cast(expr, dataType).eval(null)
    }.isSuccess
    case _ => false
  }
}

and promote the string only when the input string can be perfectly cast:

// We should cast all relative timestamp/date/string comparison into string comparisons
// This behaves as a user would expect because timestamp strings sort lexicographically.
// i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true
// For cases where it is an exact cast we should cast the String type to timestamp instead.
// This would speed up the execution
// i.e. TimeStamp(2013-01-01 00:00T ...) < '2017-01-02 19:53:51' would translate to
// TimeStamp(2013-01-01 00:00T ...) < Timestamp(2017-01-02 19:53:51)
case p @ BinaryComparison(left @ Literal(_, StringType), right @ DateType())
    if acceptedDataTypes.contains(right) && perfectCast(left, right.dataType) =>
  p.makeCopy(Array(Cast(left, right.dataType), right))
case p @ BinaryComparison(left @ StringType(), right @ DateType())
    if acceptedDataTypes.contains(right) =>
  p.makeCopy(Array(left, Cast(right, StringType)))
case p @ BinaryComparison(left @ DateType(), right @ Literal(_, StringType))
    if acceptedDataTypes.contains(left) && perfectCast(right, left.dataType) =>
  p.makeCopy(Array(left, Cast(right, left.dataType)))
case p @ BinaryComparison(left @ DateType(), right @ StringType())
    if acceptedDataTypes.contains(left) =>
  p.makeCopy(Array(Cast(left, StringType), right))

> Timestamp to String casting is slowing the query significantly
> --
>
> Key: SPARK-19145
> URL: https://issues.apache.org/jira/browse/SPARK-19145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>
> I have a time series table with a timestamp column. 
> The following query
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> is significantly SLOWER than 
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> After investigation I found that in the first query the time column is cast to 
> String before applying the filter. 
> However, in the second query no such casting is performed and the filter uses 
> a long value. 
> Below is the generated physical plan for the slower execution, followed by the 
> physical plan for the faster execution. 
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3339L])
>  +- *Project
> +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) 
> >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 
> 19:53:51))
>+- *FileScan parquet default.cstat[time#3314] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: 
> struct
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3287L])
>  +- *Project
> +- *Filter ((isnotnull(time#3262) && (time#3262 >= 
> 148340483100)) && (time#3262 <= 148400963100))
>+- *FileScan parquet default.cstat[time#3262] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/

[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly

2017-03-04 Thread gagan taneja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896046#comment-15896046
 ] 

gagan taneja commented on SPARK-19145:
--

17/03/04 19:05:32 TRACE HiveSessionState$$anon$1: 
=== Applying Rule org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings ===

Before
'Project [*]
!+- 'Filter (time#88 >= 2017-01-02 19:53:51)
   +- SubqueryAlias person_view, `rule_test`.`person_view`
      +- Project [gen_attr_0#83 AS name#86, gen_attr_1#84 AS age#87, gen_attr_2#85 AS time#88]
         +- SubqueryAlias person_table
            +- Project [gen_attr_0#83, gen_attr_1#84, gen_attr_2#85]
               +- Filter (gen_attr_1#84 > 10)
                  +- SubqueryAlias gen_subquery_0
                     +- Project [name#89 AS gen_attr_0#83, age#90 AS gen_attr_1#84, time#91 AS gen_attr_2#85]
                        +- MetastoreRelation rule_test, person_table

After
'Project [*]
+- Filter (cast(time#88 as string) >= 2017-01-02 19:53:51)
   +- SubqueryAlias person_view, `rule_test`.`person_view`
      +- Project [gen_attr_0#83 AS name#86, gen_attr_1#84 AS age#87, gen_attr_2#85 AS time#88]
         +- SubqueryAlias person_table
            +- Project [gen_attr_0#83, gen_attr_1#84, gen_attr_2#85]
               +- Filter (gen_attr_1#84 > 10)
                  +- SubqueryAlias gen_subquery_0
                     +- Project [name#89 AS gen_attr_0#83, age#90 AS gen_attr_1#84, time#91 AS gen_attr_2#85]
                        +- MetastoreRelation rule_test, person_table

> Timestamp to String casting is slowing the query significantly
> --
>
> Key: SPARK-19145
> URL: https://issues.apache.org/jira/browse/SPARK-19145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>
> I have a time series table with a timestamp column. 
> The following query
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> is significantly SLOWER than 
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> After investigation I found that in the first query the time column is cast to 
> String before applying the filter. 
> However, in the second query no such casting is performed and the filter uses 
> a long value. 
> Below is the generated physical plan for the slower execution, followed by the 
> physical plan for the faster execution. 
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3339L])
>  +- *Project
> +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) 
> >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 
> 19:53:51))
>+- *FileScan parquet default.cstat[time#3314] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: 
> struct
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3287L])
>  +- *Project
> +- *Filt

[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly

2017-03-04 Thread gagan taneja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896040#comment-15896040
 ] 

gagan taneja commented on SPARK-19145:
--

The code responsible for this belongs to the class below and handles type coercion. 
Although this is logically correct, it also slows down binary comparison, because 
during execution the interval types are cast to String and the comparison is done 
with string operators, resulting in 10x slower performance.

org.apache.spark.sql.catalyst.analysis.TypeCoercion {
.

// We should cast all relative timestamp/date/string comparison into string comparisons
// This behaves as a user would expect because timestamp strings sort lexicographically.
// i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true
case p @ BinaryComparison(left @ StringType(), right @ DateType()) =>
  p.makeCopy(Array(left, Cast(right, StringType)))
case p @ BinaryComparison(left @ DateType(), right @ StringType()) =>
  p.makeCopy(Array(Cast(left, StringType), right))
case p @ BinaryComparison(left @ StringType(), right @ TimestampType()) =>
  p.makeCopy(Array(left, Cast(right, StringType)))
case p @ BinaryComparison(left @ TimestampType(), right @ StringType()) =>
  p.makeCopy(Array(Cast(left, StringType), right))

// Comparisons between dates and timestamps.
case p @ BinaryComparison(left @ TimestampType(), right @ DateType()) =>
  p.makeCopy(Array(Cast(left, StringType), Cast(right, StringType)))
case p @ BinaryComparison(left @ DateType(), right @ TimestampType()) =>
  p.makeCopy(Array(Cast(left, StringType), Cast(right, StringType)))


> Timestamp to String casting is slowing the query significantly
> --
>
> Key: SPARK-19145
> URL: https://issues.apache.org/jira/browse/SPARK-19145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>
> I have a time series table with a timestamp column. 
> The following query
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> is significantly SLOWER than 
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> After investigation I found that in the first query the time column is cast to 
> String before applying the filter. 
> However, in the second query no such casting is performed and the filter uses 
> a long value. 
> Below is the generated physical plan for the slower execution, followed by the 
> physical plan for the faster execution. 
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3339L])
>  +- *Project
> +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) 
> >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 
> 19:53:51))
>+- *FileScan parquet default.cstat[time#3314] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: 
> struct
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3287L])
>  +- *Project
> +- *Filter ((isnotnull(time#3262) && (time#3262 >= 
> 148340483100)) && (time#3262 <= 148400963100))
>+- *FileScan parquet default.cstat[time#3262] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time), 
> GreaterThanOrEqual(time,2017-01-02 19:53:51.0), 
> LessThanOrEqual(time,2017-01-09..., ReadSchema: struct
> In Impala both queries run efficiently without any performance difference.
> Spark should be able to parse the Date string and conv

[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-04 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896033#comment-15896033
 ] 

Daniel Li commented on SPARK-6407:
--

Appreciate the quick reply, [~srowen].

Yeah, we'd be recomputing them, but not from scratch since we'd be starting 
with optimized _U_ and _V_.  It would likely take only one or two iterations 
before reconvergence.  Would this still be considered too expensive?

The thing I hesitate about regarding fold-in updating is that the assumption 
that only the corresponding user row and item row will change may be too 
simplifying (since, of course, there's a "rippling out" effect: all items the 
user rated previously need to be updated, then all users that rated any of those 
items would need updating, etc.).  Then again, even if we take this rippling 
into account the computation may not be too expensive, since a single update 
likely won't affect the RMSE enough to delay convergence.  (Though I haven't 
worked out the math showing this; it's just a hunch.)

Do you have any insights into this?
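
For reference, a sketch of the fold-in style update being discussed, under the usual 
regularized least-squares ALS formulation (the notation is mine, not from this thread): 
with the item factors V held fixed, the new or changed user row is the solution of a 
small ridge regression over only the items that user has rated.

{code}
% Fold-in update for a single user i, with the item factors V held fixed.
% \Omega_i: items rated by user i,  r_i: their ratings,  \lambda: regularization.
u_i = ( V_{\Omega_i}^T V_{\Omega_i} + \lambda I )^{-1} V_{\Omega_i}^T r_i
{code}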

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19541) High Availability support for ThriftServer

2017-03-04 Thread gagan taneja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896032#comment-15896032
 ] 

gagan taneja commented on SPARK-19541:
--

This would be a great improvement, as we are also leveraging the Thrift Server for 
JDBC/ODBC connectivity.


> High Availability support for ThriftServer
> --
>
> Key: SPARK-19541
> URL: https://issues.apache.org/jira/browse/SPARK-19541
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: LvDongrong
>
> Currently, We use the spark ThriftServer frequently, and there are many 
> connects between the client and only ThriftServer.When the ThriftServer is 
> down ,we cannot get the service again until the server is restarted .So we 
> need to consider the ThriftServer HA as well as HiveServer HA. For 
> ThriftServer, we want to import the pattern of HiveServer HA to provide 
> ThriftServer HA. Therefore, we need to start multiple thrift server which 
> register on the zookeeper. Then the client  can find the thrift server by 
> just connecting to the zookeeper.So the beeline can get the service from 
> other thrift server when one is down.
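
For reference, a sketch of how a client would use the HiveServer2-style discovery 
pattern the description refers to; the hostnames and the ZooKeeper namespace below 
are placeholders, not an existing Spark configuration.

{code}
import java.sql.DriverManager

// Hypothetical client view of the ZooKeeper-based HA pattern described above: the JDBC
// URL points at the ZooKeeper ensemble and the driver resolves a live Thrift server.
val url = "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;serviceDiscoveryMode=zooKeeper;" +
  "zooKeeperNamespace=sparkThriftServer"
val conn = DriverManager.getConnection(url, "user", "")
{code}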



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD

2017-03-04 Thread gagan taneja (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gagan taneja updated SPARK-19705:
-
Shepherd: Herman van Hovell

> Preferred location supporting HDFS Cache for FileScanRDD
> 
>
> Key: SPARK-19705
> URL: https://issues.apache.org/jira/browse/SPARK-19705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>
> Although NewHadoopRDD and HadoopRdd considers HDFS cache while calculating 
> preferredLocations, FileScanRDD do not take into account HDFS cache while 
> calculating preferredLocations
> The enhancement can be easily implemented for large files where FilePartition 
> only contains single HDFS file
> The enhancement will also result in significant performance improvement for 
> cached hdfs partitions



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD

2017-03-04 Thread gagan taneja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896026#comment-15896026
 ] 

gagan taneja commented on SPARK-19705:
--

Herman,
Can you help me with this enhancement?
Thanks

> Preferred location supporting HDFS Cache for FileScanRDD
> 
>
> Key: SPARK-19705
> URL: https://issues.apache.org/jira/browse/SPARK-19705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>
> Although NewHadoopRDD and HadoopRdd considers HDFS cache while calculating 
> preferredLocations, FileScanRDD do not take into account HDFS cache while 
> calculating preferredLocations
> The enhancement can be easily implemented for large files where FilePartition 
> only contains single HDFS file
> The enhancement will also result in significant performance improvement for 
> cached hdfs partitions



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-04 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895998#comment-15895998
 ] 

Eric Maynard edited comment on SPARK-19656 at 3/5/17 12:58 AM:
---

Normally after getting the `datum` you should call `asInstanceOf` to cast it 
properly.
  
In any event in Spark 2.0 the easier way to achieve what you want is probably 
something like this:

{code:java}
import com.databricks.spark.avro._
val df = spark.read.avro("file.avro")
val extracted = df.map(row => (row(0).asInstanceOf[MyCustomClass]))
{code}


was (Author: emaynard):
Normally after getting the `datum` you should call `asInstanceOf` to cast it 
properly.
  
In any event in Spark 2.0 the easier way to achieve what you want is probably 
something like this:

{code:scala}
import com.databricks.spark.avro._
val df = spark.read.avro("file.avro")
val extracted = df.map(row => (row(0).asInstanceOf[MyCustomClass]))
{code}

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-04 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895998#comment-15895998
 ] 

Eric Maynard commented on SPARK-19656:
--

Normally after getting the `datum` you should call `asInstanceOf` to cast it 
properly.
  
In any event in Spark 2.0 the easier way to achieve what you want is probably 
something like this:

{code:scala}
import com.databricks.spark.avro._
val df = spark.read.avro("file.avro")
val extracted = df.map(row => (row(0).asInstanceOf[MyCustomClass]))
{code}

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass>{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat<MyCustomAvroKey, NullWritable>{
> @Override
> public RecordReader<MyCustomAvroKey, NullWritable> 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19713) saveAsTable

2017-03-04 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895991#comment-15895991
 ] 

Eric Maynard commented on SPARK-19713:
--

In general, instead of using `DataFrameWriter.saveAsTable`, I find it's better to 
create the table in advance and then insert data into it with 
`DataFrameWriter.insertInto`. If you choose to use `DataFrameWriter.saveAsTable`, 
there is a chance of the folder being created and the Hive table not being updated, 
but as a developer you can handle these errors with `HiveContext.refreshTable` or by 
using `FileSystem.delete`. I think this is not an issue.
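
For example, a minimal sketch of that create-in-advance plus insertInto flow (table 
and column names are illustrative; assumes a HiveContext named sqlContext):

{code}
// Create the target table up front so the metastore always knows about it.
sqlContext.sql("CREATE TABLE IF NOT EXISTS brokentable (id INT, name STRING) STORED AS PARQUET")

val df3 = sqlContext.sql("SELECT id, name FROM testtable")
df3.write.insertInto("brokentable")

// If the metastore and the HDFS folder ever get out of sync, refresh the table.
sqlContext.refreshTable("brokentable")
{code}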

> saveAsTable
> ---
>
> Key: SPARK-19713
> URL: https://issues.apache.org/jira/browse/SPARK-19713
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Balaram R Gadiraju
>
> Hi,
> I just observed this behaviour with dataframe.saveAsTable("table") (in old 
> versions) and dataframe.write.saveAsTable("table") (in newer versions).
> When using the method “df3.saveAsTable("brokentable")” in 
> Scala code, Spark creates a folder in HDFS but doesn’t update the Hive metastore 
> to say that it plans to create the table. So if anything goes wrong in between, 
> the folder still exists and Hive is not aware of its creation. This will 
> block users from creating the table “brokentable” because the folder already 
> exists; we can remove the folder using “hadoop fs -rmr 
> /data/hive/databases/testdb.db/brokentable”. Below is a workaround 
> that will enable you to continue development work.
> Current Code:
> val df3 = sqlContext.sql("select * fromtesttable")
> df3.saveAsTable("brokentable")
> THE WORKAROUND:
> By registering the DataFrame as table and then using sql command to load the 
> data will resolve the issue. EX:
> val df3 = sqlContext.sql("select * from testtable").registerTempTable("df3")
> sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19820) Allow reason to be specified for task kill

2017-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19820:


Assignee: (was: Apache Spark)

> Allow reason to be specified for task kill
> --
>
> Key: SPARK-19820
> URL: https://issues.apache.org/jira/browse/SPARK-19820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19820) Allow reason to be specified for task kill

2017-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895954#comment-15895954
 ] 

Apache Spark commented on SPARK-19820:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/17166

> Allow reason to be specified for task kill
> --
>
> Key: SPARK-19820
> URL: https://issues.apache.org/jira/browse/SPARK-19820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19820) Allow reason to be specified for task kill

2017-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19820:


Assignee: Apache Spark

> Allow reason to be specified for task kill
> --
>
> Key: SPARK-19820
> URL: https://issues.apache.org/jira/browse/SPARK-19820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19820) Allow reason to be specified for task kill

2017-03-04 Thread Eric Liang (JIRA)
Eric Liang created SPARK-19820:
--

 Summary: Allow reason to be specified for task kill
 Key: SPARK-19820
 URL: https://issues.apache.org/jira/browse/SPARK-19820
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Eric Liang
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8556) Beeline script throws ClassNotFoundException

2017-03-04 Thread Arvind Surve (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895898#comment-15895898
 ] 

Arvind Surve commented on SPARK-8556:
-

Hi Cheng,

Would you mind sharing the configuration issue you had?

-Arvind

> Beeline script throws ClassNotFoundException
> 
>
> Key: SPARK-8556
> URL: https://issues.apache.org/jira/browse/SPARK-8556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> 1.5.0-SNAPSHOT, commit 1dfb0f7b2aed5ee6d07543fdeac8ff7c777b63b9
> Build Spark with:
> {noformat}
> $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1
> {noformat}
> Start HiveThriftServer2 with:
> {noformat}
> $ ./sbin/start-thriftserver.sh
> {noformat}
> Run Beeline and quit immediately:
> {noformat}
> $ ./bin/beeline -u jdbc:hive2://localhost:1
> Connecting to jdbc:hive2://localhost:1
> org/apache/hive/service/cli/thrift/TCLIService$Iface
> Beeline version 1.5.0-SNAPSHOT by Apache Hive
> 0: jdbc:hive2://localhost:1> Exception in thread "main" 
> java.lang.NoClassDefFoundError: 
> org/apache/hive/service/cli/thrift/TCLIService$Iface
> at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
> at java.sql.DriverManager.getConnection(DriverManager.java:664)
> at java.sql.DriverManager.getConnection(DriverManager.java:208)
> at 
> org.apache.hive.beeline.DatabaseConnection.connect(DatabaseConnection.java:145)
> at 
> org.apache.hive.beeline.DatabaseConnection.getConnection(DatabaseConnection.java:186)
> at org.apache.hive.beeline.Commands.close(Commands.java:802)
> at org.apache.hive.beeline.Commands.closeall(Commands.java:784)
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:673)
> at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:368)
> at org.apache.hive.beeline.BeeLine.main(BeeLine.java:351)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hive.service.cli.thrift.TCLIService$Iface
> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 10 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16844) Generate code for sort based aggregation

2017-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895762#comment-15895762
 ] 

Apache Spark commented on SPARK-16844:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17164

> Generate code for sort based aggregation
> 
>
> Key: SPARK-16844
> URL: https://issues.apache.org/jira/browse/SPARK-16844
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: yucai
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16617) Upgrade to Avro 1.8.x

2017-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895756#comment-15895756
 ] 

Apache Spark commented on SPARK-16617:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17163

> Upgrade to Avro 1.8.x
> -
>
> Key: SPARK-16617
> URL: https://issues.apache.org/jira/browse/SPARK-16617
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ben McCann
>
> Avro 1.8 makes Avro objects serializable so that you can easily have an RDD 
> containing Avro objects.
> See https://issues.apache.org/jira/browse/AVRO-1502



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()

2017-03-04 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895740#comment-15895740
 ] 

Kazuaki Ishizaki commented on SPARK-19503:
--

Is it better to control whether we prune local and global sorts by using an 
option (property) in SQLConf?
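
For concreteness, a minimal illustration of the pattern in question (assuming a 
SparkSession named spark): the global sort below cannot change the result of the 
count, so it is exactly the kind of operator such an option could let the optimizer 
prune.

{code:scala}
import spark.implicits._

val df = spark.range(0, 1000000L).toDF("id")
// Adds a shuffle + sort to the physical plan, yet returns the same value as
// df.count(); this is the redundant work being discussed.
val n = df.sort($"id".desc).count()
{code}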

> Execution Plan Optimizer: avoid sort or shuffle when it does not change end 
> result such as df.sort(...).count()
> ---
>
> Key: SPARK-19503
> URL: https://issues.apache.org/jira/browse/SPARK-19503
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
> Environment: Perhaps only a pyspark or databricks AWS issue
>Reporter: R
>Priority: Minor
>  Labels: execution, optimizer, plan, query
>
> df.sort(...).count()
> performs a shuffle and a sort and then the count. This is wasteful, as the sort 
> is not required here, and it makes me wonder how smart the algebraic optimiser 
> really is. The data may be partitioned with a known count (as with Parquet files), 
> and we should not shuffle just to perform a count.
> This may look trivial, but if the optimiser fails to recognise this, I wonder 
> what else it is missing, especially in more complex operations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19550) Remove reflection, docs, build elements related to Java 7

2017-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895677#comment-15895677
 ] 

Apache Spark commented on SPARK-19550:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/17162

> Remove reflection, docs, build elements related to Java 7
> -
>
> Key: SPARK-19550
> URL: https://issues.apache.org/jira/browse/SPARK-19550
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Documentation, Spark Core
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
> Fix For: 2.2.0
>
>
> - Move external/java8-tests tests into core, streaming, sql and remove
> - Remove MaxPermGen and related options
> - Fix some reflection / TODOs around Java 8+ methods
> - Update doc references to 1.7/1.8 differences
> - Remove Java 7/8 related build profiles
> - Update some plugins for better Java 8 compatibility
> - Fix a few Java-related warnings



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2017-03-04 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895629#comment-15895629
 ] 

Nick Pentreath commented on SPARK-7146:
---

Personally I support marking these as DeveloperApi - they are going to be used by 
developers of extensions & custom pipeline components, and I think we can make a 
best effort not to break things without having to absolutely guarantee it.
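
To make that concrete, here is a minimal sketch (not current API) of what a custom 
pipeline component could look like if shared traits such as HasInputCol/HasOutputCol 
were public or DeveloperApi; today they are private[ml], so code like this only 
compiles inside the org.apache.spark.ml package:

{code:scala}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.upper
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical third-party Transformer reusing the shared column params instead of
// redefining inputCol/outputCol itself.
class UpperCaser(override val uid: String) extends Transformer
    with HasInputCol with HasOutputCol {

  def this() = this(Identifiable.randomUID("upperCaser"))

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn($(outputCol), upper(dataset($(inputCol))))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField($(outputCol), StringType))

  override def copy(extra: ParamMap): UpperCaser = defaultCopy(extra)
}
{code}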

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Proposal: Make most of the Param traits in sharedParams.scala public.  Mark 
> them as DeveloperApi.
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-04 Thread Danilo Ascione (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895619#comment-15895619
 ] 

Danilo Ascione commented on SPARK-14409:


I can help with both PRs. Please consider that the solution in [PR 
16618|https://github.com/apache/spark/pull/16618] is a DataFrame-API-based 
version of the one in [PR 12461|https://github.com/apache/spark/pull/12461]. In any 
case, I'd like to help review an alternative solution. Thanks!

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14273) Add FileFormat.isSplittable to indicate whether a format is splittable

2017-03-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14273.
---
Resolution: Duplicate

> Add FileFormat.isSplittable to indicate whether a format is splittable
> --
>
> Key: SPARK-14273
> URL: https://issues.apache.org/jira/browse/SPARK-14273
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{FileSourceStrategy}} assumes that all data source formats are splittable 
> and always splits data files by a fixed partition size. However, not all 
> HDFS-based formats are splittable. We need a flag to indicate that and ensure that 
> non-splittable files won't be split into multiple Spark partitions.
> (PS: Is it "splitable" or "splittable"? Probably the latter one? Hadoop uses 
> the former one though...)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895605#comment-15895605
 ] 

Sean Owen commented on SPARK-6407:
--

How is it different from recomputing all of U and V?
Doing anything to all of the matrices is probably out of the question for an 
online update. 
The point of fold-in is to update only the two affected rows and make the 
simplifying assumption that nothing else changes, because it would be too 
expensive to recompute anything.
If you mean batch together enough to make it worthwhile to update, then yes at 
some point that's worth it, but it just reduces to re-running the batch 
algorithm for a few iterations again.
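
For reference, a minimal sketch of that per-row fold-in step (assuming Breeze for 
the small k-by-k solve; the names are illustrative and this is not existing MLlib 
API): given the item factors for the items a user has rated and the user's ratings, 
the user's factor row is refreshed by one regularized least-squares solve while V is 
held fixed.

{code:scala}
import breeze.linalg.{DenseMatrix, DenseVector, inv}

// ratedItemFactors: n_i x k slice of V for the items this user rated
// ratings:          the user's n_i ratings
// returns the refreshed k-dimensional user factor row
def foldInUserRow(ratedItemFactors: DenseMatrix[Double],
                  ratings: DenseVector[Double],
                  lambda: Double): DenseVector[Double] = {
  val k = ratedItemFactors.cols
  val gram = ratedItemFactors.t * ratedItemFactors                  // k x k Gram matrix
  val reg = DenseMatrix.eye[Double](k) * (lambda * ratings.length)  // lambda * n_i * I
  inv(gram + reg) * (ratedItemFactors.t * ratings)
}
{code}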

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19819) Use concrete data in SparkR DataFrame examples

2017-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19819:


Assignee: Apache Spark

> Use concrete data in SparkR DataFrame examples 
> ---
>
> Key: SPARK-19819
> URL: https://issues.apache.org/jira/browse/SPARK-19819
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> Many examples in SparkDataFrame methods use: 
> {code}
> path <- "path/to/file.json"
> df <- read.json(path)
> {code}
> This is not directly runnable. Replace this with real numerical examples so 
> that users can directly execute the examples. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19819) Use concrete data in SparkR DataFrame examples

2017-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19819:


Assignee: (was: Apache Spark)

> Use concrete data in SparkR DataFrame examples 
> ---
>
> Key: SPARK-19819
> URL: https://issues.apache.org/jira/browse/SPARK-19819
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Priority: Minor
>
> Many examples in SparkDataFrame methods use: 
> {code}
> path <- "path/to/file.json"
> df <- read.json(path)
> {code}
> This is not directly runnable. Replace this with real numerical examples so 
> that users can directly execute the examples. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19819) Use concrete data in SparkR DataFrame examples

2017-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895599#comment-15895599
 ] 

Apache Spark commented on SPARK-19819:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/17161

> Use concrete data in SparkR DataFrame examples 
> ---
>
> Key: SPARK-19819
> URL: https://issues.apache.org/jira/browse/SPARK-19819
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Priority: Minor
>
> Many examples in SparkDataFrame methods use: 
> {code}
> path <- "path/to/file.json"
> df <- read.json(path)
> {code}
> This is not directly runnable. Replace this with real numerical examples so 
> that users can directly execute the examples. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19819) Use concrete data in SparkR DataFrame examples

2017-03-04 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-19819:
---

 Summary: Use concrete data in SparkR DataFrame examples 
 Key: SPARK-19819
 URL: https://issues.apache.org/jira/browse/SPARK-19819
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Wayne Zhang
Priority: Minor


Many examples in SparkDataFrame methods use: 
{code}
path <- "path/to/file.json"
df <- read.json(path)
{code}

This is not directly runnable. Replace this with real numerical examples so 
that users can directly execute the examples. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-04 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895589#comment-15895589
 ] 

Daniel Li commented on SPARK-6407:
--

Reviving this thread since I'm interested in implementing streaming CF for 
Spark.

bq. Using ALS for online updates is expensive.

Recomputing the factor matrices _U_ and _V_ from scratch for every update would 
be terribly expensive, but what about keeping _U_ and _V_ around and simply 
recomputing another round or two after each new rating that comes in?  The 
algorithm would simply be continually following a moving optimum.  I can't 
imagine the RMSE changing much due to small updates if we use a convergence 
threshold _à la_ [Y. Zhou, et al., “Large-Scale Parallel Collaborative 
Filtering for the Netflix Prize”|http://dl.acm.org/citation.cfm?id=1424269] 
instead of a fixed number of iterations.

(In fact, since calculating _(U^T) * V_ would probably take a nontrivial slice 
of time, new updates that come in during a round of calculation could be 
"batched" into the next round of calculation, increasing efficiency.)

Thoughts?

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19792) In the Master Page,the column named “Memory per Node” ,I think it is not all right

2017-03-04 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895576#comment-15895576
 ] 

liuxian commented on SPARK-19792:
-

I think it refers to the memory allocated to each executor by the worker.
From the code, the value of this column is 
"{Utils.megabytesToString(app.desc.memoryPerExecutorMB)}".

> In the Master Page,the column named “Memory per Node” ,I think  it is not all 
> right
> ---
>
> Key: SPARK-19792
> URL: https://issues.apache.org/jira/browse/SPARK-19792
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: liuxian
>Priority: Trivial
>
> Open the Spark web UI. The Master page has two tables, Running Applications and 
> Completed Applications, each with a column named “Memory per Node”. I think this 
> is not quite right, because a node may have more than one executor, so the column 
> should be named “Memory per Executor”. Otherwise it is easy for users to 
> misunderstand.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org