[jira] [Commented] (SPARK-19823) Support Gang Distribution of Task
[ https://issues.apache.org/jira/browse/SPARK-19823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896107#comment-15896107 ] DjvuLee commented on SPARK-19823:

[~zsxwing] Could you take a look?

> Support Gang Distribution of Task
> Key: SPARK-19823
> URL: https://issues.apache.org/jira/browse/SPARK-19823
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Spark Core
> Affects Versions: 2.0.2
> Reporter: DjvuLee

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19823) Support Gang Distribution of Task
[ https://issues.apache.org/jira/browse/SPARK-19823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896096#comment-15896096 ] DjvuLee edited comment on SPARK-19823 at 3/5/17 7:19 AM:

When Spark distributes tasks to Executors, it uses a round-robin strategy: tasks are spread across as many Executors as possible. This is a good strategy in most cases. However, with dynamic allocation and concurrent jobs, users may want gang distribution for a single job, i.e. packing that job's tasks onto a few Executors so the unused Executors can be released. With the current strategy, Executors may never get released, because a long-running stage will place at least one task on every Executor even when each Executor has spare cores. We can offer a configuration option for users, and it would not introduce much complexity. We only need to modify this line:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L340

val shuffledOffers = shuffleOffers(filteredOffers)
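One way the suggested change around that line could look, purely as a sketch and not an actual Spark patch: instead of shuffling the offers, sort them so that already-busy Executors are offered first, packing the job's tasks onto as few Executors as possible. The configuration name `spark.scheduler.packTasks` below is hypothetical.

```scala
// Hypothetical sketch inside TaskSchedulerImpl.resourceOffers: when a made-up
// "spark.scheduler.packTasks" option is on, offer the Executors with the
// fewest free cores first, so tasks pack onto busy Executors and idle ones
// can be reclaimed by dynamic allocation. Otherwise keep the current shuffle.
val packTasks = conf.getBoolean("spark.scheduler.packTasks", defaultValue = false)
val shuffledOffers =
  if (packTasks) {
    // WorkerOffer.cores is the number of free cores on that Executor;
    // sorting ascending puts the busiest Executors at the front.
    filteredOffers.sortBy(_.cores)
  } else {
    shuffleOffers(filteredOffers)
  }
```

A sorted order like this trades the load-balancing benefit of the random shuffle for faster Executor release, which is why it would have to stay opt-in.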
[jira] [Commented] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one
[ https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896100#comment-15896100 ] zhuo bao commented on SPARK-13156:

I had the same problem, but I found that the problem is with

val upperBound = (numPartitions-1).toLong

Since upperBound should be EXCLUSIVE, the value here should be numPartitions.toLong. With that change, all threads/partitions are populated with records. Note: please verify your record count at the end.

> JDBC using multiple partitions creates additional tasks but only executes on one
> Key: SPARK-13156
> URL: https://issues.apache.org/jira/browse/SPARK-13156
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 1.5.0
> Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client
> Reporter: Charles Drotar
>
> I can successfully kick off a query through JDBC to Teradata, and when it
> runs it creates a task on each executor for every partition. The problem is
> that all of the tasks except for one complete within a couple seconds and the
> final task handles the entire dataset.
> Example Code:
> private val properties = new java.util.Properties()
> properties.setProperty("driver","com.teradata.jdbc.TeraDriver")
> properties.setProperty("username","foo")
> properties.setProperty("password","bar")
> val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
> val numPartitions = 5
> val dbTableTemp = "( SELECT id MOD $numPartitions%d AS modulo, id FROM db.table) AS TEMP_TABLE"
> val partitionColumn = "modulo"
> val lowerBound = 0.toLong
> val upperBound = (numPartitions-1).toLong
> val df = sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties)
> df.write.parquet("/output/path/for/df/")
> When I look at the Spark UI I see the 5 tasks, but only 1 is actually querying.
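The skew the commenter describes follows from how the partition bounds become WHERE-clause ranges. As a rough sketch (not Spark's exact implementation), the partition column is split into numPartitions ranges of width (upperBound - lowerBound) / numPartitions; the helper below is illustrative only:

```scala
// Rough sketch of JDBC range partitioning: one predicate per partition,
// derived from lowerBound/upperBound. The first partition is open below,
// the last is open above, so every row lands somewhere.
def partitionPredicates(column: String, lower: Long, upper: Long,
                        numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = lo + stride
    if (i == 0) s"$column < $hi"
    else if (i == numPartitions - 1) s"$column >= $lo"
    else s"$column >= $lo AND $column < $hi"
  }
}

// With upperBound = numPartitions - 1 = 4 the stride is (4 - 0) / 5 = 0, so
// the middle predicates are empty and the last one (`modulo >= 0`) matches
// every row: one task does all the work.
partitionPredicates("modulo", 0, 4, 5)
// With the exclusive upperBound = numPartitions = 5 the stride is 1 and each
// partition gets exactly one modulo value.
partitionPredicates("modulo", 0, 5, 5)
```

This is why the bounds only shape the partition ranges; they do not filter rows, so a too-small upperBound silently collapses the parallelism instead of failing.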
[jira] [Commented] (SPARK-19823) Support Gang Distribution of Task
[ https://issues.apache.org/jira/browse/SPARK-19823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896097#comment-15896097 ] DjvuLee commented on SPARK-19823:

If this sounds like a good idea, I will open a pull request.
[jira] [Created] (SPARK-19823) Support Gang Distribution of Task
DjvuLee created SPARK-19823:

Summary: Support Gang Distribution of Task
Key: SPARK-19823
URL: https://issues.apache.org/jira/browse/SPARK-19823
Project: Spark
Issue Type: Improvement
Components: Scheduler, Spark Core
Affects Versions: 2.0.2
Reporter: DjvuLee
[jira] [Updated] (SPARK-19798) Query returns stale results when tables are modified on other sessions
[ https://issues.apache.org/jira/browse/SPARK-19798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19798:
Component/s: (was: Spark Core) SQL

> Query returns stale results when tables are modified on other sessions
> Key: SPARK-19798
> URL: https://issues.apache.org/jira/browse/SPARK-19798
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Giambattista
>
> I observed the problem on the master branch with the Thrift server in multisession
> mode (the default), but I was able to reproduce it with spark-shell as well
> (see below for the reproduction steps).
> I observed cases where changes made in one session (table inserts, table
> renames) are not visible to other derived sessions (created with
> session.newSession).
> The problem seems to be that each session has its own
> tableRelationCache, and it does not get refreshed.
> IMO tableRelationCache should be shared in sharedState, maybe in the
> cacheManager, so that refreshing caches for data that is not session-specific
> (such as temporary tables) is centralized.
> --- Spark shell script
> val spark2 = spark.newSession
> spark.sql("CREATE TABLE test (a int) using parquet")
> spark2.sql("select * from test").show // OK returns empty
> spark.sql("select * from test").show // OK returns empty
> spark.sql("insert into TABLE test values 1,2,3")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 3,2,1
> spark.sql("create table test2 (a int) using parquet")
> spark.sql("insert into TABLE test2 values 4,5,6")
> spark2.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("alter table test rename to test3")
> spark.sql("alter table test2 rename to test")
> spark.sql("alter table test3 rename to test2")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 6,4,5
> spark2.sql("select * from test2").show // ERROR throws java.io.FileNotFoundException
> spark.sql("select * from test2").show // OK returns 3,1,2
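Until the cache is shared, a session-level workaround is to invalidate the derived session's cached relation explicitly before reading. This is a sketch against the public Spark 2.x Catalog API, assuming a live spark-shell `spark` session as in the script above; it works around the symptom, not the underlying issue:

```scala
// Workaround sketch: after another session modifies a table, refresh the
// derived session's catalog entry so its tableRelationCache is not stale.
val spark2 = spark.newSession()

spark.sql("insert into table test values 1,2,3")

// Drop spark2's cached relation metadata for `test` via the Catalog API
spark2.catalog.refreshTable("test")
spark2.sql("select * from test").show() // should now reflect the insert
```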
[jira] [Updated] (SPARK-19821) Throw out the Read-only disk information when create file for Shuffle
[ https://issues.apache.org/jira/browse/SPARK-19821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19821:
Priority: Minor (was: Major)

> Throw out the Read-only disk information when create file for Shuffle
> Key: SPARK-19821
> URL: https://issues.apache.org/jira/browse/SPARK-19821
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Affects Versions: 2.0.2
> Reporter: DjvuLee
> Priority: Minor
>
> java.io.FileNotFoundException: /data01/yarn/nmdata/usercache/tiger/appcache/application_1486364177723_1047735/blockmgr-23098754-a97a-4673-ba73-3de5e167da87/2c/shuffle_55_47_0.index.0347f74b-a9c1-473e-b81f-40be394cc00f (Input/output error)
>   at java.io.FileOutputStream.open0(Native Method)
>   at java.io.FileOutputStream.open(FileOutputStream.java:270)
>   at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>   at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>   at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:143)
>   at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:219)
>   at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:314)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
[jira] [Commented] (SPARK-19821) Throw out the Read-only disk information when create file for Shuffle
[ https://issues.apache.org/jira/browse/SPARK-19821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896079#comment-15896079 ] Shixiong Zhu commented on SPARK-19821:

This is more like a Java issue.
[jira] [Assigned] (SPARK-19822) CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.
[ https://issues.apache.org/jira/browse/SPARK-19822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19822:
Assignee: Apache Spark

> CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.
> Key: SPARK-19822
> URL: https://issues.apache.org/jira/browse/SPARK-19822
> Project: Spark
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.0.2, 2.1.0
> Reporter: Genmao Yu
> Assignee: Apache Spark
> Priority: Minor
[jira] [Assigned] (SPARK-19822) CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.
[ https://issues.apache.org/jira/browse/SPARK-19822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19822:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-19822) CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.
[ https://issues.apache.org/jira/browse/SPARK-19822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896078#comment-15896078 ] Apache Spark commented on SPARK-19822:

User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/17167
[jira] [Created] (SPARK-19822) CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.
Genmao Yu created SPARK-19822:

Summary: CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.
Key: SPARK-19822
URL: https://issues.apache.org/jira/browse/SPARK-19822
Project: Spark
Issue Type: Bug
Components: Tests
Affects Versions: 2.1.0, 2.0.2
Reporter: Genmao Yu
Priority: Minor
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896070#comment-15896070 ] DjvuLee commented on SPARK-18085:

[~vanzin] Thanks for your reply! Will your new solution be delivered as a separate jar file? I would like to try it if so. The History Server problem is an issue for our production environment now.

> Better History Server scalability for many / large applications
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, Web UI
> Affects Versions: 2.0.0
> Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
> It's a known fact that the History Server currently has some annoying issues
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues.
> I'll be attaching a document shortly describing the issues and suggesting a
> path to how to solve them.
[jira] [Commented] (SPARK-19821) Throw out the Read-only disk information when create file for Shuffle
[ https://issues.apache.org/jira/browse/SPARK-19821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896067#comment-15896067 ] DjvuLee commented on SPARK-19821:

Currently, when the disk is read-only, we just throw a FileNotFoundException. We could do better by reporting that the disk is read-only, and maybe we could then achieve better fault tolerance for a single task.
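The improvement the comment asks for could look roughly like the following. This is only a sketch of the idea, not Spark's code; the helper name and its placement are hypothetical, and it uses only standard java.io APIs:

```scala
import java.io.{File, FileNotFoundException, FileOutputStream, IOException}

// Hypothetical helper: open a file for writing, but if the open fails and the
// parent directory turns out to be unwritable, rethrow with a read-only-disk
// hint instead of a bare FileNotFoundException.
def openForWrite(file: File): FileOutputStream = {
  try {
    new FileOutputStream(file)
  } catch {
    case e: FileNotFoundException =>
      val dir = file.getParentFile
      if (dir != null && dir.exists() && !dir.canWrite) {
        throw new IOException(
          s"Failed to create ${file.getPath}: directory ${dir.getPath} is not " +
            "writable (the disk may have been remounted read-only)", e)
      }
      throw e
  }
}
```

A clearer exception like this would also let the task-failure handler distinguish a bad local disk (retry on another Executor) from a genuinely missing file.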
[jira] [Updated] (SPARK-19821) Throw out the Read-only disk information when create file for Shuffle
[ https://issues.apache.org/jira/browse/SPARK-19821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DjvuLee updated SPARK-19821:

Description:
java.io.FileNotFoundException: /data01/yarn/nmdata/usercache/tiger/appcache/application_1486364177723_1047735/blockmgr-23098754-a97a-4673-ba73-3de5e167da87/2c/shuffle_55_47_0.index.0347f74b-a9c1-473e-b81f-40be394cc00f (Input/output error)
  at java.io.FileOutputStream.open0(Native Method)
  at java.io.FileOutputStream.open(FileOutputStream.java:270)
  at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
  at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
  at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:143)
  at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:219)
  at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:314)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
[jira] [Created] (SPARK-19821) Throw out the Read-only disk information when create file for Shuffle
DjvuLee created SPARK-19821:

Summary: Throw out the Read-only disk information when create file for Shuffle
Key: SPARK-19821
URL: https://issues.apache.org/jira/browse/SPARK-19821
Project: Spark
Issue Type: Improvement
Components: Shuffle, Spark Core
Affects Versions: 2.0.2
Reporter: DjvuLee
[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly
[ https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896061#comment-15896061 ] gagan taneja commented on SPARK-19145: -- I am suggesting the following changes: introduce a function to test for a perfect cast, similar to
{code}
private def perfectCast(expr: Literal, dataType: DataType): Boolean = {
  expr match {
    case Literal(value, StringType) =>
      scala.util.Try { Cast(expr, dataType).eval(null) }.isSuccess
    case _ => false
  }
}
{code}
and make the string promotion conditional on whether the input string can be perfectly cast:
{code}
// We should cast all relative timestamp/date/string comparisons into string comparisons.
// This behaves as a user would expect because timestamp strings sort lexicographically,
// i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true.
// For cases where it is an exact cast, we should instead cast the String type to a timestamp;
// this would speed up the execution,
// i.e. TimeStamp(2013-01-01 00:00 ...) < '2017-01-02 19:53:51' would translate to
// TimeStamp(2013-01-01 00:00 ...) < Timestamp(2017-01-02 19:53:51)
case p @ BinaryComparison(left @ Literal(_, StringType), right @ DateType())
    if acceptedDataTypes.contains(right) && perfectCast(left, right.dataType) =>
  p.makeCopy(Array(Cast(left, right.dataType), right))
case p @ BinaryComparison(left @ StringType(), right @ DateType())
    if acceptedDataTypes.contains(right) =>
  p.makeCopy(Array(left, Cast(right, StringType)))
case p @ BinaryComparison(left @ DateType(), right @ Literal(_, StringType))
    if acceptedDataTypes.contains(left) && perfectCast(right, left.dataType) =>
  p.makeCopy(Array(left, Cast(right, left.dataType)))
case p @ BinaryComparison(left @ DateType(), right @ StringType())
    if acceptedDataTypes.contains(left) =>
  p.makeCopy(Array(Cast(left, StringType), right))
{code}
> Timestamp to String casting is slowing the query significantly
> --
>
> Key: SPARK-19145
> URL: https://issues.apache.org/jira/browse/SPARK-19145
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: gagan taneja
>
> I have a time series table with a timestamp column
> The following query
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> is significantly SLOWER than
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD HH24:MI:SS−0800')
> AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD HH24:MI:SS−0800') LIMIT 5
> After investigation I found that in the first query the time column is cast to
> String before applying the filter
> However, in the second query no such casting is performed and it is a filter
> with a long value
> Below is the generated physical plan for the slower execution, followed by the
> physical plan for the faster execution
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> == Physical Plan ==
>
CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
>                +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct
>
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD HH24:MI:SS−0800')
> AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD HH24:MI:SS−0800') LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3262) && (time#3262 >= 148340483100)) && (time#3262 <= 148400963100))
>                +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/
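As a standalone illustration of the "perfect cast" test proposed above (a hypothetical JDK-only stand-in, not the Catalyst code): a string literal qualifies for promotion to a typed comparison only when it parses cleanly, and otherwise the existing string-comparison fallback remains the safe default.
{code:scala}
// Hypothetical stand-in for perfectCast: a strict parse attempt decides
// whether the literal could be cast to a timestamp without loss.
def parsesAsTimestamp(s: String): Boolean = {
  val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  fmt.setLenient(false)
  scala.util.Try(fmt.parse(s)).isSuccess
}

parsesAsTimestamp("2017-01-02 19:53:51")  // true: eligible for the fast path
parsesAsTimestamp("2014")                 // false: keep the string comparison
{code}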
[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly
[ https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896046#comment-15896046 ] gagan taneja commented on SPARK-19145: -- 17/03/04 19:05:32 TRACE HiveSessionState$$anon$1: === Applying Rule org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings === Before 'Project [*] !+- 'Filter (time#88 >= 2017-01-02 19:53:51) +- SubqueryAlias person_view, `rule_test`.`person_view` +- Project [gen_attr_0#83 AS name#86, gen_attr_1#84 AS age#87, gen_attr_2#85 AS time#88] +- SubqueryAlias person_table +- Project [gen_attr_0#83, gen_attr_1#84, gen_attr_2#85] +- Filter (gen_attr_1#84 > 10) +- SubqueryAlias gen_subquery_0 +- Project [name#89 AS gen_attr_0#83, age#90 AS gen_attr_1#84, time#91 AS gen_attr_2#85] +- MetastoreRelation rule_test, person_table After 'Project [*] +- Filter (cast(time#88 as string) >= 2017-01-02 19:53:51) +- SubqueryAlias person_view, `rule_test`.`person_view` +- Project [gen_attr_0#83 AS name#86, gen_attr_1#84 AS age#87, gen_attr_2#85 AS time#88] +- SubqueryAlias person_table +- Project [gen_attr_0#83, gen_attr_1#84, gen_attr_2#85] +- Filter (gen_attr_1#84 > 10) +- SubqueryAlias gen_subquery +- Project [name#89 AS gen_attr_0#83, age#90 AS gen_attr_1#84, time#91 AS gen_attr_2#85] +- MetastoreRelation rule_test, person_table > Timestamp to String casting is slowing the query significantly > -- > > Key: SPARK-19145 > URL: https://issues.apache.org/jira/browse/SPARK-19145 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: gagan taneja > > i have a time series table with timestamp column > Following query > SELECT COUNT(*) AS `count` >FROM `default`.`table` >WHERE `time` >= '2017-01-02 19:53:51' > AND `time` <= '2017-01-09 19:53:51' LIMIT 5 > is significantly SLOWER than > SELECT COUNT(*) AS `count` > FROM `default`.`table` > WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD > HH24:MI:SS−0800') > AND `time` <= 
to_utc_timestamp('2017-01-09 19:53:51','-MM-DD > HH24:MI:SS−0800') LIMIT 5 > After investigation i found that in the first query time colum is cast to > String before applying the filter > However in the second query no such casting is performed and its a filter > with long value > Below are the generate Physical plan for slower execution followed by > physical plan for faster execution > SELECT COUNT(*) AS `count` >FROM `default`.`table` >WHERE `time` >= '2017-01-02 19:53:51' > AND `time` <= '2017-01-09 19:53:51' LIMIT 5 > == Physical Plan == > CollectLimit 5 > +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L]) >+- Exchange SinglePartition > +- *HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#3339L]) > +- *Project > +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) > >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 > 19:53:51)) >+- *FileScan parquet default.cstat[time#3314] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], > PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: > struct > SELECT COUNT(*) AS `count` > FROM `default`.`table` > WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD > HH24:MI:SS−0800') > AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD > HH24:MI:SS−0800') LIMIT 5 > == Physical Plan == > CollectLimit 5 > +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L]) >+- Exchange SinglePartition > +- *HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#3287L]) > +- *Project > +- *Filt
[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly
[ https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896040#comment-15896040 ] gagan taneja commented on SPARK-19145: -- The code responsible for this belongs to the class below, which performs the type coercion. Although this is logically correct, it also slows down binary comparisons, because during execution the values are cast to String and the comparison is done with string operators, resulting in 10x slower performance:
{code}
org.apache.spark.sql.catalyst.analysis.TypeCoercion {
  ...
  // We should cast all relative timestamp/date/string comparison into string comparisons
  // This behaves as a user would expect because timestamp strings sort lexicographically.
  // i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true
  case p @ BinaryComparison(left @ StringType(), right @ DateType()) =>
    p.makeCopy(Array(left, Cast(right, StringType)))
  case p @ BinaryComparison(left @ DateType(), right @ StringType()) =>
    p.makeCopy(Array(Cast(left, StringType), right))
  case p @ BinaryComparison(left @ StringType(), right @ TimestampType()) =>
    p.makeCopy(Array(left, Cast(right, StringType)))
  case p @ BinaryComparison(left @ TimestampType(), right @ StringType()) =>
    p.makeCopy(Array(Cast(left, StringType), right))
  // Comparisons between dates and timestamps.
  case p @ BinaryComparison(left @ TimestampType(), right @ DateType()) =>
    p.makeCopy(Array(Cast(left, StringType), Cast(right, StringType)))
  case p @ BinaryComparison(left @ DateType(), right @ TimestampType()) =>
    p.makeCopy(Array(Cast(left, StringType), Cast(right, StringType)))
{code}
> Timestamp to String casting is slowing the query significantly
> --
>
> Key: SPARK-19145
> URL: https://issues.apache.org/jira/browse/SPARK-19145
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: gagan taneja
>
> I have a time series table with a timestamp column
> The following query
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> is significantly SLOWER than
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD HH24:MI:SS−0800')
> AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD HH24:MI:SS−0800') LIMIT 5
> After investigation I found that in the first query the time column is cast to
> String before applying the filter
> However, in the second query no such casting is performed and it is a filter
> with a long value
> Below is the generated physical plan for the slower execution, followed by the
> physical plan for the faster execution
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
>                +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location:
InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], > PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: > struct > SELECT COUNT(*) AS `count` > FROM `default`.`table` > WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD > HH24:MI:SS−0800') > AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD > HH24:MI:SS−0800') LIMIT 5 > == Physical Plan == > CollectLimit 5 > +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L]) >+- Exchange SinglePartition > +- *HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#3287L]) > +- *Project > +- *Filter ((isnotnull(time#3262) && (time#3262 >= > 148340483100)) && (time#3262 <= 148400963100)) >+- *FileScan parquet default.cstat[time#3262] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], > PartitionFilters: [], PushedFilters: [IsNotNull(time), > GreaterThanOrEqual(time,2017-01-02 19:53:51.0), > LessThanOrEqual(time,2017-01-09..., ReadSchema: struct > In Impala both query run efficiently without and performance difference > Spark should be able to parse the Date string and conv
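For context on why the PromoteStrings behaviour quoted above is semantically safe (though slow): the lexicographic order of fixed-width timestamp strings agrees with chronological order, so comparing as strings gives the right answer at a much higher per-row cost than a primitive long comparison. A quick standalone check, illustrative only:
{code:scala}
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val a = "2017-01-02 19:53:51"
val b = "2017-01-09 19:53:51"
// String order and epoch-millisecond order agree for this fixed-width format.
assert((a < b) == (fmt.parse(a).getTime < fmt.parse(b).getTime))
{code}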
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896033#comment-15896033 ] Daniel Li commented on SPARK-6407: -- Appreciate the quick reply, [~srowen]. Yeah, we'd be recomputing them, but not from scratch, since we'd be starting with optimized _U_ and _V_. It would likely take only one or two iterations before reconvergence. Would this still be considered too expensive? The thing I hesitate about regarding fold-in updating is that the assumption that only the corresponding user row and item row will change may be too simplifying (since, of course, there's a "rippling out" effect: all items the user previously rated need to be updated, then all users that rated any of those items would need updating, etc.). Then again, even if we take this rippling into account, the computation may not be too expensive, since a single update likely won't affect the RMSE enough to delay convergence. (Though I haven't worked out the math showing this; it's just a hunch.) Do you have any insights into this?
> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
> Issue Type: New Feature
> Components: DStreams
> Reporter: Felix Cheung
> Priority: Minor
>
> Like MLlib's ALS implementation for recommendation, but applied to streaming.
> Similar to streaming linear regression and logistic regression, could we apply
> gradient updates to batches of data and reuse the existing MLlib implementation?
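The fold-in update being discussed can be sketched as a single regularized least-squares solve for the affected user row, holding the item factors fixed; the rank-2 toy solver below is illustrative only, not MLlib code.
{code:scala}
// Fold-in sketch: with item factors V fixed, absorb a user's rating vector r
// by solving (V^T V + lambda*I) u = V^T r for that user's factor row alone,
// instead of re-running full ALS.
def foldInUser(v: Array[Array[Double]], r: Array[Double], lambda: Double): Array[Double] = {
  val k = 2 // toy rank; a real implementation would use a Cholesky solve for general k
  val a = Array.ofDim[Double](k, k)
  val b = Array.ofDim[Double](k)
  for (i <- v.indices; p <- 0 until k) {
    b(p) += v(i)(p) * r(i)
    for (q <- 0 until k) a(p)(q) += v(i)(p) * v(i)(q)
  }
  for (p <- 0 until k) a(p)(p) += lambda
  val det = a(0)(0) * a(1)(1) - a(0)(1) * a(1)(0) // Cramer's rule at rank 2
  Array((b(0) * a(1)(1) - b(1) * a(0)(1)) / det,
        (b(1) * a(0)(0) - b(0) * a(1)(0)) / det)
}
{code}
The "rippling out" concern then amounts to repeating such solves for the neighbouring item and user rows, which is where the cost question above comes in.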
[jira] [Commented] (SPARK-19541) High Availability support for ThriftServer
[ https://issues.apache.org/jira/browse/SPARK-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896032#comment-15896032 ] gagan taneja commented on SPARK-19541: -- This would be a great improvement, as we are also leveraging Thriftserver for JDBC/ODBC connectivity
> High Availability support for ThriftServer
> --
>
> Key: SPARK-19541
> URL: https://issues.apache.org/jira/browse/SPARK-19541
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.0.2, 2.1.0
> Reporter: LvDongrong
>
> Currently, we use the Spark ThriftServer frequently, and there are many connections between the clients and the single ThriftServer. When the ThriftServer is down, we cannot get the service again until the server is restarted, so we need to consider ThriftServer HA just as HiveServer HA is. For ThriftServer, we want to adopt the HiveServer HA pattern to provide ThriftServer HA. Therefore, we need to start multiple thrift servers which register in ZooKeeper; the client can then find a thrift server by just connecting to ZooKeeper, so beeline can get the service from another thrift server when one is down.
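For reference, the HiveServer2-style dynamic service discovery being proposed lets a client resolve a live instance through ZooKeeper rather than a fixed host; the hosts and namespace below are placeholders, not a shipped Spark feature:
{code}
beeline -u "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkThriftServer"
{code}
Each server instance registers an ephemeral znode under the namespace, so a crashed instance drops out of discovery automatically.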
[jira] [Updated] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-19705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gagan taneja updated SPARK-19705: - Shepherd: Herman van Hovell
> Preferred location supporting HDFS Cache for FileScanRDD
>
>
> Key: SPARK-19705
> URL: https://issues.apache.org/jira/browse/SPARK-19705
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: gagan taneja
>
> Although NewHadoopRDD and HadoopRDD consider the HDFS cache while calculating
> preferredLocations, FileScanRDD does not take the HDFS cache into account while
> calculating preferredLocations
> The enhancement can be easily implemented for large files where a FilePartition
> contains only a single HDFS file
> The enhancement will also result in a significant performance improvement for
> cached HDFS partitions
[jira] [Commented] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-19705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896026#comment-15896026 ] gagan taneja commented on SPARK-19705: -- Herman, can you help me with this enhancement? Thanks
> Preferred location supporting HDFS Cache for FileScanRDD
>
>
> Key: SPARK-19705
> URL: https://issues.apache.org/jira/browse/SPARK-19705
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: gagan taneja
>
> Although NewHadoopRDD and HadoopRDD consider the HDFS cache while calculating
> preferredLocations, FileScanRDD does not take the HDFS cache into account while
> calculating preferredLocations
> The enhancement can be easily implemented for large files where a FilePartition
> contains only a single HDFS file
> The enhancement will also result in a significant performance improvement for
> cached HDFS partitions
[jira] [Comment Edited] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895998#comment-15895998 ] Eric Maynard edited comment on SPARK-19656 at 3/5/17 12:58 AM: --- Normally after getting the `datum` you should call `asInstanceOf` to cast it properly. In any event, in Spark 2.0 the easier way to achieve what you want is probably something like this:
{code:java}
import com.databricks.spark.avro._
val df = spark.read.avro("file.avro")
val extracted = df.map(row => row(0).asInstanceOf[MyCustomClass])
{code}
was (Author: emaynard): Normally after getting the `datum` you should call `asInstanceOf` to cast it properly. In any event, in Spark 2.0 the easier way to achieve what you want is probably something like this:
{code:scala}
import com.databricks.spark.avro._
val df = spark.read.avro("file.avro")
val extracted = df.map(row => row(0).asInstanceOf[MyCustomClass])
{code}
> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
> Issue Type: Question
> Components: Java API
> Affects Versions: 2.0.2
> Reporter: Nira Amit
>
> If I understand correctly, in Scala it's possible to load custom objects from
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a Scala developer, so I tried to "translate" this to Java as best I
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass> {};
> public static class MyCustomAvroReader extends
>     AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
>   // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends
>     FileInputFormat<MyCustomAvroKey, NullWritable> {
>   @Override
>   public RecordReader<MyCustomAvroKey, NullWritable>
>       createRecordReader(InputSplit inputSplit,
>       TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
>     return new MyCustomAvroReader();
>   }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>     sc.newAPIHadoopFile("file:/path/to/datafile.avro",
>         MyCustomInputFormat.class, MyCustomAvroKey.class,
>         NullWritable.class,
>         sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " +
>     first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()`
> actually returns a `GenericData$Record` at runtime, not a `MyCustomClass`
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895998#comment-15895998 ] Eric Maynard commented on SPARK-19656: -- Normally after getting the `datum` you should call `asInstanceOf` to cast it properly. In any event, in Spark 2.0 the easier way to achieve what you want is probably something like this:
{code:scala}
import com.databricks.spark.avro._
val df = spark.read.avro("file.avro")
val extracted = df.map(row => row(0).asInstanceOf[MyCustomClass])
{code}
> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
> Issue Type: Question
> Components: Java API
> Affects Versions: 2.0.2
> Reporter: Nira Amit
>
> If I understand correctly, in Scala it's possible to load custom objects from
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a Scala developer, so I tried to "translate" this to Java as best I
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass> {};
> public static class MyCustomAvroReader extends
>     AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
>   // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends
>     FileInputFormat<MyCustomAvroKey, NullWritable> {
>   @Override
>   public RecordReader<MyCustomAvroKey, NullWritable>
>       createRecordReader(InputSplit inputSplit,
>       TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
>     return new MyCustomAvroReader();
>   }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>     sc.newAPIHadoopFile("file:/path/to/datafile.avro",
>         MyCustomInputFormat.class, MyCustomAvroKey.class,
>         NullWritable.class,
>         sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " +
>     first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()`
> actually returns a `GenericData$Record` at runtime, not a `MyCustomClass`
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?
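One plausible cause of the GenericData$Record result above (an assumption about the reporter's setup, not a verified fix): avro-mapred falls back to the generic representation when no reader schema is registered for the job. A sketch of registering the generated class's schema, assuming MyCustomClass was produced by the Avro compiler:
{code:java}
// Hedged sketch: publish the specific reader schema ("avro.schema.input.key"
// is the property AvroJob.setInputKeySchema writes) so that a reader built on
// it can deserialize to MyCustomClass rather than GenericData.Record.
Configuration conf = sc.hadoopConfiguration();
conf.set("avro.schema.input.key", MyCustomClass.getClassSchema().toString());
JavaPairRDD<MyCustomAvroKey, NullWritable> records =
    sc.newAPIHadoopFile("file:/path/to/datafile.avro",
        MyCustomInputFormat.class, MyCustomAvroKey.class,
        NullWritable.class, conf);
{code}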
[jira] [Commented] (SPARK-19713) saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895991#comment-15895991 ] Eric Maynard commented on SPARK-19713: -- In general, instead of using `DataFrameWriter.saveAsTable`, I find it's better to create the table in advance and then insert data into it with `DataFrameWriter.insertInto`. If you choose to use `DataFrameWriter.saveAsTable` there is a chance of the folder being created and the Hive table not being updated, but as a developer you can handle these errors with `HiveContext.refreshTable` or by using `FileSystem.delete`. I think this is not an issue.
> saveAsTable
> ---
>
> Key: SPARK-19713
> URL: https://issues.apache.org/jira/browse/SPARK-19713
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1
> Reporter: Balaram R Gadiraju
>
> Hi,
> I just observed that when we use dataframe.saveAsTable("table") -- in old versions --
> or dataframe.write.saveAsTable("table") -- in the newer versions --
> i.e. the method "df3.saveAsTable("brokentable")" in
> Scala code, this creates a folder in HDFS but doesn't tell the hive-metastore
> that it plans to create the table. So if anything goes wrong in between, the
> folder still exists and Hive is not aware of the folder creation. This will
> block users from creating the table "brokentable" as the folder already
> exists; we can remove the folder using "hadoop fs -rmr
> /data/hive/databases/testdb.db/brokentable". So below is the workaround
> which will enable you to continue the development work.
> Current Code:
> val df3 = sqlContext.sql("select * from testtable")
> df3.saveAsTable("brokentable")
> THE WORKAROUND:
> Registering the DataFrame as a table and then using a SQL command to load the
> data resolves the issue.
> EX:
> sqlContext.sql("select * from testtable").registerTempTable("df3")
> sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")
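The create-first pattern recommended in the comment can be sketched as follows (table and column names are illustrative only):
{code:scala}
// Creating the table through the metastore before writing means a failed
// write cannot leave an orphaned HDFS folder with no metastore entry.
sqlContext.sql("CREATE TABLE IF NOT EXISTS brokentable (id INT, name STRING)")
sqlContext.sql("select * from testtable").write.insertInto("brokentable")
{code}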
[jira] [Assigned] (SPARK-19820) Allow reason to be specified for task kill
[ https://issues.apache.org/jira/browse/SPARK-19820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19820: Assignee: (was: Apache Spark) > Allow reason to be specified for task kill > -- > > Key: SPARK-19820 > URL: https://issues.apache.org/jira/browse/SPARK-19820 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Minor >
[jira] [Commented] (SPARK-19820) Allow reason to be specified for task kill
[ https://issues.apache.org/jira/browse/SPARK-19820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895954#comment-15895954 ] Apache Spark commented on SPARK-19820: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/17166 > Allow reason to be specified for task kill > -- > > Key: SPARK-19820 > URL: https://issues.apache.org/jira/browse/SPARK-19820 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Minor >
[jira] [Assigned] (SPARK-19820) Allow reason to be specified for task kill
[ https://issues.apache.org/jira/browse/SPARK-19820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19820: Assignee: Apache Spark > Allow reason to be specified for task kill > -- > > Key: SPARK-19820 > URL: https://issues.apache.org/jira/browse/SPARK-19820 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Minor >
[jira] [Created] (SPARK-19820) Allow reason to be specified for task kill
Eric Liang created SPARK-19820: -- Summary: Allow reason to be specified for task kill Key: SPARK-19820 URL: https://issues.apache.org/jira/browse/SPARK-19820 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0 Reporter: Eric Liang Priority: Minor
[jira] [Commented] (SPARK-8556) Beeline script throws ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895898#comment-15895898 ] Arvind Surve commented on SPARK-8556: - Hi Cheng, Would you mind sharing configuration issue you had? -Arvind > Beeline script throws ClassNotFoundException > > > Key: SPARK-8556 > URL: https://issues.apache.org/jira/browse/SPARK-8556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Priority: Blocker > > 1.5.0-SNAPSHOT, commit 1dfb0f7b2aed5ee6d07543fdeac8ff7c777b63b9 > Build Spark with: > {noformat} > $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 > {noformat} > Start HiveThriftServer2 with: > {noformat} > $ ./sbin/start-thriftserver.sh > {noformat} > Run Beeline and quit immediately: > {noformat} > $ ./bin/beeline -u jdbc:hive2://localhost:1 > Connecting to jdbc:hive2://localhost:1 > org/apache/hive/service/cli/thrift/TCLIService$Iface > Beeline version 1.5.0-SNAPSHOT by Apache Hive > 0: jdbc:hive2://localhost:1> Exception in thread "main" > java.lang.NoClassDefFoundError: > org/apache/hive/service/cli/thrift/TCLIService$Iface > at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) > at java.sql.DriverManager.getConnection(DriverManager.java:664) > at java.sql.DriverManager.getConnection(DriverManager.java:208) > at > org.apache.hive.beeline.DatabaseConnection.connect(DatabaseConnection.java:145) > at > org.apache.hive.beeline.DatabaseConnection.getConnection(DatabaseConnection.java:186) > at org.apache.hive.beeline.Commands.close(Commands.java:802) > at org.apache.hive.beeline.Commands.closeall(Commands.java:784) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:673) > at > org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:368) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:351) > Caused by: java.lang.ClassNotFoundException: > org.apache.hive.service.cli.thrift.TCLIService$Iface 
> at java.net.URLClassLoader$1.run(URLClassLoader.java:372) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:360) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 10 more > {noformat}
[jira] [Commented] (SPARK-16844) Generate code for sort based aggregation
[ https://issues.apache.org/jira/browse/SPARK-16844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895762#comment-15895762 ] Apache Spark commented on SPARK-16844: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/17164 > Generate code for sort based aggregation > > > Key: SPARK-16844 > URL: https://issues.apache.org/jira/browse/SPARK-16844 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: yucai >
[jira] [Commented] (SPARK-16617) Upgrade to Avro 1.8.x
[ https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895756#comment-15895756 ] Apache Spark commented on SPARK-16617: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/17163 > Upgrade to Avro 1.8.x > - > > Key: SPARK-16617 > URL: https://issues.apache.org/jira/browse/SPARK-16617 > Project: Spark > Issue Type: Improvement >Reporter: Ben McCann > > Avro 1.8 makes Avro objects serializable so that you can easily have an RDD > containing Avro objects. > See https://issues.apache.org/jira/browse/AVRO-1502
[jira] [Commented] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()
[ https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895740#comment-15895740 ] Kazuaki Ishizaki commented on SPARK-19503: -- Would it be better to control whether we prune local and global sorts with an option (property) in SQLConf? > Execution Plan Optimizer: avoid sort or shuffle when it does not change end > result such as df.sort(...).count() > --- > > Key: SPARK-19503 > URL: https://issues.apache.org/jira/browse/SPARK-19503 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 > Environment: Perhaps only a pyspark or databricks AWS issue >Reporter: R >Priority: Minor > Labels: execution, optimizer, plan, query > > df.sort(...).count() > performs a shuffle and sort and then a count! This is wasteful, as the sort is not > required here, and it makes me wonder how smart the algebraic optimiser really is. > The data may be partitioned by known count (such as parquet files), and we should > not shuffle just to perform a count. > This may look trivial, but if the optimiser fails to recognise this, I wonder > what else it is missing, especially in more complex operations.
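The rewrite the reporter is asking for can be illustrated outside Spark. The following is a minimal sketch, not Catalyst's actual API: all plan classes and the rule name are hypothetical. It shows why `Count(Sort(x))` can be rewritten to `Count(x)` — ordering never changes the number of rows.

```python
# Toy logical-plan tree with a single rewrite rule (hypothetical names,
# not Spark's Catalyst classes).

class Scan:
    """Leaf node: produces a fixed list of rows."""
    def __init__(self, rows):
        self.rows = rows
    def execute(self):
        return list(self.rows)

class Sort:
    """Sorts the child's output by a key function."""
    def __init__(self, child, key):
        self.child, self.key = child, key
    def execute(self):
        return sorted(self.child.execute(), key=self.key)

class Count:
    """Counts the child's rows; the result is order-insensitive."""
    def __init__(self, child):
        self.child = child
    def execute(self):
        return len(self.child.execute())

def prune_sort_under_count(plan):
    """Rewrite Count(Sort(x)) -> Count(x), skipping the useless sort."""
    if isinstance(plan, Count) and isinstance(plan.child, Sort):
        return Count(plan.child.child)
    return plan

plan = Count(Sort(Scan([3, 1, 2]), key=lambda r: r))
optimized = prune_sort_under_count(plan)
# The optimized plan avoids the sort but returns the same count.
assert optimized.execute() == plan.execute() == 3
```

A real rule would also have to check that nothing between the sort and the aggregate (e.g. a limit) depends on the ordering, which is presumably why Ishizaki suggests gating it behind a SQLConf option.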
[jira] [Commented] (SPARK-19550) Remove reflection, docs, build elements related to Java 7
[ https://issues.apache.org/jira/browse/SPARK-19550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895677#comment-15895677 ] Apache Spark commented on SPARK-19550: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/17162 > Remove reflection, docs, build elements related to Java 7 > - > > Key: SPARK-19550 > URL: https://issues.apache.org/jira/browse/SPARK-19550 > Project: Spark > Issue Type: Sub-task > Components: Build, Documentation, Spark Core >Affects Versions: 2.2.0 >Reporter: Sean Owen >Assignee: Sean Owen > Fix For: 2.2.0 > > > - Move external/java8-tests tests into core, streaming, sql and remove > - Remove MaxPermGen and related options > - Fix some reflection / TODOs around Java 8+ methods > - Update doc references to 1.7/1.8 differences > - Remove Java 7/8 related build profiles > - Update some plugins for better Java 8 compatibility > - Fix a few Java-related warnings
[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895629#comment-15895629 ] Nick Pentreath commented on SPARK-7146: --- Personally I support DeveloperApi - these are going to be used by developers of extensions & custom pipeline components, and I think we can make a best effort not to break things without needing to absolutely guarantee it. > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Proposal: Make most of the Param traits in sharedParams.scala public. Mark > them as DeveloperApi. > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private. > h3. UPDATED proposal > * Some Params are clearly safe to make public. We will do so. > * Some Params could be made public but may require caveats in the trait doc. > * Some Params have turned out not to be shared in practice. We can move > those Params to the classes which use them.
> *Public shared params*: > * I/O column params > ** HasFeaturesCol > ** HasInputCol > ** HasInputCols > ** HasLabelCol > ** HasOutputCol > ** HasPredictionCol > ** HasProbabilityCol > ** HasRawPredictionCol > ** HasVarianceCol > ** HasWeightCol > * Algorithm settings > ** HasCheckpointInterval > ** HasElasticNetParam > ** HasFitIntercept > ** HasMaxIter > ** HasRegParam > ** HasSeed > ** HasStandardization (less common) > ** HasStepSize > ** HasTol > *Questionable params*: > * HasHandleInvalid (only used in StringIndexer, but might be more widely used > later on) > * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but > same meaning as Optimizer in LDA) > *Params to be removed from sharedParams*: > * HasThreshold (only used in LogisticRegression) > * HasThresholds (only used in ProbabilisticClassifier)
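The shared-param-trait pattern the proposal describes can be sketched with mixins. This is an illustration only: the class bodies below are hypothetical and deliberately tiny, not Spark's actual `Params` machinery; only the `HasMaxIter`/`HasInputCol` names come from the list above.

```python
# Mixin sketch of shared params: each "Has*" class contributes one
# standardized parameter name plus its documentation, and an estimator
# mixes in only the params it supports.

class Params:
    """Minimal param store (stand-in for Spark's Params base)."""
    def __init__(self):
        self._params = {}
    def set(self, name, value):
        self._params[name] = value
        return self  # allow chaining, like Spark's setters
    def get(self, name):
        return self._params[name]

class HasMaxIter(Params):
    """Shared param: maximum number of iterations (>= 0)."""
    def set_max_iter(self, value):
        return self.set("maxIter", value)

class HasInputCol(Params):
    """Shared param: name of the input column."""
    def set_input_col(self, value):
        return self.set("inputCol", value)

# A custom pipeline component picks up standardized names by mixing in traits.
class ToyEstimator(HasMaxIter, HasInputCol):
    pass

est = ToyEstimator().set_max_iter(10).set_input_col("features")
assert est.get("maxIter") == 10
assert est.get("inputCol") == "features"
```

This also shows the "con" in the proposal: once the traits are public, third-party code can test `isinstance(est, HasMaxIter)`, which is exactly the kind of reliance a DeveloperApi caveat is meant to hedge.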
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895619#comment-15895619 ] Danilo Ascione commented on SPARK-14409: I can help with both PRs. Please consider that the solution in [PR 16618|https://github.com/apache/spark/pull/16618] is a DataFrame-API-based version of the one in [PR 12461|https://github.com/apache/spark/pull/12461]. Anyway, I'd be happy to help review an alternative solution. Thanks! > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.
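For readers unfamiliar with what a ranking evaluator computes, here is one representative top-k metric, precision@k, in plain Python. The function name and signature are illustrative, not the proposed RankingEvaluator API.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are in the relevant set.

    recommended: ordered list of item ids, best-first.
    relevant: set of item ids the user actually interacted with.
    """
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)

recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
assert precision_at_k(recommended, relevant, 2) == 0.5  # "a" hit, "b" miss
assert precision_at_k(recommended, relevant, 4) == 0.5  # "a" and "c" of 4
```

Because such metrics need the full top-k list per user, they pair naturally with the "recommendAll" methods of SPARK-13857, as the description notes.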
[jira] [Resolved] (SPARK-14273) Add FileFormat.isSplittable to indicate whether a format is splittable
[ https://issues.apache.org/jira/browse/SPARK-14273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14273. --- Resolution: Duplicate > Add FileFormat.isSplittable to indicate whether a format is splittable > -- > > Key: SPARK-14273 > URL: https://issues.apache.org/jira/browse/SPARK-14273 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{FileSourceStrategy}} assumes that all data source formats are splittable > and always splits data files by fixed partition size. However, not all HDFS-based > formats are splittable. We need a flag to indicate that and ensure that > non-splittable files won't be split into multiple Spark partitions. > (PS: Is it "splitable" or "splittable"? Probably the latter one? Hadoop uses > the former one though...)
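The proposed flag is easy to sketch. The classes below are hypothetical stand-ins (not Spark's `FileFormat` API); they only illustrate why a non-splittable format, such as a gzip-compressed text file that must be decompressed from the start, needs exactly one partition per file.

```python
# Toy partition planner guarded by an is_splittable flag.

class FileFormat:
    def is_splittable(self):
        return True  # default mirrors the old assumption: always splittable

class ParquetFormat(FileFormat):
    pass  # block-structured, so splitting at arbitrary offsets is safe

class GzipTextFormat(FileFormat):
    def is_splittable(self):
        return False  # a gzip stream can only be decoded from byte 0

def plan_partitions(fmt, file_size, split_size):
    """Return (offset, length) partitions for one file of the given format."""
    if not fmt.is_splittable():
        return [(0, file_size)]  # one partition covering the whole file
    return [(off, min(split_size, file_size - off))
            for off in range(0, file_size, split_size)]

assert plan_partitions(ParquetFormat(), 100, 40) == [(0, 40), (40, 40), (80, 20)]
assert plan_partitions(GzipTextFormat(), 100, 40) == [(0, 100)]
```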
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895605#comment-15895605 ] Sean Owen commented on SPARK-6407: -- How is it different from recomputing all of U and V? Doing anything to all of the matrices is probably out of the question for an online update. The point of fold-in is to update only the two affected rows and make the simplifying assumption that nothing else changes, because it would be too expensive to recompute anything. If you mean batch together enough to make it worthwhile to update, then yes at some point that's worth it, but it just reduces to re-running the batch algorithm for a few iterations again. > Streaming ALS for Collaborative Filtering > - > > Key: SPARK-6407 > URL: https://issues.apache.org/jira/browse/SPARK-6407 > Project: Spark > Issue Type: New Feature > Components: DStreams >Reporter: Felix Cheung >Priority: Minor > > Like MLLib's ALS implementation for recommendation, and applying to streaming. > Similar to streaming linear regression, logistic regression, could we apply > gradient updates to batches of data and reuse existing MLLib implementation? 
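The "fold-in" idea in this comment can be made concrete: when one new rating arrives, refit only that user's factor row against the fixed item factors by solving the ALS normal equations (V^T V + lambda*I) u = V^T r for that user alone. The following is a rank-2 pure-Python sketch with hypothetical helper names, not Spark's ALS API.

```python
# Fold-in for one user: solve the rank-2 normal equations with item
# factors held fixed, instead of recomputing all of U and V.

def solve2(a, b, c, d, e, f):
    """Solve the 2x2 system [[a, b], [c, d]] @ x = [e, f] by Cramer's rule."""
    det = a * d - b * c
    return ((e * d - b * f) / det, (a * f - c * e) / det)

def fold_in_user(rated_items, ratings, item_factors, lam=0.1):
    """Return a rank-2 user vector fit to this user's ratings only.

    item_factors: dict mapping item id -> (v0, v1), treated as constants.
    """
    # Accumulate A = V^T V + lam*I and y = V^T r over the user's rated items.
    a = b = d = y0 = y1 = 0.0
    for item, r in zip(rated_items, ratings):
        v0, v1 = item_factors[item]
        a += v0 * v0; b += v0 * v1; d += v1 * v1
        y0 += v0 * r; y1 += v1 * r
    return solve2(a + lam, b, b, d + lam, y0, y1)

item_factors = {"x": (1.0, 0.0), "y": (0.0, 1.0)}
u = fold_in_user(["x", "y"], [4.0, 2.0], item_factors, lam=0.0)
# With orthonormal item factors and lam=0, the user vector reproduces the ratings.
assert abs(u[0] - 4.0) < 1e-9 and abs(u[1] - 2.0) < 1e-9
```

This is exactly the simplifying assumption Sean describes: one row changes, everything else is frozen, so the cost per update is a tiny linear solve rather than a batch ALS pass.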
[jira] [Assigned] (SPARK-19819) Use concrete data in SparkR DataFrame examples
[ https://issues.apache.org/jira/browse/SPARK-19819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19819: Assignee: Apache Spark > Use concrete data in SparkR DataFrame examples > --- > > Key: SPARK-19819 > URL: https://issues.apache.org/jira/browse/SPARK-19819 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Wayne Zhang >Assignee: Apache Spark >Priority: Minor > > Many examples in SparkDataFrame methods use: > {code} > path <- "path/to/file.json" > df <- read.json(path) > {code} > This is not directly runnable. Replace this with real numerical examples so > that users can directly execute the examples.
[jira] [Assigned] (SPARK-19819) Use concrete data in SparkR DataFrame examples
[ https://issues.apache.org/jira/browse/SPARK-19819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19819: Assignee: (was: Apache Spark) > Use concrete data in SparkR DataFrame examples > --- > > Key: SPARK-19819 > URL: https://issues.apache.org/jira/browse/SPARK-19819 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Wayne Zhang >Priority: Minor > > Many examples in SparkDataFrame methods use: > {code} > path <- "path/to/file.json" > df <- read.json(path) > {code} > This is not directly runnable. Replace this with real numerical examples so > that users can directly execute the examples.
[jira] [Commented] (SPARK-19819) Use concrete data in SparkR DataFrame examples
[ https://issues.apache.org/jira/browse/SPARK-19819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895599#comment-15895599 ] Apache Spark commented on SPARK-19819: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/17161 > Use concrete data in SparkR DataFrame examples > --- > > Key: SPARK-19819 > URL: https://issues.apache.org/jira/browse/SPARK-19819 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Wayne Zhang >Priority: Minor > > Many examples in SparkDataFrame methods use: > {code} > path <- "path/to/file.json" > df <- read.json(path) > {code} > This is not directly runnable. Replace this with real numerical examples so > that users can directly execute the examples.
[jira] [Created] (SPARK-19819) Use concrete data in SparkR DataFrame examples
Wayne Zhang created SPARK-19819: --- Summary: Use concrete data in SparkR DataFrame examples Key: SPARK-19819 URL: https://issues.apache.org/jira/browse/SPARK-19819 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.1.0 Reporter: Wayne Zhang Priority: Minor Many examples in SparkDataFrame methods use: {code} path <- "path/to/file.json" df <- read.json(path) {code} This is not directly runnable. Replace this with real numerical examples so that users can directly execute the examples.
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895589#comment-15895589 ] Daniel Li commented on SPARK-6407: -- Reviving this thread since I'm interested in implementing streaming CF for Spark. bq. Using ALS for online updates is expensive. Recomputing the factor matrices _U_ and _V_ from scratch for every update would be terribly expensive, but what about keeping _U_ and _V_ around and simply recomputing another round or two after each new rating that comes in? The algorithm would simply be continually following a moving optimum. I can't imagine the RMSE changing much due to small updates if we use a convergence threshold _à la_ [Y. Zhou, et al., “Large-Scale Parallel Collaborative Filtering for the Netflix Prize”|http://dl.acm.org/citation.cfm?id=1424269] instead of a fixed number of iterations. (In fact, since calculating _(U^T) * V_ would probably take a nontrivial slice of time, new updates that come in during a round of calculation could be "batched" into the next round of calculation, increasing efficiency.) Thoughts? > Streaming ALS for Collaborative Filtering > - > > Key: SPARK-6407 > URL: https://issues.apache.org/jira/browse/SPARK-6407 > Project: Spark > Issue Type: New Feature > Components: DStreams >Reporter: Felix Cheung >Priority: Minor > > Like MLLib's ALS implementation for recommendation, and applying to streaming. > Similar to streaming linear regression, logistic regression, could we apply > gradient updates to batches of data and reuse existing MLLib implementation?
[jira] [Commented] (SPARK-19792) In the Master Page, the column named “Memory per Node” is not quite right
[ https://issues.apache.org/jira/browse/SPARK-19792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895576#comment-15895576 ] liuxian commented on SPARK-19792: - I think it refers to the memory allocated to each executor by the worker. From the code, the value of this column is "{Utils.megabytesToString(app.desc.memoryPerExecutorMB)}" > In the Master Page, the column named “Memory per Node” is not quite right > --- > > Key: SPARK-19792 > URL: https://issues.apache.org/jira/browse/SPARK-19792 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: liuxian >Priority: Trivial > > Open the Spark web page. The Master Page has two tables: the Running > Applications table and the Completed Applications table. The column named > “Memory per Node” is not quite right, because a node may have more than one > executor. So it should be named “Memory per Executor”; otherwise it is easy for > users to misunderstand.