[jira] [Updated] (SPARK-12309) Use sqlContext from MLlibTestSparkContext for spark.ml test suites
[ https://issues.apache.org/jira/browse/SPARK-12309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12309: Description: Use sqlContext from MLlibTestSparkContext rather than creating new one for spark.ml test cases. (was: Use sqlContext from MLlibTestSparkContext rather than creating new one for each spark.ml test cases.) > Use sqlContext from MLlibTestSparkContext for spark.ml test suites > -- > > Key: SPARK-12309 > URL: https://issues.apache.org/jira/browse/SPARK-12309 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > Use sqlContext from MLlibTestSparkContext rather than creating new one for > spark.ml test cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12309) Use sqlContext from MLlibTestSparkContext for spark.ml test suites
[ https://issues.apache.org/jira/browse/SPARK-12309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12309: Assignee: (was: Apache Spark) > Use sqlContext from MLlibTestSparkContext for spark.ml test suites > -- > > Key: SPARK-12309 > URL: https://issues.apache.org/jira/browse/SPARK-12309 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > Use sqlContext from MLlibTestSparkContext rather than creating new one for > each spark.ml test cases.
[jira] [Assigned] (SPARK-12309) Use sqlContext from MLlibTestSparkContext for spark.ml test suites
[ https://issues.apache.org/jira/browse/SPARK-12309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12309: Assignee: Apache Spark > Use sqlContext from MLlibTestSparkContext for spark.ml test suites > -- > > Key: SPARK-12309 > URL: https://issues.apache.org/jira/browse/SPARK-12309 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Apache Spark > > Use sqlContext from MLlibTestSparkContext rather than creating new one for > each spark.ml test cases.
[jira] [Updated] (SPARK-12309) Use sqlContext from MLlibTestSparkContext for spark.ml test suites
[ https://issues.apache.org/jira/browse/SPARK-12309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12309: Description: Use sqlContext from MLlibTestSparkContext rather than creating new one for each spark.ml test cases. (was: Use sqlContext from MLlibTestSparkContext for spark.ml test suites) > Use sqlContext from MLlibTestSparkContext for spark.ml test suites > -- > > Key: SPARK-12309 > URL: https://issues.apache.org/jira/browse/SPARK-12309 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > Use sqlContext from MLlibTestSparkContext rather than creating new one for > each spark.ml test cases.
[jira] [Created] (SPARK-12309) Use sqlContext from MLlibTestSparkContext for spark.ml test suites
Yanbo Liang created SPARK-12309: --- Summary: Use sqlContext from MLlibTestSparkContext for spark.ml test suites Key: SPARK-12309 URL: https://issues.apache.org/jira/browse/SPARK-12309 Project: Spark Issue Type: Improvement Components: ML Reporter: Yanbo Liang Use sqlContext from MLlibTestSparkContext for spark.ml test suites
[jira] [Commented] (SPARK-12062) Master rebuilding historical SparkUI should be asynchronous
[ https://issues.apache.org/jira/browse/SPARK-12062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054829#comment-15054829 ] Andrew Or commented on SPARK-12062: --- I see, if you already have a patch then this is worth fixing. However, I don't think we should introduce yet another configuration. It's best if the serving is asynchronous before we remove it completely. > Master rebuilding historical SparkUI should be asynchronous > --- > > Key: SPARK-12062 > URL: https://issues.apache.org/jira/browse/SPARK-12062 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Bryan Cutler > > When a long-running application finishes, it takes a while (sometimes > minutes) to rebuild the SparkUI. However, in Master.scala this is currently > done within the RPC event loop, which runs only in 1 thread. Thus, in the > meantime no other applications can register with this master.
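The asynchronous approach under discussion can be sketched as follows (illustrative names, not Spark's Master internals): the RPC handler only enqueues the finished application, and a dedicated worker thread performs the slow SparkUI rebuild, leaving the single-threaded RPC loop free to register new applications.

```python
import queue
import threading

class Master:
    def __init__(self):
        self.rebuilt_uis = []
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._rebuild_loop, daemon=True)
        self._worker.start()

    def _rebuild_loop(self):
        while True:
            app_id = self._pending.get()
            if app_id is None:  # shutdown sentinel
                return
            # Stands in for the minutes-long history rebuild.
            self.rebuilt_uis.append(app_id)

    def on_application_finished(self, app_id):
        # Called from the RPC event loop; returns immediately instead of
        # blocking the loop for the duration of the rebuild.
        self._pending.put(app_id)

    def stop(self):
        self._pending.put(None)
        self._worker.join()
```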
[jira] [Created] (SPARK-12308) Better onDisconnected logic for Master
Shixiong Zhu created SPARK-12308: Summary: Better onDisconnected logic for Master Key: SPARK-12308 URL: https://issues.apache.org/jira/browse/SPARK-12308 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Pasting [~vanzin]'s comment from GitHub (https://github.com/apache/spark/pull/10261#issuecomment-164107825): bq. That's because we took opposite approaches. Your PR's approach is "if this connection says the sender address is a listening socket, then also consider that address when sending events about the remote process". bq. My PR takes the opposite approach: the address of the remote process is always the address of the socket used to connect, regardless of whether its also listening in another socket. bq. I think my approach is in the end more correct, but requires more code to fix existing code. In my view, RpcCallContext.senderAddress should be the address of the socket that sent the message, not the address the remote process is listening on.
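The two addresses contrasted in the comment can be made concrete with a small sketch (types here are illustrative stand-ins, not Spark's RPC classes): a message arrives on a connecting socket, and the peer may separately advertise a listening address.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RpcAddress:
    host: str
    port: int

@dataclass
class ClientInfo:
    socket_address: RpcAddress               # where the connection came from
    listening_address: Optional[RpcAddress]  # advertised listen address, if any

def sender_address(client: ClientInfo) -> RpcAddress:
    # Under the approach described in the comment, senderAddress is always the
    # socket's address, even when the peer also listens on another socket.
    return client.socket_address
```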
[jira] [Resolved] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats
[ https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-12267. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.0.0 1.6.1 Set 1.6.1 as one of fix versions. If there is RC3, should change it to 1.6.0. > Standalone master keeps references to disassociated workers until they sent > no heartbeats > - > > Key: SPARK-12267 > URL: https://issues.apache.org/jira/browse/SPARK-12267 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Jacek Laskowski >Assignee: Shixiong Zhu > Fix For: 1.6.1, 2.0.0 > > > While toying with Spark Standalone I've noticed the following messages > in the logs of the master: > {code} > INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM > INFO Master: localhost:59920 got disassociated, removing it. > ... > WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because > we got no heartbeat in 60 seconds > INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919 > on 192.168.1.6:59919 > {code} > Why does the message "WARN Master: Removing > worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in > 60 seconds" appear when the worker should've been removed already (as > pointed out in "INFO Master: localhost:59920 got disassociated, > removing it.")? > Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920? > I started master using {{./sbin/start-master.sh -h localhost}} and the > workers {{./sbin/start-slave.sh spark://localhost:7077}}.
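The address mismatch the reporter suspects can be sketched directly: if the master tracks the worker under its advertised RPC address (192.168.1.6:59919) but the disassociation event carries the connecting socket's address (localhost:59920), the removal lookup misses and the worker lingers until the 60-second heartbeat timeout. This is a simplified illustration of the log above, not Master.scala's actual data structures.

```python
# Worker registry keyed by the advertised RPC address.
workers = {"192.168.1.6:59919": "worker-20151210090708-192.168.1.6-59919"}

def on_disassociated(remote_address):
    # Returns the removed worker id, or None when no worker matches the
    # address the disassociation event carried.
    return workers.pop(remote_address, None)
```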
[jira] [Resolved] (SPARK-12199) Follow-up: Refine example code in ml-features.md
[ https://issues.apache.org/jira/browse/SPARK-12199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-12199. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10193 [https://github.com/apache/spark/pull/10193] > Follow-up: Refine example code in ml-features.md > > > Key: SPARK-12199 > URL: https://issues.apache.org/jira/browse/SPARK-12199 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xusen Yin > Labels: starter > Fix For: 2.0.0, 1.6.1 > >
[jira] [Created] (SPARK-12307) ParquetFormat options should be exposed through the DataFrameReader/Writer options API
holdenk created SPARK-12307: --- Summary: ParquetFormat options should be exposed through the DataFrameReader/Writer options API Key: SPARK-12307 URL: https://issues.apache.org/jira/browse/SPARK-12307 Project: Spark Issue Type: Improvement Components: SQL Reporter: holdenk Priority: Trivial Currently many options for loading/saving Parquet need to be set globally on the SparkContext. It would be useful to also provide support for setting these options through the DataFrameReader/DataFrameWriter.
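The per-call options pattern the issue asks for, as opposed to one global setting on the context, can be sketched like this (class and option names are illustrative; this is not the real DataFrameWriter):

```python
class Writer:
    def __init__(self):
        self._options = {}

    def option(self, key, value):
        # Options apply only to this read/write, not globally.
        self._options[key] = value
        return self  # chainable, builder style

    def parquet(self, path):
        # A real implementation would hand these options to the Parquet
        # writer; here we just report what would be applied.
        return path, dict(self._options)
```

Usage would then look like `Writer().option("compression", "snappy").parquet("/tmp/out")`, scoping the setting to a single write.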
[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label
[ https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054664#comment-15054664 ] Benjamin Fradet commented on SPARK-7425: Is anyone working on this? I'm considering taking over this JIRA. I started writing some unit tests for a few predictors, and I'm wondering whether I should write unit tests for all the predictors. Input welcome. > spark.ml Predictor should support other numeric types for label > --- > > Key: SPARK-7425 > URL: https://issues.apache.org/jira/browse/SPARK-7425 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > Labels: starter > > Currently, the Predictor abstraction expects the input labelCol type to be > DoubleType, but we should support other numeric types. This will involve > updating the PredictorParams.validateAndTransformSchema method.
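The relaxed check the issue describes, accepting any numeric label type rather than DoubleType only, can be sketched as follows. The type names mirror Spark SQL's, but the schema is a plain dict here, not the real validateAndTransformSchema API:

```python
# Numeric types a label column would be allowed to have (then cast to double).
NUMERIC_TYPES = {"ByteType", "ShortType", "IntegerType", "LongType",
                 "FloatType", "DoubleType", "DecimalType"}

def validate_label_column(schema, label_col):
    dtype = schema[label_col]
    if dtype not in NUMERIC_TYPES:
        raise TypeError(f"Label column {label_col!r} must be numeric, got {dtype}")
    return dtype
```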
[jira] [Updated] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata
[ https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9686: Target Version/s: (was: 1.6.0) > Spark Thrift server doesn't return correct JDBC metadata > - > > Key: SPARK-9686 > URL: https://issues.apache.org/jira/browse/SPARK-9686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2 >Reporter: pin_zhang >Assignee: Cheng Lian >Priority: Critical > Attachments: SPARK-9686.1.patch.txt > > > 1. Start start-thriftserver.sh > 2. Connect with beeline > 3. Create a table > 4. Show tables; the newly created table is returned > 5. > Class.forName("org.apache.hive.jdbc.HiveDriver"); > String URL = "jdbc:hive2://localhost:1/default"; > Properties info = new Properties(); > Connection conn = DriverManager.getConnection(URL, info); > ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(), > null, null, null); > Problem: > No tables are returned by this API; this worked in Spark 1.3
[jira] [Updated] (SPARK-11785) When deployed against remote Hive metastore with lower versions, JDBC metadata calls throws exception
[ https://issues.apache.org/jira/browse/SPARK-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-11785: - Target Version/s: (was: 1.6.0) > When deployed against remote Hive metastore with lower versions, JDBC > metadata calls throws exception > - > > Key: SPARK-11785 > URL: https://issues.apache.org/jira/browse/SPARK-11785 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > > To reproduce this issue with 1.7-SNAPSHOT > # Start Hive 0.13.1 metastore service using {{$HIVE_HOME/bin/hive --service > metastore}} > # Configures remote Hive metastore in {{conf/hive-site.xml}} by pointing > {{hive.metastore.uris}} to metastore endpoint (e.g. > {{thrift://localhost:9083}}) > # Set {{spark.sql.hive.metastore.version}} to {{0.13.1}} and > {{spark.sql.hive.metastore.jars}} to {{maven}} in {{conf/spark-defaults.conf}} > # Start Thrift server using {{$SPARK_HOME/sbin/start-thriftserver.sh}} > # Run the testing JDBC client program attached at the end > Exception thrown from client side: > {noformat} > java.sql.SQLException: Could not create ResultSet: Required field > 'operationHandle' is unset! > Struct:TGetResultSetMetadataReq(operationHandle:null) > java.sql.SQLException: Could not create ResultSet: Required field > 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null) > at > org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:273) > at > org.apache.hive.jdbc.HiveQueryResultSet.(HiveQueryResultSet.java:188) > at > org.apache.hive.jdbc.HiveQueryResultSet$Builder.build(HiveQueryResultSet.java:170) > at > org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:222) > at JDBCExperiments$.main(JDBCExperiments.scala:28) > at JDBCExperiments.main(JDBCExperiments.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > Caused by: org.apache.thrift.protocol.TProtocolException: Required field > 'operationHandle' is unset! > Struct:TGetResultSetMetadataReq(operationHandle:null) > at > org.apache.hive.service.cli.thrift.TGetResultSetMetadataReq.validate(TGetResultSetMetadataReq.java:290) > at > org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args.validate(TCLIService.java:12041) > at > org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args$GetResultSetMetadata_argsStandardScheme.write(TCLIService.java:12098) > at > org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args$GetResultSetMetadata_argsStandardScheme.write(TCLIService.java:12067) > at > org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args.write(TCLIService.java:12018) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:63) > at > org.apache.hive.service.cli.thrift.TCLIService$Client.send_GetResultSetMetadata(TCLIService.java:472) > at > org.apache.hive.service.cli.thrift.TCLIService$Client.GetResultSetMetadata(TCLIService.java:464) > at > org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:242) > at > 
org.apache.hive.jdbc.HiveQueryResultSet.(HiveQueryResultSet.java:188) > at > org.apache.hive.jdbc.HiveQueryResultSet$Builder.build(HiveQueryResultSet.java:170) > at > org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:222) > at JDBCExperiments$.main(JDBCExperiments.scala:28) > at JDBCExperiments.main(JDBCExperiments.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > {noformat} > Exception thrown from server side: > {noformat} > 15/11/18 02:27:01 WARN RetryingMetaStoreClient: MetaStoreClient lost > connection. Attempting to reconnect. > org.apache.thrift.TApplicationException: Invalid method name: > 'get_schema_with_environment_context' > at > org.apache.thrift.TApplicationException.read(TApplicationException.java:111) > at > org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71) > at
[jira] [Updated] (SPARK-12199) Follow-up: Refine example code in ml-features.md
[ https://issues.apache.org/jira/browse/SPARK-12199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12199: -- Shepherd: Joseph K. Bradley (was: Xiangrui Meng) > Follow-up: Refine example code in ml-features.md > > > Key: SPARK-12199 > URL: https://issues.apache.org/jira/browse/SPARK-12199 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xusen Yin > Labels: starter >
[jira] [Assigned] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12218: Assignee: Apache Spark > Boolean logic in sql does not work "not (A and B)" is not the same as "(not > A) or (not B)" > > > Key: SPARK-12218 > URL: https://issues.apache.org/jira/browse/SPARK-12218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Irakli Machabeli >Assignee: Apache Spark >Priority: Blocker > > Two identical queries produce different results > In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( > PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff'))").count() > Out[2]: 18 > In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( > not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff')))").count() > Out[3]: 28
[jira] [Assigned] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12218: Assignee: (was: Apache Spark) > Boolean logic in sql does not work "not (A and B)" is not the same as "(not > A) or (not B)" > > > Key: SPARK-12218 > URL: https://issues.apache.org/jira/browse/SPARK-12218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Irakli Machabeli >Priority: Blocker > > Two identical queries produce different results > In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( > PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff'))").count() > Out[2]: 18 > In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( > not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff')))").count() > Out[3]: 28
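By De Morgan's laws the two filters quoted in this issue must select exactly the same rows, which is why the differing counts (18 vs 28) indicate a bug. The intended equivalence, checked on made-up in-memory rows rather than the reporter's parquet data:

```python
rows = [
    {"LoanID": 62231, "PaymentsReceived": 0.0,  "ExplicitRoll": "PreviouslyPaidOff"},
    {"LoanID": 62231, "PaymentsReceived": 10.0, "ExplicitRoll": "PreviouslyPaidOff"},
    {"LoanID": 62231, "PaymentsReceived": 0.0,  "ExplicitRoll": "Current"},
    {"LoanID": 11111, "PaymentsReceived": 0.0,  "ExplicitRoll": "Current"},
]
ROLLS = ("PreviouslyPaidOff", "PreviouslyChargedOff")

def not_both(r):
    # LoanID=62231 and not (PaymentsReceived=0 and ExplicitRoll in (...))
    return r["LoanID"] == 62231 and not (
        r["PaymentsReceived"] == 0 and r["ExplicitRoll"] in ROLLS)

def de_morgan(r):
    # LoanID=62231 and (not PaymentsReceived=0 or not ExplicitRoll in (...))
    return r["LoanID"] == 62231 and (
        r["PaymentsReceived"] != 0 or r["ExplicitRoll"] not in ROLLS)
```

Both predicates must agree on every row; any row where they disagree is a counterexample demonstrating the bug.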
[jira] [Commented] (SPARK-12303) Configuration parameter by which can choose if we want the REPL generated class directory name to be random or fixed name.
[ https://issues.apache.org/jira/browse/SPARK-12303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054515#comment-15054515 ] Neelesh Srinivas Salian commented on SPARK-12303: - Hi piyush, Thanks for opening the JIRA. Going by your description, I see you intend to change: {code} private val SPARK_DEBUG_REPL: Boolean = (System.getenv("SPARK_DEBUG_REPL") == "1") /** Local directory to save .class files too */ private lazy val outputDir = { val tmp = System.getProperty("java.io.tmpdir") val rootDir = conf.get("spark.repl.classdir", tmp) Utils.createTempDir(rootDir) } if (SPARK_DEBUG_REPL) { echo("Output directory: " + outputDir) } {code} Could you please elaborate on what the benefit of having that option would be? It will create the directory based on your conf settings. > Configuration parameter by which can choose if we want the REPL generated > class directory name to be random or fixed name. > --- > > Key: SPARK-12303 > URL: https://issues.apache.org/jira/browse/SPARK-12303 > Project: Spark > Issue Type: New Feature > Components: Spark Shell >Reporter: piyush >Priority: Minor > > .class generated by spark REPL are stored in a temp directory with random > name. > Configuration parameter by which can choose if we want the REPL generated > class directory name to be random or fixed name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
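The requested behavior can be sketched as follows. The configuration key below is hypothetical, not an existing Spark setting: when a fixed name is configured, reuse a predictable directory; otherwise keep the current behavior of a randomly named temp directory under the root from spark.repl.classdir.

```python
import os
import tempfile

def choose_output_dir(conf, root_dir):
    fixed_name = conf.get("spark.repl.classdir.fixedName")  # hypothetical key
    if fixed_name:
        # Fixed, predictable directory: same path on every invocation.
        path = os.path.join(root_dir, fixed_name)
        os.makedirs(path, exist_ok=True)
        return path
    # Default: randomly named temp directory, as the REPL does today.
    return tempfile.mkdtemp(prefix="repl-", dir=root_dir)
```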
[jira] [Updated] (SPARK-4621) Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-4621: Description: in ExternalShuffle, we can use LRUCache to store recently finished shuffle index and that can reduce indexFile's io. At first, i implement it for ExternalShuffle. Latter i will add it to IndexShuffleBlockResolver if i can. (was: in IndexShuffleBlockManager, we can use LRUCache to store recently finished shuffle index and that can reduce indexFile's io.) > Shuffle index can be cached for SortShuffleManager in ExternalShuffle in > order to reduce indexFile's io > > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Assignee: Apache Spark >Priority: Minor > > in ExternalShuffle, we can use LRUCache to store recently finished shuffle > index and that can reduce indexFile's io. At first, i implement it for > ExternalShuffle. Latter i will add it to IndexShuffleBlockResolver if i can.
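The LRUCache the description proposes can be sketched as follows, keyed by a hypothetical (shuffleId, mapId) pair with the parsed index offsets as the value; this is an illustration of the caching idea, not the actual external shuffle service code.

```python
from collections import OrderedDict

class ShuffleIndexCache:
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        return None  # caller reads the index file and calls put()

    def put(self, key, offsets):
        self._entries[key] = offsets
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

On a hit the index file is never reopened, which is exactly the I/O reduction the issue targets.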
[jira] [Assigned] (SPARK-4621) Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4621: --- Assignee: Apache Spark > Shuffle index can be cached for SortShuffleManager in ExternalShuffle in > order to reduce indexFile's io > > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Assignee: Apache Spark >Priority: Minor > > in IndexShuffleBlockManager, we can use LRUCache to store recently finished > shuffle index and that can reduce indexFile's io.
[jira] [Commented] (SPARK-4621) Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054346#comment-15054346 ] Apache Spark commented on SPARK-4621: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/10277 > Shuffle index can be cached for SortShuffleManager in ExternalShuffle in > order to reduce indexFile's io > > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Priority: Minor > > in IndexShuffleBlockManager, we can use LRUCache to store recently finished > shuffle index and that can reduce indexFile's io.
[jira] [Updated] (SPARK-4621) Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-4621: Summary: Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io (was: Shuffle index can cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io) > Shuffle index can be cached for SortShuffleManager in ExternalShuffle in > order to reduce indexFile's io > > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Priority: Minor > > in IndexShuffleBlockManager, we can use LRUCache to store recently finished > shuffle index and that can reduce indexFile's io.
[jira] [Comment Edited] (SPARK-12137) Spark Streaming State Recovery limitations
[ https://issues.apache.org/jira/browse/SPARK-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054339#comment-15054339 ] Ravindar edited comment on SPARK-12137 at 12/12/15 2:55 PM: Sean, thanks for the clarification that the current recovery functionality applies to system failure only. One has to manually delete the checkpoint directory when the processing steps change as part of an upgrade; this has to be treated separately as application functionality. For application state continuity in the upgrade scenario, the application has to explicitly save the last state for *updateStateByKey* in each iteration and then restore it if a last saved state exists, else create a default value. Or is this state already there in existing *checkpointing* that you can look up and retrieve? I am looking for a best practice in this scenario (any streaming examples?) with the following questions: 1. Do you serialize/deserialize to/from HDFS with the key as file name and the state as content? 2. Do you serialize/deserialize to/from Cassandra with key, content? was (Author: rroopreddy): Sean, thanks for the clarification on the current functionality. One has to manually delete the checkpoint directory when the processing steps change as a part of upgrade. For state continuity in upgrade scenario, the application has to explicitly save last state for *updateStateByKey* in each iteration and then restore if last saved exists else create a default value. Or this state is already there in existing *checkpointing* that you can lookup and retrieve I am looking a best practice in this scenario (any streaming examples?) with following questions 1. Do you serialize/deserialize to/from HDFS with key as file name and state as content 2. 
Do you serialize/deserialize to/from Cassandra with key, content > Spark Streaming State Recovery limitations > -- > > Key: SPARK-12137 > URL: https://issues.apache.org/jira/browse/SPARK-12137 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1 >Reporter: Ravindar >Priority: Critical > > There was multiple threads in forums asking similar question without a clear > answer and hence entering it here. > We have a streaming application that goes through multi-step processing. In > some of these steps stateful operations like *updateStateByKey* are used to > maintain an accumulated running state (and other state info) with incoming > RDD streams. As streaming application is incremental, it is imperative that > we recover/restore from previous known state in the following two scenarios > 1. On spark driver/streaming application failure. > In this scenario the driver/streaming application shutdown and > restarted. The recommended approach is enable the *checkpoint(checkpointDir)* > and use *StreamingContext.getOrCreate* to restore the context from checkpoint > state. > 2. Upgrade driver/streaming application with additional steps in the > processing > In this scenario, we introduced new steps with downstream processing for > new functionality without changes to existing steps. Upgrading the streaming > application with the new fails on *StreamingContext.getOrCreate* as there is > mismatch in checkpoint saved. > Both of the above scenarios needs a unified approach where accumulated state > has to be saved and restored. The first approach of restoring from checkpoint > works for driver failure but not code upgrade. When the application code > changed, there is a recommendation to delete checkpoint data when new code is > deployed. If so, how do you reconstitute all of the stateful (e.g: > updateStateByKey) information from the last run. 
Every streaming application > has to save up-to-date state for each session represented by key and then > initialize it from this when a new session starts for the same key. Does > every application have to create their own mechanism given this is very > similar to current state checkpointing to HDFS.
[jira] [Updated] (SPARK-4621) Shuffle index can cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-4621: Summary: Shuffle index can cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io (was: when sort- based shuffle, Cache recently finished shuffle index can reduce indexFile's io) > Shuffle index can cached for SortShuffleManager in ExternalShuffle in order > to reduce indexFile's io > - > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Priority: Minor > > in IndexShuffleBlockManager, we can use LRUCache to store recently finished > shuffle index and that can reduce indexFile's io.
[jira] [Commented] (SPARK-12137) Spark Streaming State Recovery limitations
[ https://issues.apache.org/jira/browse/SPARK-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054339#comment-15054339 ] Ravindar commented on SPARK-12137: -- Sean, thanks for the clarification on the current functionality. One has to manually delete the checkpoint directory when the processing steps change as part of an upgrade. For state continuity in the upgrade scenario, the application has to explicitly save the last state for *updateStateByKey* in each iteration, and then restore it if a last saved state exists, else create a default value. Or is this state already there in the existing *checkpointing*, so that you can look it up and retrieve it? I am looking for a best practice in this scenario (any streaming examples?) with the following questions: 1. Do you serialize/deserialize to/from HDFS with the key as the file name and the state as the content? 2. Do you serialize/deserialize to/from Cassandra with key, content? > Spark Streaming State Recovery limitations > -- > > Key: SPARK-12137 > URL: https://issues.apache.org/jira/browse/SPARK-12137 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1 >Reporter: Ravindar >Priority: Critical > > There were multiple threads in forums asking a similar question without a clear > answer, hence entering it here. > We have a streaming application that goes through multi-step processing. In > some of these steps, stateful operations like *updateStateByKey* are used to > maintain an accumulated running state (and other state info) with incoming > RDD streams. As the streaming application is incremental, it is imperative that > we recover/restore from the previous known state in the following two scenarios: > 1. On Spark driver/streaming application failure. > In this scenario the driver/streaming application is shut down and > restarted. The recommended approach is to enable *checkpoint(checkpointDir)* > and use *StreamingContext.getOrCreate* to restore the context from checkpoint > state. > 2. Upgrade of the driver/streaming application with additional steps in the > processing. > In this scenario, we introduced new steps with downstream processing for > new functionality, without changes to existing steps. Upgrading the streaming > application with the new code fails on *StreamingContext.getOrCreate*, as there is a > mismatch with the saved checkpoint. > Both of the above scenarios need a unified approach where accumulated state > can be saved and restored. The first approach of restoring from checkpoint > works for driver failure but not for a code upgrade. When the application code > changes, the recommendation is to delete checkpoint data when new code is > deployed. If so, how do you reconstitute all of the stateful (e.g. > updateStateByKey) information from the last run? Every streaming application > has to save up-to-date state for each session, represented by a key, and then > initialize from it when a new session starts for the same key. Does > every application have to create its own mechanism, given this is very > similar to the current state checkpointing to HDFS? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
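The save-and-restore pattern the thread is asking about can be modeled without Spark at all: keep the per-key accumulated state, snapshot it to external storage (HDFS, Cassandra, ...) each batch, and feed the snapshot back in as the initial state after an upgrade. A minimal pure-Python sketch of that pattern (all function names are illustrative, not Spark API; in Spark you would write the snapshot from *foreachRDD* and pass the restored state via the initial-RDD variant of *updateStateByKey*, if your version provides one):

```python
import json

def update_state(new_values, running_count):
    # Models an updateStateByKey update function: fold the batch's new
    # values for a key into that key's accumulated state.
    return (running_count or 0) + sum(new_values)

def apply_batch(state, batch):
    # One micro-batch: update each key's state from its incoming values.
    for key, value in batch:
        state[key] = update_state([value], state.get(key))
    return state

def snapshot(state):
    # Serialize the full key -> state map to external storage.
    return json.dumps(state)

def restore(blob):
    # On restart after an upgrade, rebuild state from the last snapshot;
    # with no snapshot, start from empty (the default-value case).
    return json.loads(blob) if blob else {}

state = restore(None)
state = apply_batch(state, [("a", 3), ("b", 1), ("a", 2)])
blob = snapshot(state)    # saved externally at the end of each batch
state2 = restore(blob)    # restored by the upgraded application
state2 = apply_batch(state2, [("a", 1)])
```

Because the snapshot format is owned by the application rather than by Spark's checkpoint serialization, it survives code changes that would invalidate a checkpoint.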
[jira] [Commented] (SPARK-12305) Add Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054201#comment-15054201 ] Sean Owen commented on SPARK-12305: --- Same here, [~proflin]: please don't open nearly blank JIRAs like this. Wait until you've written up a clear description and then open it. I'm going to close these otherwise. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > Add Receiver scheduling info onto Spark Streaming web UI > > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054199#comment-15054199 ] Sean Owen commented on SPARK-12306: --- [~proflin] there's no detail here. It doesn't sound like something that should be optionally ignored. I'd have to close this unless you can make a case for this, or at least describe it. > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12302) Example for servlet filter used by spark.ui.filters
[ https://issues.apache.org/jira/browse/SPARK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12302: Assignee: Apache Spark > Example for servlet filter used by spark.ui.filters > --- > > Key: SPARK-12302 > URL: https://issues.apache.org/jira/browse/SPARK-12302 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 1.5.2 >Reporter: Kai Sasaki >Assignee: Apache Spark >Priority: Trivial > Labels: examples, security > > Although the {{spark.ui.filters}} configuration uses a simple servlet filter, it is > often difficult to understand how to write the filter code and how to integrate it > with actual Spark applications. > It would help to provide examples for trying out a secure Spark cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
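For context, wiring a filter into the UI is mostly configuration. A sketch of what such an example might configure (the filter class {{com.example.BasicAuthFilter}} is hypothetical, and the parameter-passing form should be verified against the Spark configuration docs for your version):

```
# spark-defaults.conf (illustrative sketch)
spark.ui.filters                          com.example.BasicAuthFilter
# Filter parameters are read from spark.<filter class>.params
spark.com.example.BasicAuthFilter.params  username=admin,password=secret
```

The filter itself would be a standard javax.servlet.Filter implementation placed on the driver's classpath; the missing piece this issue asks for is a worked example of that filter code.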
[jira] [Assigned] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12304: Assignee: Apache Spark > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Assignee: Apache Spark >Priority: Minor > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat unfriendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. > Before: > !before-5.png! > After: > !after-5.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
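The maxY problem is easy to see with numbers. A tiny illustrative Python sketch (the rates are made up): when the per-receiver graphs inherit the y-axis ceiling of the aggregate input-rate graph, a slow receiver's line is squashed near zero; the proposal derives a separate ceiling from the per-receiver series themselves:

```python
# Made-up per-receiver ingestion rates (events/sec) over three ticks.
per_receiver_rates = {
    "receiver-0": [950, 1000, 980],
    "receiver-1": [8, 12, 10],
}

# Today: every graph, including each per-receiver graph, shares the
# ceiling of the aggregate input-rate graph (sum across receivers),
# so receiver-1's line barely rises above the x-axis.
ticks = list(zip(*per_receiver_rates.values()))
input_rate_max_y = max(sum(t) for t in ticks)

# Proposed: a separate ceiling shared only among the per-receiver
# graphs, computed from the per-receiver series themselves.
per_receiver_max_y = max(max(s) for s in per_receiver_rates.values())
```

With tens of receivers the gap between the two ceilings grows with the receiver count, which is exactly the "line almost hits the ground" effect described above.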
[jira] [Assigned] (SPARK-12302) Example for servlet filter used by spark.ui.filters
[ https://issues.apache.org/jira/browse/SPARK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12302: Assignee: (was: Apache Spark) > Example for servlet filter used by spark.ui.filters > --- > > Key: SPARK-12302 > URL: https://issues.apache.org/jira/browse/SPARK-12302 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 1.5.2 >Reporter: Kai Sasaki >Priority: Trivial > Labels: examples, security > > Although the {{spark.ui.filters}} configuration uses a simple servlet filter, it is > often difficult to understand how to write the filter code and how to integrate it > with actual Spark applications. > It would help to provide examples for trying out a secure Spark cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12304: Assignee: (was: Apache Spark) > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat unfriendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. > Before: > !before-5.png! > After: > !after-5.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12305) Add Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12305: -- Shepherd: Shixiong Zhu > Add Receiver scheduling info onto Spark Streaming web UI > > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054184#comment-15054184 ] Liwei Lin edited comment on SPARK-12306 at 12/12/15 9:51 AM: - This issue is reported by me, and I'm working on it. :-) was (Author: proflin): This is reported by me, and I'm working on it. :-) > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054184#comment-15054184 ] Liwei Lin commented on SPARK-12306: --- This is reported by me, and I'm working on it. :-) > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12305) Add Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054185#comment-15054185 ] Liwei Lin commented on SPARK-12305: --- This issue is reported by me, and I'm working on it. :-) > Add Receiver scheduling info onto Spark Streaming web UI > > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12306: -- Shepherd: Shixiong Zhu > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12306: -- Component/s: Streaming > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12306: -- Affects Version/s: 1.5.2 > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.5.2 >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12306: -- Summary: Add an option to ignore BlockRDD partition data loss (was: ToEdit) > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12305) Add Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12305: -- Summary: Add Receiver scheduling info onto Spark Streaming web UI (was: Adds Receiver scheduling info onto Spark Streaming web UI) > Add Receiver scheduling info onto Spark Streaming web UI > > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12305) Adds Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12305: -- Summary: Adds Receiver scheduling info onto Spark Streaming web UI (was: Adds Receiver scheduling info onto ) > Adds Receiver scheduling info onto Spark Streaming web UI > - > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12305) Adds Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12305: -- Component/s: Streaming > Adds Receiver scheduling info onto Spark Streaming web UI > - > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12305) Adds Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12305: -- Affects Version/s: 1.5.2 > Adds Receiver scheduling info onto Spark Streaming web UI > - > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12305) Adds Receiver scheduling info onto
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12305: -- Priority: Minor (was: Critical) > Adds Receiver scheduling info onto > --- > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12305) Adds Receiver scheduling info onto
Liwei Lin created SPARK-12305: - Summary: Adds Receiver scheduling info onto Key: SPARK-12305 URL: https://issues.apache.org/jira/browse/SPARK-12305 Project: Spark Issue Type: Improvement Reporter: Liwei Lin Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12306) ToEdit
Liwei Lin created SPARK-12306: - Summary: ToEdit Key: SPARK-12306 URL: https://issues.apache.org/jira/browse/SPARK-12306 Project: Spark Issue Type: Improvement Reporter: Liwei Lin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12304: -- Description: Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. This may lead to somewhat unfriendly graphs: once we have tens of Receivers or more, every 'Per-Receiver Times' line almost hits the ground. This issue proposes to calculate a new maxY against the original one, which is shared among all the 'Per-Receiver Times & Histograms' graphs. Before: !before-5.png! After: !after-5.png! was: Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. This may lead to somewhat unfriendly graphs: once we have tens of Receivers or more, every 'Per-Receiver Times' line almost hits the ground. This issue proposes to calculate a new maxY against the original one, which is shared among all the 'Per-Receiver Times & Histograms' graphs. > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat unfriendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. > Before: > !before-5.png! > After: > !after-5.png!
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12304: -- Description: Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. This may lead to somewhat unfriendly graphs: once we have tens of Receivers or more, every 'Per-Receiver Times' line almost hits the ground. This issue proposes to calculate a new maxY against the original one, which is shared among all the 'Per-Receiver Times & Histograms' graphs. was: Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. This may lead to somewhat unfriendly graphs: once we have tens of Receivers or more, every 'Per-Receiver Times' line almost hits the ground. This issue proposes to calculate a new maxY against the original one, which is shared among all the 'Per-Receiver Times & Histograms' graphs. > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat unfriendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12304: -- Attachment: after-5.png before-5.png > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat unfriendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12304: -- Summary: Make Spark Streaming web UI display more friendly Receiver graphs (was: Make Spark Streaming web UI display more friendly Receiver graph) > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat unfriendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graph
Liwei Lin created SPARK-12304: - Summary: Make Spark Streaming web UI display more friendly Receiver graph Key: SPARK-12304 URL: https://issues.apache.org/jira/browse/SPARK-12304 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.2 Reporter: Liwei Lin Priority: Minor Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. This may lead to somewhat unfriendly graphs: once we have tens of Receivers or more, every 'Per-Receiver Times' line almost hits the ground. This issue proposes to calculate a new maxY against the original one, which is shared among all the 'Per-Receiver Times & Histograms' graphs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver
[ https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11193. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10203 [https://github.com/apache/spark/pull/10203] > Spark 1.5+ Kinesis Streaming - ClassCastException when starting > KinesisReceiver > --- > > Key: SPARK-11193 > URL: https://issues.apache.org/jira/browse/SPARK-11193 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.0, 1.5.1 >Reporter: Phil Kallos > Fix For: 2.0.0, 1.6.1 > > Attachments: screen.png > > > After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis > Spark Streaming application, and am being consistently greeted with this > exception: > java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast > to scala.collection.mutable.SynchronizedMap > at > org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532) > at > org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) > at > org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Worth noting that I am able to reproduce this issue locally, and also on > Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0). > Also, I am not able to run the included kinesis-asl example. > Built locally using: > git checkout v1.5.1 > mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package > Example run command: > bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector > https://kinesis.us-east-1.amazonaws.com -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2870. -- Resolution: Not A Problem > Thorough schema inference directly on RDDs of Python dictionaries > - > > Key: SPARK-2870 > URL: https://issues.apache.org/jira/browse/SPARK-2870 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Reporter: Nicholas Chammas > > h4. Background > I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. > They process JSON text directly and infer a schema that covers the entire > source data set. > This is very important with semi-structured data like JSON since individual > elements in the data set are free to have different structures. Matching > fields across elements may even have different value types. > For example: > {code} > {"a": 5} > {"a": "cow"} > {code} > To get a queryable schema that covers the whole data set, you need to infer a > schema by looking at the whole data set. The aforementioned > {{SQLContext.json...()}} methods do this very well. > h4. Feature Request > What we need is for {{SQlContext.inferSchema()}} to do this, too. > Alternatively, we need a new {{SQLContext}} method that works on RDDs of > Python dictionaries and does something functionally equivalent to this: > {code} > SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x))) > {code} > As of 1.0.2, > [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] > just looks at the first element in the data set. This won't help much when > the structure of the elements in the target RDD is variable. > h4. Example Use Case > * You have some JSON text data that you want to analyze using Spark SQL. > * You would use one of the {{SQLContext.json...()}} methods, but you need to > do some filtering on the data first to remove bad elements--basically, some > minimal schema validation. 
> * You deserialize the JSON objects to Python {{dict}}s and filter out the > bad ones. You now have an RDD of dictionaries. > * From this RDD, you want a SchemaRDD that captures the schema for the whole > data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
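The whole-data-set inference the request asks for can be sketched in plain Python. This is illustrative only, not Spark's implementation; the rule of widening conflicting field types to string is an assumption modeled on how the JSON methods behave:

```python
def infer_whole_schema(records):
    """Infer a field -> type-name mapping by scanning *every* record,
    widening conflicting value types to 'string' (assumed JSON-style rule)."""
    schema = {}
    for rec in records:
        for field, value in rec.items():
            t = type(value).__name__
            if field not in schema:
                schema[field] = t
            elif schema[field] != t:
                # Same field seen with two different types: widen to string.
                schema[field] = "string"
    return schema

records = [{"a": 5}, {"a": "cow"}, {"b": 1.5}]
print(infer_whole_schema(records))  # -> {'a': 'string', 'b': 'float'}
```

Unlike a first-element-only {{inferSchema()}}, this sees every record, so field {{b}} and the {{int}}/{{str}} conflict on {{a}} both survive into the schema.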
[jira] [Updated] (SPARK-12303) Configuration parameter to choose whether the REPL-generated class directory name is random or fixed
[ https://issues.apache.org/jira/browse/SPARK-12303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12303: -- Priority: Minor (was: Major) Issue Type: New Feature (was: Wish) What would be the purpose of this? > Configuration parameter to choose whether the REPL-generated > class directory name is random or fixed > --- > > Key: SPARK-12303 > URL: https://issues.apache.org/jira/browse/SPARK-12303 > Project: Spark > Issue Type: New Feature > Components: Spark Shell >Reporter: piyush >Priority: Minor > > {{.class}} files generated by the Spark REPL are stored in a temp directory with a random > name. > This requests a configuration parameter to choose whether the REPL-generated > class directory name is random or fixed.
[jira] [Commented] (SPARK-12218) Boolean logic in SQL does not work: "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054142#comment-15054142 ] Xiao Li commented on SPARK-12218: - Found the fix, but it is not merged into 1.5.2. I will open a PR tomorrow so the fix can be merged for 1.5.3. Thanks! > Boolean logic in SQL does not work: "not (A and B)" is not the same as "(not > A) or (not B)" > > > Key: SPARK-12218 > URL: https://issues.apache.org/jira/browse/SPARK-12218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Irakli Machabeli >Priority: Blocker > > Two logically equivalent queries produce different results: > In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( > PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff'))").count() > Out[2]: 18 > In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( > not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff')))").count() > Out[3]: 28
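The invariant the two queries must satisfy is De Morgan's law. A small pure-Python check (the rows below are hypothetical; only the column names come from the report) shows the two predicates necessarily select the same row set, which is what Spark SQL 1.5.2 violated:

```python
# Hypothetical rows using the report's column names.
rows = [
    {"PaymentsReceived": 0,   "ExplicitRoll": "PreviouslyPaidOff"},
    {"PaymentsReceived": 0,   "ExplicitRoll": "Current"},
    {"PaymentsReceived": 100, "ExplicitRoll": "PreviouslyChargedOff"},
    {"PaymentsReceived": 100, "ExplicitRoll": "Current"},
]
bad_rolls = {"PreviouslyPaidOff", "PreviouslyChargedOff"}

# Query [2]: not (A and B)
pred1 = [r for r in rows
         if not (r["PaymentsReceived"] == 0 and r["ExplicitRoll"] in bad_rolls)]

# Query [3]: (not A) or (not B)
pred2 = [r for r in rows
         if (not r["PaymentsReceived"] == 0) or (r["ExplicitRoll"] not in bad_rolls)]

assert pred1 == pred2  # De Morgan's law: identical row sets
print(len(pred1))      # -> 3 (only the 0/'PreviouslyPaidOff' row is excluded)
```

Any engine returning different counts for these two forms, as the 18 vs. 28 above shows, has a bug in its boolean-expression handling.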
[jira] [Assigned] (SPARK-12300) Fix schema inference on local collections
[ https://issues.apache.org/jira/browse/SPARK-12300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12300: Assignee: (was: Apache Spark) > Fix schema inference on local collections > - > > Key: SPARK-12300 > URL: https://issues.apache.org/jira/browse/SPARK-12300 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: holdenk >Priority: Minor > > Current schema inference for local Python collections halts as soon as there > are no NullTypes. This is different from when we specify a sampling ratio of > 1.0 on a distributed collection. This could result in incomplete schema > information. > Repro: > {code} > input = [{"a": 1}, {"b": "coffee"}] > df = sqlContext.createDataFrame(input) > print df.schema > {code} > Discovered while looking at SPARK-2870
[jira] [Assigned] (SPARK-12300) Fix schema inference on local collections
[ https://issues.apache.org/jira/browse/SPARK-12300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12300: Assignee: Apache Spark > Fix schema inference on local collections > - > > Key: SPARK-12300 > URL: https://issues.apache.org/jira/browse/SPARK-12300 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > > Current schema inference for local Python collections halts as soon as there > are no NullTypes. This is different from when we specify a sampling ratio of > 1.0 on a distributed collection. This could result in incomplete schema > information. > Repro: > {code} > input = [{"a": 1}, {"b": "coffee"}] > df = sqlContext.createDataFrame(input) > print df.schema > {code} > Discovered while looking at SPARK-2870
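A plain-Python sketch (illustrative only, not Spark's actual inference code) of why the repro above prints an incomplete schema: inferring from only the first element drops field {{b}} entirely, whereas a full scan, analogous to {{samplingRatio=1.0}} on a distributed collection, keeps it:

```python
input_rows = [{"a": 1}, {"b": "coffee"}]

# First-element-only inference (the behavior the bug describes):
# the first row has no NullTypes, so scanning stops and "b" is never seen.
first_only = {k: type(v).__name__ for k, v in input_rows[0].items()}

# Full-scan inference: every row contributes its fields to the schema.
full_scan = {}
for row in input_rows:
    for k, v in row.items():
        full_scan.setdefault(k, type(v).__name__)

print(first_only)  # -> {'a': 'int'}
print(full_scan)   # -> {'a': 'int', 'b': 'str'}
```

The fix tracked here is to make local-collection inference behave like the full-scan case so {{df.schema}} in the repro includes both fields.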