[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatedly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555680#comment-14555680 ] Paul Wu commented on SPARK-7804: Unfortunately, JdbcRDD was poorly designed: its lowerBound and upperBound are long types, which is too limiting. One of my team members implemented a general version based on the same idea, but some of my team are worried about the home-made solution. When we saw JDBCRDD, it looked like what we wanted. In fact, I hope JDBCRDD can be made public, or that JdbcRDD can be redesigned to handle the general case just as JDBCRDD does. > Incorrect results from JDBCRDD -- one record repeatedly > - > > Key: SPARK-7804 > URL: https://issues.apache.org/jira/browse/SPARK-7804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Paul Wu > > Only one record is returned, repeated throughout the RDD, with repeated field values: > > I have a table like: > {code} > attuid name email > 12 john j...@appp.com > 23 tom t...@appp.com > 34 tony t...@appp.com > {code} > My code: > {code} > JavaSparkContext sc = new JavaSparkContext(sparkConf); > String url = ""; > java.util.Properties prop = new Properties(); > List<JDBCPartition> partitionList = new ArrayList<>(); > partitionList.add(new JDBCPartition("1=1", 0)); > List<StructField> fields = new ArrayList<>(); > fields.add(DataTypes.createStructField("attuid", DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("name", DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("email", DataTypes.StringType, true)); > StructType schema = DataTypes.createStructType(fields); > JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(), > JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop), > schema, > " USERS", > new String[]{"attuid", "name", "email"}, > new Filter[]{}, > partitionList.toArray(new JDBCPartition[0])); > System.out.println("count before to Java RDD=" + jdbcRDD.cache().count()); > JavaRDD<Row> jrdd = jdbcRDD.toJavaRDD(); > System.out.println("count=" + jrdd.count()); > List<Row> lr = jrdd.collect(); > for (Row r : lr) { > for (int ii = 0; ii < r.length(); ii++) { > System.out.println(r.getString(ii)); > } > } > {code} > === > The result is: > {code} > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
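For reference on the "long bounds" limitation discussed in the comment above, below is a minimal, hedged sketch of how the public org.apache.spark.rdd.JdbcRDD is typically driven from Java via JdbcRDD.create. The connection URL, query, and bound values are placeholders, and the exact overloads should be verified against the Spark version in use; this is an illustration, not the reporter's code.
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.JdbcRDD;

public class PublicJdbcRddSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("JdbcRDD sketch"));

    // Placeholder connection details -- adjust for the actual database.
    final String url = "jdbc:oracle:thin:@//dbhost:1521/service";

    // The query must contain two '?' placeholders that Spark fills with each
    // partition's lower and upper bound -- this is the numeric (long) bounds
    // restriction the comment above is referring to.
    JavaRDD<String[]> rows = JdbcRDD.create(
        sc,
        new JdbcRDD.ConnectionFactory() {
          @Override
          public Connection getConnection() throws Exception {
            return DriverManager.getConnection(url);
          }
        },
        "SELECT attuid, name, email FROM users WHERE attuid >= ? AND attuid <= ?",
        1L,    // lowerBound (must be a long)
        100L,  // upperBound (must be a long)
        3,     // numPartitions
        new Function<ResultSet, String[]>() {
          @Override
          public String[] call(ResultSet rs) throws Exception {
            return new String[] {rs.getString(1), rs.getString(2), rs.getString(3)};
          }
        });

    for (String[] r : rows.collect()) {
      System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
    sc.stop();
  }
}
{code}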
[jira] [Commented] (SPARK-7605) Python API for ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555669#comment-14555669 ] Apache Spark commented on SPARK-7605: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6346 > Python API for ElementwiseProduct > - > > Key: SPARK-7605 > URL: https://issues.apache.org/jira/browse/SPARK-7605 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Yanbo Liang > > Python API for org.apache.spark.mllib.feature.ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
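For context, here is a minimal sketch of the existing Scala/Java org.apache.spark.mllib.feature.ElementwiseProduct API that the Python wrapper in the pull request above is expected to mirror; the scaling values are arbitrary examples, and the eventual Python method names may differ.
{code}
import org.apache.spark.mllib.feature.ElementwiseProduct;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class ElementwiseProductSketch {
  public static void main(String[] args) {
    // Hadamard (element-wise) product: each input component is multiplied
    // by the corresponding component of the scaling vector.
    ElementwiseProduct transformer =
        new ElementwiseProduct(Vectors.dense(0.0, 1.0, 2.0));

    Vector scaled = transformer.transform(Vectors.dense(1.0, 2.0, 3.0));
    System.out.println(scaled);  // expected: [0.0, 2.0, 6.0]
  }
}
{code}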
[jira] [Assigned] (SPARK-7605) Python API for ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7605: --- Assignee: (was: Apache Spark) > Python API for ElementwiseProduct > - > > Key: SPARK-7605 > URL: https://issues.apache.org/jira/browse/SPARK-7605 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Yanbo Liang > > Python API for org.apache.spark.mllib.feature.ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7605) Python API for ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7605: --- Assignee: Apache Spark > Python API for ElementwiseProduct > - > > Key: SPARK-7605 > URL: https://issues.apache.org/jira/browse/SPARK-7605 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Yanbo Liang >Assignee: Apache Spark > > Python API for org.apache.spark.mllib.feature.ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users
[ https://issues.apache.org/jira/browse/SPARK-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7221: --- Assignee: (was: Apache Spark) > Expose the current processed file name of FileInputDStream to the users > --- > > Key: SPARK-7221 > URL: https://issues.apache.org/jira/browse/SPARK-7221 > Project: Spark > Issue Type: Wish > Components: Streaming >Reporter: Saisai Shao >Priority: Minor > > This is a wished feature from Spark user list > (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html). > Currently there's no API to get the processed file name for > FileInputDStream, it is useful if we can expose this to the users. > The major problem is how to expose this to the users with an elegant way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users
[ https://issues.apache.org/jira/browse/SPARK-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7221: --- Assignee: Apache Spark > Expose the current processed file name of FileInputDStream to the users > --- > > Key: SPARK-7221 > URL: https://issues.apache.org/jira/browse/SPARK-7221 > Project: Spark > Issue Type: Wish > Components: Streaming >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > > This is a wished feature from Spark user list > (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html). > Currently there's no API to get the processed file name for > FileInputDStream, it is useful if we can expose this to the users. > The major problem is how to expose this to the users with an elegant way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users
[ https://issues.apache.org/jira/browse/SPARK-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555668#comment-14555668 ] Apache Spark commented on SPARK-7221: - User 'animeshbaranawal' has created a pull request for this issue: https://github.com/apache/spark/pull/6347 > Expose the current processed file name of FileInputDStream to the users > --- > > Key: SPARK-7221 > URL: https://issues.apache.org/jira/browse/SPARK-7221 > Project: Spark > Issue Type: Wish > Components: Streaming >Reporter: Saisai Shao >Priority: Minor > > This is a feature requested on the Spark user list > (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html). > Currently there is no API to get the name of the file being processed by > FileInputDStream; it would be useful to expose this to users. > The main question is how to expose it to users in an elegant way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatedly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7804: -- Flags: (was: Important) Labels: (was: JDBCRDD sql) > Incorrect results from JDBCRDD -- one record repeatly > - > > Key: SPARK-7804 > URL: https://issues.apache.org/jira/browse/SPARK-7804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Paul Wu > > Getting only one record repeated in the RDD and repeated field value: > > I have a table like: > {code} > attuid name email > 12 john j...@appp.com > 23 tom t...@appp.com > 34 tony t...@appp.com > {code} > My code: > {code} > JavaSparkContext sc = new JavaSparkContext(sparkConf); > String url = ""; > java.util.Properties prop = new Properties(); > List partitionList = new ArrayList<>(); > //int i; > partitionList.add(new JDBCPartition("1=1", 0)); > > List fields = new ArrayList(); > fields.add(DataTypes.createStructField("attuid", > DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("name", DataTypes.StringType, > true)); > fields.add(DataTypes.createStructField("email", DataTypes.StringType, > true)); > StructType schema = DataTypes.createStructType(fields); > JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(), > JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop), > > schema, > " USERS", > new String[]{"attuid", "name", "email"}, > new Filter[]{ }, > > partitionList.toArray(new JDBCPartition[0]) > > ); > > System.out.println("count before to Java RDD=" + > jdbcRDD.cache().count()); > JavaRDD jrdd = jdbcRDD.toJavaRDD(); > System.out.println("count=" + jrdd.count()); > List lr = jrdd.collect(); > for (Row r : lr) { > for (int ii = 0; ii < r.length(); ii++) { > System.out.println(r.getString(ii)); > } > } > {code} > === > result is : > {code} > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatedly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555647#comment-14555647 ] Josh Rosen commented on SPARK-7804: --- If possible, we should be hiding the internal JDBCRDD (all-caps) from the Javadoc; I've filed SPARK-7821 so that we remember to follow up on this. Slightly confusingly, Spark also has another class called JdbcRDD (note the different capitalization) which _is_ a public API: https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/rdd/JdbcRDD.html. Perhaps you meant to use that instead? There might be a way to address your use-case while continuing to use the public DataFrame APIs, but I don't know enough about your use-case or Spark SQL APIs to provide a great answer. The Spark Users mailing list would probably be a better place to have that discussion, though. In the meantime, I'm going to resolve this JIRA ticket as "Not an Issue." > Incorrect results from JDBCRDD -- one record repeatly > - > > Key: SPARK-7804 > URL: https://issues.apache.org/jira/browse/SPARK-7804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Paul Wu > > Getting only one record repeated in the RDD and repeated field value: > > I have a table like: > {code} > attuid name email > 12 john j...@appp.com > 23 tom t...@appp.com > 34 tony t...@appp.com > {code} > My code: > {code} > JavaSparkContext sc = new JavaSparkContext(sparkConf); > String url = ""; > java.util.Properties prop = new Properties(); > List partitionList = new ArrayList<>(); > //int i; > partitionList.add(new JDBCPartition("1=1", 0)); > > List fields = new ArrayList(); > fields.add(DataTypes.createStructField("attuid", > DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("name", DataTypes.StringType, > true)); > fields.add(DataTypes.createStructField("email", DataTypes.StringType, > true)); > StructType schema = DataTypes.createStructType(fields); > JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(), > JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop), > > schema, > " USERS", > new String[]{"attuid", "name", "email"}, > new Filter[]{ }, > > partitionList.toArray(new JDBCPartition[0]) > > ); > > System.out.println("count before to Java RDD=" + > jdbcRDD.cache().count()); > JavaRDD jrdd = jdbcRDD.toJavaRDD(); > System.out.println("count=" + jrdd.count()); > List lr = jrdd.collect(); > for (Row r : lr) { > for (int ii = 0; ii < r.length(); ii++) { > System.out.println(r.getString(ii)); > } > } > {code} > === > result is : > {code} > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatedly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-7804. --- Resolution: Invalid > Incorrect results from JDBCRDD -- one record repeatly > - > > Key: SPARK-7804 > URL: https://issues.apache.org/jira/browse/SPARK-7804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Paul Wu > > Getting only one record repeated in the RDD and repeated field value: > > I have a table like: > {code} > attuid name email > 12 john j...@appp.com > 23 tom t...@appp.com > 34 tony t...@appp.com > {code} > My code: > {code} > JavaSparkContext sc = new JavaSparkContext(sparkConf); > String url = ""; > java.util.Properties prop = new Properties(); > List partitionList = new ArrayList<>(); > //int i; > partitionList.add(new JDBCPartition("1=1", 0)); > > List fields = new ArrayList(); > fields.add(DataTypes.createStructField("attuid", > DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("name", DataTypes.StringType, > true)); > fields.add(DataTypes.createStructField("email", DataTypes.StringType, > true)); > StructType schema = DataTypes.createStructType(fields); > JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(), > JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop), > > schema, > " USERS", > new String[]{"attuid", "name", "email"}, > new Filter[]{ }, > > partitionList.toArray(new JDBCPartition[0]) > > ); > > System.out.println("count before to Java RDD=" + > jdbcRDD.cache().count()); > JavaRDD jrdd = jdbcRDD.toJavaRDD(); > System.out.println("count=" + jrdd.count()); > List lr = jrdd.collect(); > for (Row r : lr) { > for (int ii = 0; ii < r.length(); ii++) { > System.out.println(r.getString(ii)); > } > } > {code} > === > result is : > {code} > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7535) Audit Pipeline APIs for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554629#comment-14554629 ] Xiangrui Meng edited comment on SPARK-7535 at 5/22/15 6:17 AM: --- Some notes: In PR #6322: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 1. Move Evaluator to ml.evaluation. 1. Mention larger metrics are better. 1. PipelineModel doc. “compiled” -> “fitted” 1. Hide PolynomialExpansion.expand 1. Hide VectorAssembler. 1. Word2Vec.minCount -> @param 1. ParamValidators -> DeveloperApi 1. Hide MetadataUtils/SchemaUtils. Others: 1. @varargs to setDefault (SPARK-7498) 1. Update RegexTokenizer default setting. (SPARK-7794) 1. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 1. Remove Params.validateParams(paramMap)? 1. param and getParam should be final (SPARK-7816) 1. UnresolvedAttribute (Java compatibility?) 1. Missing RegressionEvaluator (SPARK-7404) 1. ml.feature missing package doc (SPARK-7808) 1. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 1. ALSModel -> remove training parameters? was (Author: mengxr): Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc (SPARK-7808) 10. param and getParam should be final (SPARK-7816) 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. (SPARK-7794) 13. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. > Audit Pipeline APIs for 1.4 > --- > > Key: SPARK-7535 > URL: https://issues.apache.org/jira/browse/SPARK-7535 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > This is an umbrella for auditing the Pipeline (spark.ml) APIs. Items to > check: > * Public/protected/private access > * Consistency across spark.ml > * Classes, methods, and parameters in spark.mllib but missing in spark.ml > ** We should create JIRAs for each of these (under an umbrella) as to-do > items for future releases. > For each algorithm or API component, create a subtask under this umbrella. > Some major new items: > * new feature transformers > * tree models > * elastic-net > * ML attributes > * developer APIs (Predictor, Classifier, Regressor) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7821) Hide private SQL JDBC classes from Javadoc
Josh Rosen created SPARK-7821: - Summary: Hide private SQL JDBC classes from Javadoc Key: SPARK-7821 URL: https://issues.apache.org/jira/browse/SPARK-7821 Project: Spark Issue Type: Improvement Components: Documentation, SQL Reporter: Josh Rosen We should hide {{private\[sql\]}} JDBC classes from the generated Javadoc, since showing these internal classes can be confusing to users. This is especially important for the SQL {{jdbc}} package because it contains an internal JDBCRDD class which is easily confused with the public JdbcRDD class in Spark Core (see SPARK-7804 for an example of this). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7712) Native Spark Window Functions & Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555629#comment-14555629 ] Yin Huai commented on SPARK-7712: - By the way, I just removed the fix version. We set that when we resolve the JIRA. > Native Spark Window Functions & Performance Improvements > - > > Key: SPARK-7712 > URL: https://issues.apache.org/jira/browse/SPARK-7712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 >Reporter: Herman van Hovell tot Westerflier > Original Estimate: 336h > Remaining Estimate: 336h > > Hi All, > After playing with the current Spark window implementation, I tried to take > this to the next level. My main goals were to address the following issues: > native Spark SQL and performance. > *Native Spark SQL* > The current implementation uses Hive UDAFs as its aggregation mechanism. We > try to address the following issues by moving to a more 'native' Spark SQL > approach: > - Window functions require Hive. Some people (mostly by accident) use Spark > SQL without Hive. Usage of UDAFs is still supported though. > - Adding your own aggregates requires you to write them in Hive instead of > native Spark SQL. > - Hive UDAFs are very well written and quite quick, but they are opaque in > processing and memory management; this makes them hard to optimize. By using > 'native' Spark SQL constructs we can actually do a lot more optimization, for > example AggregateEvaluation-style window processing (this would require us to > move some of the code out of the AggregateEvaluation class into some common > base class), or Tungsten-style memory management. > *Performance* > - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED > PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current > implementation in Spark uses a sliding-window approach in these cases. This > means that an aggregate is maintained for every row, so space usage is N (N > being the number of rows). It also means that all these aggregates need > to be updated separately, which takes N*(N-1)/2 updates. The running case > differs from the sliding case because we are only adding data to an aggregate > function (no reset is required): we only need to maintain one aggregate (as > in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update the aggregate for each > row, and get the aggregate value after each update. This is what the new > implementation does. This approach only uses one buffer and only requires N > updates; I am currently working on data with window sizes of 500-1000 doing > running sums, and this saves a lot of time. The CURRENT ROW AND UNBOUNDED > FOLLOWING case also uses this approach and the fact that the aggregate operations > are commutative; there is one twist, though: it processes the input buffer > in reverse. > - Fewer comparisons in the sliding case. The current implementation > determines frame boundaries for every input row. The new implementation makes > more use of the fact that the window is sorted, maintains the boundaries, and > only moves them when the current row order changes. This is a minor > improvement. > - A single Window node is able to process all types of frames for the same > partitioning/ordering. This saves a little time/memory spent buffering and > managing partitions. > - A lot of the staging code is moved from the execution phase to the > initialization phase. Minor performance improvement, and it improves the readability > of the execution code. > The original work, including some benchmarking code for the running case, can > be found here: https://github.com/hvanhovell/spark-window > A PR has been created; it is still work in progress and can be found here: > https://github.com/apache/spark/pull/6278 > Comments, feedback and other discussion are much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
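To make the running-frame argument in the description above concrete, here is a tiny, self-contained illustration (not Spark code) of the claimed difference: a running frame such as UNBOUNDED PRECEDING AND CURRENT ROW can be computed with a single accumulator in N updates, whereas re-evaluating every frame from scratch costs quadratic work.
{code}
import java.util.Arrays;

public class RunningFrameSketch {
  // O(N): one accumulator, updated once per row, emitting the value after each update.
  static double[] runningSums(double[] partition) {
    double[] out = new double[partition.length];
    double acc = 0.0;
    for (int i = 0; i < partition.length; i++) {
      acc += partition[i];
      out[i] = acc;
    }
    return out;
  }

  // O(N^2): the naive per-row frame evaluation that the new implementation avoids.
  static double[] runningSumsNaive(double[] partition) {
    double[] out = new double[partition.length];
    for (int i = 0; i < partition.length; i++) {
      double sum = 0.0;
      for (int j = 0; j <= i; j++) {
        sum += partition[j];
      }
      out[i] = sum;
    }
    return out;
  }

  public static void main(String[] args) {
    double[] partition = {1.0, 2.0, 3.0, 4.0};
    System.out.println(Arrays.toString(runningSums(partition)));      // [1.0, 3.0, 6.0, 10.0]
    System.out.println(Arrays.toString(runningSumsNaive(partition))); // same values, quadratic work
  }
}
{code}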
[jira] [Resolved] (SPARK-7578) User guide update for spark.ml IDF, Normalizer, StandardScaler
[ https://issues.apache.org/jira/browse/SPARK-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7578. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6127 [https://github.com/apache/spark/pull/6127] > User guide update for spark.ml IDF, Normalizer, StandardScaler > -- > > Key: SPARK-7578 > URL: https://issues.apache.org/jira/browse/SPARK-7578 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 1.4.0 > > > Copied from [SPARK-7443]: > {quote} > Now that we have algorithms in spark.ml which are not in spark.mllib, we > should start making subsections for the spark.ml API as needed. We can follow > the structure of the spark.mllib user guide. > * The spark.ml user guide can provide: (a) code examples and (b) info on > algorithms which do not exist in spark.mllib. > * We should not duplicate info in the spark.ml guides. Since spark.mllib is > still the primary API, we should provide links to the corresponding > algorithms in the spark.mllib user guide for more info. > {quote} > Note: I created a new subsection for links to spark.ml-specific guides in > this JIRA's PR: [SPARK-7557]. This transformer can go within the new > subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
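As background for the guide sections being resolved above, here is a minimal sketch of the spark.mllib StandardScaler that the spark.ml documentation links back to for algorithm details; the input vectors are made-up examples and the guide's exact wording is not reproduced here.
{code}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.feature.StandardScaler;
import org.apache.spark.mllib.feature.StandardScalerModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class StandardScalerSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("StandardScaler sketch"));

    JavaRDD<Vector> data = sc.parallelize(Arrays.asList(
        Vectors.dense(1.0, 10.0),
        Vectors.dense(2.0, 20.0),
        Vectors.dense(3.0, 30.0)));

    // withMean = true, withStd = true: center each feature and scale to unit variance.
    StandardScalerModel model = new StandardScaler(true, true).fit(data.rdd());

    Vector scaled = model.transform(Vectors.dense(2.0, 20.0));
    System.out.println(scaled);  // the mean row maps to the zero vector

    sc.stop();
  }
}
{code}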
[jira] [Commented] (SPARK-7712) Native Spark Window Functions & Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555627#comment-14555627 ] Yin Huai commented on SPARK-7712: - [~hvanhovell] Thank you for the update. These optimizations look great. Since we will have a major refactoring of UDAF interfaces after 1.4 release and window functions are quite related to that, I propose to work on our window function improvement with our UDAF refactoring work or after we have fixed the design of the UDAF interfaces (we can figure out how we are going to proceed once we release 1.4). What do you think? Also, please feel free to post more thoughts at here and we can use this jira to do more design discussion. > Native Spark Window Functions & Performance Improvements > - > > Key: SPARK-7712 > URL: https://issues.apache.org/jira/browse/SPARK-7712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 >Reporter: Herman van Hovell tot Westerflier > Original Estimate: 336h > Remaining Estimate: 336h > > Hi All, > After playing with the current spark window implementation, I tried to take > this to next level. My main goal is/was to address the following issues: > Native Spark SQL & Performance. > *Native Spark SQL* > The current implementation uses Hive UDAFs as its aggregation mechanism. We > try to address the following issues by moving to a more 'native' Spark SQL > approach: > - Window functions require Hive. Some people (mostly by accident) use Spark > SQL without Hive. Usage of UDAFs is still supported though. > - Adding your own Aggregates requires you to write them in Hive instead of > native Spark SQL. > - Hive UDAFs are very well written and quite quick, but they are opaque in > processing and memory management; this makes them hard to optimize. By using > 'Native' Spark SQL constructs we can actually do alot more optimization, for > example AggregateEvaluation style Window processing (this would require us to > move some of the code out of the AggregateEvaluation class into some Common > base class), or Tungten style memory management. > *Performance* > - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED > PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current > implementation in spark uses a sliding window approach in these cases. This > means that an aggregate is maintained for every row, so space usage is N (N > being the number of rows). This also means that all these aggregates all need > to be updated separately, this takes N*(N-1)/2 updates. The running case > differs from the Sliding case because we are only adding data to an aggregate > function (no reset is required), we only need to maintain one aggregate (like > in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each > row, and get the aggregate value after each update. This is what the new > implementation does. This approach only uses 1 buffer, and only requires N > updates; I am currently working on data with window sizes of 500-1000 doing > running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED > FOLLOWING case also uses this approach and the fact that aggregate operations > are communitative, there is one twist though it will process the input buffer > in reverse. > - Fewer comparisons in the sliding case. The current implementation > determines frame boundaries for every input row. 
The new implementation makes > more use of the fact that the window is sorted, maintains the boundaries, and > only moves them when the current row order changes. This is a minor > improvement. > - A single Window node is able to process all types of Frames for the same > Partitioning/Ordering. This saves a little time/memory spent buffering and > managing partitions. > - A lot of the staging code is moved from the execution phase to the > initialization phase. Minor performance improvement, and improves readability > of the execution code. > The original work including some benchmarking code for the running case can > be here: https://github.com/hvanhovell/spark-window > A PR has been created, this is still work in progress, and can be found here: > https://github.com/apache/spark/pull/6278 > Comments, feedback and other discussion is much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7712) Native Spark Window Functions & Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7712: Fix Version/s: (was: 1.5.0) > Native Spark Window Functions & Performance Improvements > - > > Key: SPARK-7712 > URL: https://issues.apache.org/jira/browse/SPARK-7712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 >Reporter: Herman van Hovell tot Westerflier > Original Estimate: 336h > Remaining Estimate: 336h > > Hi All, > After playing with the current spark window implementation, I tried to take > this to next level. My main goal is/was to address the following issues: > Native Spark SQL & Performance. > *Native Spark SQL* > The current implementation uses Hive UDAFs as its aggregation mechanism. We > try to address the following issues by moving to a more 'native' Spark SQL > approach: > - Window functions require Hive. Some people (mostly by accident) use Spark > SQL without Hive. Usage of UDAFs is still supported though. > - Adding your own Aggregates requires you to write them in Hive instead of > native Spark SQL. > - Hive UDAFs are very well written and quite quick, but they are opaque in > processing and memory management; this makes them hard to optimize. By using > 'Native' Spark SQL constructs we can actually do alot more optimization, for > example AggregateEvaluation style Window processing (this would require us to > move some of the code out of the AggregateEvaluation class into some Common > base class), or Tungten style memory management. > *Performance* > - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED > PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current > implementation in spark uses a sliding window approach in these cases. This > means that an aggregate is maintained for every row, so space usage is N (N > being the number of rows). This also means that all these aggregates all need > to be updated separately, this takes N*(N-1)/2 updates. The running case > differs from the Sliding case because we are only adding data to an aggregate > function (no reset is required), we only need to maintain one aggregate (like > in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each > row, and get the aggregate value after each update. This is what the new > implementation does. This approach only uses 1 buffer, and only requires N > updates; I am currently working on data with window sizes of 500-1000 doing > running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED > FOLLOWING case also uses this approach and the fact that aggregate operations > are communitative, there is one twist though it will process the input buffer > in reverse. > - Fewer comparisons in the sliding case. The current implementation > determines frame boundaries for every input row. The new implementation makes > more use of the fact that the window is sorted, maintains the boundaries, and > only moves them when the current row order changes. This is a minor > improvement. > - A single Window node is able to process all types of Frames for the same > Partitioning/Ordering. This saves a little time/memory spent buffering and > managing partitions. > - A lot of the staging code is moved from the execution phase to the > initialization phase. Minor performance improvement, and improves readability > of the execution code. 
> The original work including some benchmarking code for the running case can > be here: https://github.com/hvanhovell/spark-window > A PR has been created, this is still work in progress, and can be found here: > https://github.com/apache/spark/pull/6278 > Comments, feedback and other discussion is much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7819: Priority: Critical (was: Major) > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: Fi >Priority: Critical > Attachments: stacktrace.txt, test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7819: Affects Version/s: (was: 1.4.1) 1.4.0 > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: Fi >Priority: Critical > Attachments: stacktrace.txt, test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7819: Target Version/s: 1.4.0 > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: Fi > Attachments: stacktrace.txt, test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7820) Java 8 test suite compile error under SBT
[ https://issues.apache.org/jira/browse/SPARK-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-7820: --- Priority: Minor (was: Major) > Java 8 test suite compile error under SBT > - > > Key: SPARK-7820 > URL: https://issues.apache.org/jira/browse/SPARK-7820 > Project: Spark > Issue Type: Bug > Components: Build, Streaming >Affects Versions: 1.4.0 >Reporter: Saisai Shao >Priority: Minor > > Lots of compilation error is shown when java 8 test suite is enabled in SBT: > {{JAVA_HOME=/usr/java/jdk1.8.0_45 ./sbt/sbt -Pyarn -Phadoop-2.4 > -Dhadoop.version=2.6.0 -Pjava8-test}} > {code} > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:43: > error: cannot find symbol > [error] public class Java8APISuite extends LocalJavaStreamingContext > implements Serializable { > [error]^ > [error] symbol: class LocalJavaStreamingContext > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: > error: cannot find symbol > [error] JavaDStream stream = > JavaTestUtils.attachTestInputStream(ssc, inputData, 1); > [error] ^ > [error] symbol: variable ssc > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: > error: cannot find symbol > [error] JavaDStream stream = > JavaTestUtils.attachTestInputStream(ssc, inputData, 1); > [error] ^ > [error] symbol: variable JavaTestUtils > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:57: > error: cannot find symbol > [error] JavaTestUtils.attachTestOutputStream(letterCount); > [error] ^ > [error] symbol: variable JavaTestUtils > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: > error: cannot find symbol > [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); > [error] ^ > [error] symbol: variable ssc > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: > error: cannot find symbol > [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); > [error] ^ > [error] symbol: variable JavaTestUtils > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:73: > error: cannot find symbol > [error] JavaDStream stream = > JavaTestUtils.attachTestInputStream(ssc, inputData, 1); > [error] ^ > [error] symbol: variable ssc > [error] location: class Java8APISuite > {code} > The class {{JavaAPISuite}} relies on {{LocalJavaStreamingContext}} which > exists in streaming test jar. It is OK for maven compile, since it will > generate test jar, but will be failed in sbt test compile, sbt do not > generate test jar by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatedly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555602#comment-14555602 ] Paul Wu commented on SPARK-7804: Thanks -- you are right. The cache() was a problem, and I also cannot use "List<Row> lr = jrdd.collect();". But jrdd.foreach((Row r) -> { System.out.println(r.get(0) + " ." + r.get(1) + " " + r.get(2)); }); or foreachPartition will work. We really wanted to use DataFrame, but it does not have the partitioning options that we need to improve performance. Using this class, we can take advantage of sending a separate query to the database for each partition at the same time. But as you said, this is internal code (I cannot see it in the Javadoc), so I'm not sure what I can do now. I guess you can close this ticket. Thanks again! > Incorrect results from JDBCRDD -- one record repeatedly > - > > Key: SPARK-7804 > URL: https://issues.apache.org/jira/browse/SPARK-7804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Paul Wu > Labels: JDBCRDD, sql > > Only one record is returned, repeated throughout the RDD, with repeated field values: > > I have a table like: > {code} > attuid name email > 12 john j...@appp.com > 23 tom t...@appp.com > 34 tony t...@appp.com > {code} > My code: > {code} > JavaSparkContext sc = new JavaSparkContext(sparkConf); > String url = ""; > java.util.Properties prop = new Properties(); > List<JDBCPartition> partitionList = new ArrayList<>(); > partitionList.add(new JDBCPartition("1=1", 0)); > List<StructField> fields = new ArrayList<>(); > fields.add(DataTypes.createStructField("attuid", DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("name", DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("email", DataTypes.StringType, true)); > StructType schema = DataTypes.createStructType(fields); > JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(), > JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop), > schema, > " USERS", > new String[]{"attuid", "name", "email"}, > new Filter[]{}, > partitionList.toArray(new JDBCPartition[0])); > System.out.println("count before to Java RDD=" + jdbcRDD.cache().count()); > JavaRDD<Row> jrdd = jdbcRDD.toJavaRDD(); > System.out.println("count=" + jrdd.count()); > List<Row> lr = jrdd.collect(); > for (Row r : lr) { > for (int ii = 0; ii < r.length(); ii++) { > System.out.println(r.getString(ii)); > } > } > {code} > === > The result is: > {code} > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > 34 > tony > t...@appp.com > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
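Regarding the wish above for partition options on the DataFrame side: depending on the Spark version in use, the JDBC data source may already accept one WHERE-clause predicate per partition, which is essentially what the JDBCPartition("1=1", 0) trick does. The sketch below assumes Spark 1.4's DataFrameReader.jdbc overload that takes an array of predicates (one partition per predicate); the URL, table, predicates, and the "driver" property are placeholders and should be verified against the actual API before relying on them.
{code}
import java.util.Properties;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class JdbcPredicateSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("JDBC predicate sketch"));
    SQLContext sqlContext = new SQLContext(sc);

    Properties props = new Properties();
    props.setProperty("driver", "oracle.jdbc.OracleDriver");  // assumption: driver is picked up from this property

    // One partition per predicate; each partition issues its own query in parallel.
    String[] predicates = {"attuid < 20", "attuid >= 20"};

    DataFrame users = sqlContext.read()
        .jdbc("jdbc:oracle:thin:@//dbhost:1521/service", "USERS", predicates, props);

    for (Row r : users.collectAsList()) {
      System.out.println(r.getString(0) + " " + r.getString(1) + " " + r.getString(2));
    }

    sc.stop();
  }
}
{code}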
[jira] [Updated] (SPARK-7820) Java 8 test suite compile error under SBT
[ https://issues.apache.org/jira/browse/SPARK-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-7820: --- Component/s: Streaming > Java 8 test suite compile error under SBT > - > > Key: SPARK-7820 > URL: https://issues.apache.org/jira/browse/SPARK-7820 > Project: Spark > Issue Type: Bug > Components: Build, Streaming >Affects Versions: 1.4.0 >Reporter: Saisai Shao > > Lots of compilation error is shown when java 8 test suite is enabled in SBT: > {{JAVA_HOME=/usr/java/jdk1.8.0_45 ./sbt/sbt -Pyarn -Phadoop-2.4 > -Dhadoop.version=2.6.0 -Pjava8-test}} > {code} > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:43: > error: cannot find symbol > [error] public class Java8APISuite extends LocalJavaStreamingContext > implements Serializable { > [error]^ > [error] symbol: class LocalJavaStreamingContext > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: > error: cannot find symbol > [error] JavaDStream stream = > JavaTestUtils.attachTestInputStream(ssc, inputData, 1); > [error] ^ > [error] symbol: variable ssc > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: > error: cannot find symbol > [error] JavaDStream stream = > JavaTestUtils.attachTestInputStream(ssc, inputData, 1); > [error] ^ > [error] symbol: variable JavaTestUtils > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:57: > error: cannot find symbol > [error] JavaTestUtils.attachTestOutputStream(letterCount); > [error] ^ > [error] symbol: variable JavaTestUtils > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: > error: cannot find symbol > [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); > [error] ^ > [error] symbol: variable ssc > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: > error: cannot find symbol > [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); > [error] ^ > [error] symbol: variable JavaTestUtils > [error] location: class Java8APISuite > [error] > /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:73: > error: cannot find symbol > [error] JavaDStream stream = > JavaTestUtils.attachTestInputStream(ssc, inputData, 1); > [error] ^ > [error] symbol: variable ssc > [error] location: class Java8APISuite > {code} > The class {{JavaAPISuite}} relies on {{LocalJavaStreamingContext}} which > exists in streaming test jar. It is OK for maven compile, since it will > generate test jar, but will be failed in sbt test compile, sbt do not > generate test jar by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1487#comment-1487 ] Fi commented on SPARK-7819: --- FYI, I believe I have worked around the problem for now by disabling isolation by hacking: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala -isolationOn = true, + isolationOn = false, I suppose this is good enough for us since we only need the pre-built version of hive (as provided by the mapr4 profile). > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Fi > Attachments: stacktrace.txt, test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
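One alternative to patching isolationOn that may be worth checking: recent 1.4 builds appear to expose a spark.sql.hive.metastore.sharedPrefixes setting that makes the isolated Hive client loader share selected classes with the main class loader. If that option is present in this build (an assumption), listing the MapR packages there might avoid the native-library double-load without a source change; the property name and value below are unverified guesses for this environment.
{code}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SharedPrefixWorkaroundSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("hive isolation workaround sketch")
        // Assumption: this option exists in the 1.4 build being tested, and "com.mapr"
        // covers the classes that load libMapRClient; both points need verification.
        .set("spark.sql.hive.metastore.sharedPrefixes", "com.mapr,org.apache.hadoop");

    JavaSparkContext sc = new JavaSparkContext(conf);
    // ... create a HiveContext and run queries as in the attached test.py ...
    sc.stop();
  }
}
{code}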
[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fi updated SPARK-7819: -- Attachment: stacktrace.txt > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Fi > Attachments: stacktrace.txt, test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7820) Java 8 test suite compile error under SBT
Saisai Shao created SPARK-7820: -- Summary: Java 8 test suite compile error under SBT Key: SPARK-7820 URL: https://issues.apache.org/jira/browse/SPARK-7820 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Saisai Shao Lots of compilation error is shown when java 8 test suite is enabled in SBT: {{JAVA_HOME=/usr/java/jdk1.8.0_45 ./sbt/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Pjava8-test}} {code} [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:43: error: cannot find symbol [error] public class Java8APISuite extends LocalJavaStreamingContext implements Serializable { [error]^ [error] symbol: class LocalJavaStreamingContext [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: error: cannot find symbol [error] JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: error: cannot find symbol [error] JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:57: error: cannot find symbol [error] JavaTestUtils.attachTestOutputStream(letterCount); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: error: cannot find symbol [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: error: cannot find symbol [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:73: error: cannot find symbol [error] JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite {code} The class {{JavaAPISuite}} relies on {{LocalJavaStreamingContext}} which exists in streaming test jar. It is OK for maven compile, since it will generate test jar, but will be failed in sbt test compile, sbt do not generate test jar by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fi updated SPARK-7819: -- Attachment: (was: stacktrace.txt) > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Fi > Attachments: test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
[ https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fi updated SPARK-7819: -- Attachment: stacktrace.txt test.py > Isolated Hive Client Loader appears to cause Native Library > libMapRClient.4.0.2-mapr.so already loaded in another classloader error > --- > > Key: SPARK-7819 > URL: https://issues.apache.org/jira/browse/SPARK-7819 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Fi > Attachments: stacktrace.txt, test.py > > > In reference to the pull request: > https://github.com/apache/spark/pull/5876 > I have been running the Spark 1.3 branch for some time with no major hiccups, > and recently switched to the Spark 1.4 branch. > I build my spark distribution with the following build command: > make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive > -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver > When running a python script containing a series of smoke tests I use to > validate the build, I encountered an error under the following conditions: > * start a spark context > * start a hive context > * run any hive query > * stop the spark context > * start a second spark context > * run any hive query > *** ERROR > From what I can tell, the Isolated Class Loader is hitting a MapR class that > is loading its native library (presumedly as part of a static initializer). > Unfortunately, the JVM prohibits this the second time around. > I would think that shutting down the SparkContext would clear out any > vestigials of the JVM, so I'm surprised that this would even be a problem. > Note: all other smoke tests we are running passes fine. > I will attach the stacktrace and a python script reproducing the issue (at > least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error
Fi created SPARK-7819: - Summary: Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error Key: SPARK-7819 URL: https://issues.apache.org/jira/browse/SPARK-7819 Project: Spark Issue Type: Bug Affects Versions: 1.4.1 Reporter: Fi Attachments: stacktrace.txt, test.py In reference to the pull request: https://github.com/apache/spark/pull/5876 I have been running the Spark 1.3 branch for some time with no major hiccups, and recently switched to the Spark 1.4 branch. I build my Spark distribution with the following build command: make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver When running a Python script containing a series of smoke tests I use to validate the build, I encountered an error under the following conditions: * start a spark context * start a hive context * run any hive query * stop the spark context * start a second spark context * run any hive query *** ERROR From what I can tell, the Isolated Class Loader is hitting a MapR class that is loading its native library (presumably as part of a static initializer). Unfortunately, the JVM prohibits this the second time around. I would think that shutting down the SparkContext would clear out any such vestiges in the JVM, so I'm surprised that this would even be a problem. Note: all other smoke tests we are running pass fine. I will attach the stacktrace and a Python script reproducing the issue (at least for my environment and build). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
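For readers unfamiliar with this error: the JVM allows a given native library to be linked by only one classloader at a time, so when the isolated Hive client loader loads the MapR class again in the second context, the second {{System.loadLibrary}} call fails. A minimal sketch of the pattern involved; the library name is illustrative and running this requires such a library on {{java.library.path}}:
{code}
// Illustrative only: a class whose initializer links a native library.
// Loading a class like this through two isolated classloaders in the same JVM
// fails the second time with "Native Library ... already loaded in another
// classloader", matching the attached stacktrace.
object NativeBridge {
  System.loadLibrary("MapRClient") // hypothetical library name
}
{code}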
[jira] [Created] (SPARK-7818) Java 8 test suite compile error under SBT
Saisai Shao created SPARK-7818: -- Summary: Java 8 test suite compile error under SBT Key: SPARK-7818 URL: https://issues.apache.org/jira/browse/SPARK-7818 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Saisai Shao Many compilation errors are shown when the Java 8 test suite is enabled in SBT: {{JAVA_HOME=/usr/java/jdk1.8.0_45 ./sbt/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Pjava8-test}} {code} [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:43: error: cannot find symbol [error] public class Java8APISuite extends LocalJavaStreamingContext implements Serializable { [error]^ [error] symbol: class LocalJavaStreamingContext [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: error: cannot find symbol [error] JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: error: cannot find symbol [error] JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:57: error: cannot find symbol [error] JavaTestUtils.attachTestOutputStream(letterCount); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: error: cannot find symbol [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: error: cannot find symbol [error] List> result = JavaTestUtils.runStreams(ssc, 2, 2); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:73: error: cannot find symbol [error] JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite {code} The class {{Java8APISuite}} relies on {{LocalJavaStreamingContext}}, which lives in the streaming test jar. The Maven build is fine because it generates that test jar, but the sbt test compile fails because sbt does not generate test jars by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7536) Audit MLlib Python API for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1459#comment-1459 ] Manoj Kumar commented on SPARK-7536: Should all of this be done before the 1.4 release? > Audit MLlib Python API for 1.4 > -- > > Key: SPARK-7536 > URL: https://issues.apache.org/jira/browse/SPARK-7536 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? SPARK-7667 > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. SPARK-7666 > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. SPARK-7665 > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python. > ** classification > *** StreamingLogisticRegressionWithSGD SPARK-7633 > ** clustering > *** GaussianMixture SPARK-6258 > *** LDA SPARK-6259 > *** Power Iteration Clustering SPARK-5962 > *** StreamingKMeans SPARK-4118 > ** evaluation > *** MultilabelMetrics SPARK-6094 > ** feature > *** ElementwiseProduct SPARK-7605 > *** PCA SPARK-7604 > ** linalg > *** Distributed linear algebra SPARK-6100 > ** pmml.export SPARK-7638 > ** regression > *** StreamingLinearRegressionWithSGD SPARK-4127 > ** stat > *** KernelDensity SPARK-7639 > ** util > *** MLUtils SPARK-6263 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7404: --- Assignee: Apache Spark (was: Ram Sriharsha) > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7404: --- Assignee: Ram Sriharsha (was: Apache Spark) > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1422#comment-1422 ] Apache Spark commented on SPARK-7404: - User 'harsha2010' has created a pull request for this issue: https://github.com/apache/spark/pull/6344 > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
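To make the one-line description concrete, here is a hedged sketch of how a regression evaluator plugs into model tuning via {{CrossValidator}}, written against the 1.4 spark.ml API and assuming a DataFrame {{training}} with "label" and "features" columns; it is not taken from the pull request itself:
{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Tune a regression model with the pipeline API; the evaluator scores each
// parameter combination (e.g. by RMSE) during cross-validation.
val lr = new LinearRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .build()
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
val model = cv.fit(training) // `training` is an assumed DataFrame
{code}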
[jira] [Commented] (SPARK-7507) pyspark.sql.types.StructType and Row should implement __iter__()
[ https://issues.apache.org/jira/browse/SPARK-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555482#comment-14555482 ] Nicholas Chammas commented on SPARK-7507: - Since {{Row}} seems most analogous to a {{namedtuple}} in Python, here is an interesting parallel that suggests we should perhaps instead support {{vars(Row)}} and not {{dict(Row)}}. http://stackoverflow.com/q/26180528/877069 https://docs.python.org/3/library/functions.html#vars https://docs.python.org/3/library/collections.html#collections.somenamedtuple._asdict {quote} somenamedtuple._asdict() Return a new OrderedDict which maps field names to their corresponding values. Note, this method is no longer needed now that the same effect can be achieved by using the built-in vars() function: {quote} > pyspark.sql.types.StructType and Row should implement __iter__() > > > Key: SPARK-7507 > URL: https://issues.apache.org/jira/browse/SPARK-7507 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Reporter: Nicholas Chammas >Priority: Minor > > {{StructType}} looks an awful lot like a Python dictionary. > However, it doesn't implement {{\_\_iter\_\_()}}, so doing a quick conversion > like this doesn't work: > {code} > >>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}'])) > >>> df.schema > StructType(List(StructField(name,StringType,true))) > >>> dict(df.schema) > Traceback (most recent call last): > File "", line 1, in > TypeError: 'StructType' object is not iterable > {code} > This would be super helpful for doing any custom schema manipulations without > having to go through the whole {{.json() -> json.loads() -> manipulate() -> > json.dumps() -> .fromJson()}} charade. > Same goes for {{Row}}, which offers an > [{{asDict()}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.Row.asDict] > method but doesn't support the more Pythonic {{dict(Row)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7605) Python API for ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555475#comment-14555475 ] Manoj Kumar commented on SPARK-7605: Hi, Can this be assigned to me? > Python API for ElementwiseProduct > - > > Key: SPARK-7605 > URL: https://issues.apache.org/jira/browse/SPARK-7605 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Yanbo Liang > > Python API for org.apache.spark.mllib.feature.ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7322) Add DataFrame DSL for window function support
[ https://issues.apache.org/jira/browse/SPARK-7322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555462#comment-14555462 ] Apache Spark commented on SPARK-7322: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/6343 > Add DataFrame DSL for window function support > - > > Key: SPARK-7322 > URL: https://issues.apache.org/jira/browse/SPARK-7322 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Cheng Hao > Labels: DataFrame > > Here's a proposal for supporting window functions in the DataFrame DSL: > 1. Add an over function to Column: > {code} > class Column { > ... > def over(window: Window): Column > ... > } > {code} > 2. Window: > {code} > object Window { > def partitionBy(...): Window > def orderBy(...): Window > object Frame { > def unbounded: Frame > def preceding(n: Long): Frame > def following(n: Long): Frame > } > class Frame > } > class Window { > def orderBy(...): Window > def rowsBetween(Frame, Frame): Window > def rangeBetween(Frame, Frame): Window // maybe add this later > } > {code} > Here's an example to use it: > {code} > df.select( > avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) > .rowsBetween(Frame.unbounded, Frame.currentRow)) > ) > df.select( > avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) > .rowsBetween(Frame.preceding(50), Frame.following(10))) > ) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7817) Intellij Idea cannot find symbol when import scala object
bofei.xiao created SPARK-7817: - Summary: Intellij Idea cannot find symbol when import scala object Key: SPARK-7817 URL: https://issues.apache.org/jira/browse/SPARK-7817 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.3.1 Environment: Microsoft Server 2003, Java 1.6, Maven 3.04 Reporter: bofei.xiao [ERROR] src\main\java\org\apache\spark\exaples\streaming\JavaQueueStream.java:[33,47] cannot find symbol symbol : class StreamingExamples location: package org.apache.spark.exaples.streaming In fact, StreamingExamples is an object under org.apache.spark.exaples.streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7785: --- Priority: Minor (was: Major) > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar >Priority: Minor > > Add __str__ and __repr__ to matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7785: --- Description: Add __str__ and __repr__ to matrices. (was: For DenseMatrices. Class Methods __str__, transpose Object Methods zeros, ones, eye, rand, randn, diag For SparseMatrices Class Methods __str__, transpose Object Methods, fromCoo, speye, sprand, sprandn, spdiag, Matrices Methods, horzcat, vertcat) > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > Add __str__ and __repr__ to matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555440#comment-14555440 ] Burak Yavuz commented on SPARK-7785: For operations with BlockMatrix, you will need these classes. > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555438#comment-14555438 ] Manoj Kumar commented on SPARK-7785: Sounds great. In the Pull Request, I just added support for __str__ and __repr__ . But was there any particular need to have these classes in the first place, since almost all of them are wrappers around numpy and scipy? > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555436#comment-14555436 ] Apache Spark commented on SPARK-7785: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6342 > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7785: --- Assignee: (was: Apache Spark) > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7785: --- Assignee: Apache Spark > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar >Assignee: Apache Spark > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients
[ https://issues.apache.org/jira/browse/SPARK-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555435#comment-14555435 ] meiyoula edited comment on SPARK-7789 at 5/22/15 1:51 AM: -- I also used hive 0.13 and Kerberos. [~deanchen]Has you executed the select statement. Below is my test sql statement. {quote} create table s1 ( key1 string, c11 int, c12 string, c13 string, c14 string ) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties( "hbase.columns.mapping" = ":key, info:c11, info:c12, info:c13, info:c14 ") tblproperties("hbase.table.name" = "shb1"); select * from s1; {quote} After reading the hive and hbase code, I think the root cause is that: When the driver obtained the hbase token and add it into Credentials of CurrentUser, the hbase token will also go to executors. So the authentication of user(in executor) is TOKEN to hbase.But the hive code will send request to hbase sever to obtain token no matter what the authentication is. And the hbase code just allow the Kerberos authenticated clients to obtain token. So the exception occurs. So I think the HIVE-8874 is meaningful, it should be merged. was (Author: meiyoula): I also used hive 0.13 and Kerberos. [~deanchen]Has you executed the select statement. Below is my test sql statement. After reading the hive and hbase code, I think the root cause is that: When the driver obtained the hbase token and add it into Credentials of CurrentUser, the hbase token will also go to executors. So the authentication of user(in executor) is TOKEN to hbase.But the hive code will send request to hbase sever to obtain token no matter what the authentication is. And the hbase code just allow the Kerberos authenticated clients to obtain token. So the exception occurs. {quote} create table s1 ( key1 string, c11 int, c12 string, c13 string, c14 string ) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties( "hbase.columns.mapping" = ":key, info:c11, info:c12, info:c13, info:c14 ") tblproperties("hbase.table.name" = "shb1"); select * from s1; {quote} So I think the HIVE-8874 is meaningful, it should be merged. 
> sql on security hbase:Token generation only allowed for Kerberos > authenticated clients > --- > > Key: SPARK-7789 > URL: https://issues.apache.org/jira/browse/SPARK-7789 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: meiyoula > > After creating a hbase table in beeline, then execute select sql statement, > Executor occurs the exception: > {quote} > java.lang.IllegalStateException: Error while configuring input job properties > at > org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343) > at > org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279) > at > org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804) > at > org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774) > at > org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at scala.Option.map(Option.scala:145) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark
[jira] [Updated] (SPARK-7680) Add a fake Receiver that generates random strings, useful for prototyping
[ https://issues.apache.org/jira/browse/SPARK-7680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7680: - Target Version/s: (was: 1.4.0) > Add a fake Receiver that generates random strings, useful for prototyping > - > > Key: SPARK-7680 > URL: https://issues.apache.org/jira/browse/SPARK-7680 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients
[ https://issues.apache.org/jira/browse/SPARK-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555435#comment-14555435 ] meiyoula commented on SPARK-7789: - I also used hive 0.13 and Kerberos. [~deanchen]Has you executed the select statement. Below is my test sql statement. After reading the hive and hbase code, I think the root cause is that: When the driver obtained the hbase token and add it into Credentials of CurrentUser, the hbase token will also go to executors. So the authentication of user(in executor) is TOKEN to hbase.But the hive code will send request to hbase sever to obtain token no matter what the authentication is. And the hbase code just allow the Kerberos authenticated clients to obtain token. So the exception occurs. {quote} create table s1 ( key1 string, c11 int, c12 string, c13 string, c14 string ) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties( "hbase.columns.mapping" = ":key, info:c11, info:c12, info:c13, info:c14 ") tblproperties("hbase.table.name" = "shb1"); select * from s1; {quote} So I think the HIVE-8874 is meaningful, it should be merged. > sql on security hbase:Token generation only allowed for Kerberos > authenticated clients > --- > > Key: SPARK-7789 > URL: https://issues.apache.org/jira/browse/SPARK-7789 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: meiyoula > > After creating a hbase table in beeline, then execute select sql statement, > Executor occurs the exception: > {quote} > java.lang.IllegalStateException: Error while configuring input job properties > at > org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343) > at > org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279) > at > org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804) > at > org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774) > at > org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at scala.Option.map(Option.scala:145) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.hbase.security.AccessDeniedException: > org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only > allowed for Kerberos authenticated clients > at > org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:124) > at > org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService$1.getAuthenticationToken(AuthenticationProtos.java:4267) > at > org.apache.hadoop.hbase.protobuf.generated.Authenticati
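The reasoning in the comment above rests on the authentication method HBase sees: the driver is Kerberos-authenticated, but executors only carry the shipped delegation token, and HBase's TokenProvider refuses to mint tokens for non-Kerberos clients. A small sketch (standard Hadoop security API, not the Hive/HBase code path itself) of how that state can be inspected:
{code}
import org.apache.hadoop.security.UserGroupInformation

// On the driver this typically prints KERBEROS; on an executor the shipped
// credentials make it TOKEN, which HBase's TokenProvider rejects when asked
// to generate a new delegation token, producing the AccessDeniedException.
val ugi = UserGroupInformation.getCurrentUser
println(s"user=${ugi.getShortUserName} auth=${ugi.getAuthenticationMethod}")
println(s"has Kerberos credentials: ${ugi.hasKerberosCredentials}")
{code}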
[jira] [Updated] (SPARK-7785) Add pretty printing to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7785: --- Summary: Add pretty printing to pyspark.mllib.linalg.Matrices (was: Add missing items to pyspark.mllib.linalg.Matrices) > Add pretty printing to pyspark.mllib.linalg.Matrices > > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555423#comment-14555423 ] Apache Spark commented on SPARK-7042: - User 'kostya-sh' has created a pull request for this issue: https://github.com/apache/spark/pull/6341 > Spark version of akka-actor_2.11 is not compatible with the official > akka-actor_2.11 2.3.x > -- > > Key: SPARK-7042 > URL: https://issues.apache.org/jira/browse/SPARK-7042 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.1 >Reporter: Konstantin Shaposhnikov >Priority: Minor > > When connecting to a remote Spark cluster (that runs Spark branch-1.3 built > with Scala 2.11) from an application that uses akka 2.3.9 I get the following > error: > {noformat} > 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] > [sparkDriver-akka.actor.default-dispatcher-5] - > Association with remote system [akka.tcp://sparkExecutor@server:59007] has > failed, address is now gated for [5000] ms. > Reason is: [akka.actor.Identify; local class incompatible: stream classdesc > serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. > {noformat} > It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been > built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations > (see https://issues.scala-lang.org/browse/SI-8549). > The following steps can resolve the issue: > - re-build the custom akka library that is used by Spark with the more recent > version of Scala compiler (e.g. 2.11.6) > - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo > - update version of akka used by spark (master and 1.3 branch) > I would also suggest to upgrade to the latest version of akka 2.3.9 (or > 2.3.10 that should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
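For context, the quoted description refers to SI-8549: the Scala 2.11.0 compiler ignored {{@SerialVersionUID}}, so annotated classes (such as akka.actor.Identify) ended up with generated UIDs. The annotation itself is plain Scala; a trivial example with an arbitrary class name:
{code}
// With a compiler that honors the annotation (e.g. Scala 2.11.6), instances of
// this class serialize with serialVersionUID = 1L. Scala 2.11.0 dropped the
// annotation (SI-8549), yielding a generated UID and "local class incompatible"
// errors against binaries built by a fixed compiler.
@SerialVersionUID(1L)
case class Handshake(messageId: Any) extends Serializable
{code}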
[jira] [Assigned] (SPARK-7446) Inverse transform for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-7446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7446: --- Assignee: holdenk (was: Apache Spark) > Inverse transform for StringIndexer > --- > > Key: SPARK-7446 > URL: https://issues.apache.org/jira/browse/SPARK-7446 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: holdenk >Priority: Minor > > It is useful to convert the encoded indices back to their string > representation for result inspection. We can add a parameter to > StringIndexer/StringIndexModel for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7446) Inverse transform for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-7446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555420#comment-14555420 ] Apache Spark commented on SPARK-7446: - User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/6339 > Inverse transform for StringIndexer > --- > > Key: SPARK-7446 > URL: https://issues.apache.org/jira/browse/SPARK-7446 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: holdenk >Priority: Minor > > It is useful to convert the encoded indices back to their string > representation for result inspection. We can add a parameter to > StringIndexer/StringIndexModel for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
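A hedged sketch of the idea behind the requested inverse transform: map indices back through the label array learned when the indexer was fitted. It uses plain DataFrame operations with assumed inputs ({{labels}} and an {{indexed}} DataFrame holding a Double column "categoryIndex"); it is not the API added by the linked pull request.
{code}
import org.apache.spark.sql.functions.{col, udf}

// Inverse of string indexing: index i maps back to labels(i).
val labels: Array[String] = Array("a", "b", "c")           // assumed fitted labels
val indexToString = udf { (index: Double) => labels(index.toInt) }
val restored = indexed.withColumn("originalCategory", indexToString(col("categoryIndex")))
{code}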
[jira] [Assigned] (SPARK-7446) Inverse transform for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-7446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7446: --- Assignee: Apache Spark (was: holdenk) > Inverse transform for StringIndexer > --- > > Key: SPARK-7446 > URL: https://issues.apache.org/jira/browse/SPARK-7446 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Minor > > It is useful to convert the encoded indices back to their string > representation for result inspection. We can add a parameter to > StringIndexer/StringIndexModel for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7657) [YARN] Show driver link in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-7657: Assignee: Hari Shreedharan (was: Imran Rashid) > [YARN] Show driver link in Spark UI > --- > > Key: SPARK-7657 > URL: https://issues.apache.org/jira/browse/SPARK-7657 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.4.0 >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan >Priority: Minor > Fix For: 1.5.0 > > > Currently, the driver link does not show up in the application UI. It is > painful to debug apps running in cluster mode if the link does not show up. > Client mode is fine since the links are local to the client machine. > In YARN mode, it is possible to just get this from the YARN container report. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7657) [YARN] Show driver link in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-7657: --- Assignee: Imran Rashid > [YARN] Show driver link in Spark UI > --- > > Key: SPARK-7657 > URL: https://issues.apache.org/jira/browse/SPARK-7657 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.4.0 >Reporter: Hari Shreedharan >Assignee: Imran Rashid >Priority: Minor > Fix For: 1.5.0 > > > Currently, the driver link does not show up in the application UI. It is > painful to debug apps running in cluster mode if the link does not show up. > Client mode is fine since the links are local to the client machine. > In YARN mode, it is possible to just get this from the YARN container report. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7657) [YARN] Show driver link in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-7657. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6166 [https://github.com/apache/spark/pull/6166] > [YARN] Show driver link in Spark UI > --- > > Key: SPARK-7657 > URL: https://issues.apache.org/jira/browse/SPARK-7657 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.4.0 >Reporter: Hari Shreedharan >Priority: Minor > Fix For: 1.5.0 > > > Currently, the driver link does not show up in the application UI. It is > painful to debug apps running in cluster mode if the link does not show up. > Client mode is fine since the links are local to the client machine. > In YARN mode, it is possible to just get this from the YARN container report. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
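The last sentence of the description ("get this from the YARN container report") can be illustrated with the YARN client API; a hedged sketch assuming the driver's {{ContainerId}} is already known, and not the code merged in the pull request:
{code}
import org.apache.hadoop.yarn.api.records.ContainerId
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Look up a container's assigned node and log URL from the ResourceManager.
def driverLinks(containerId: ContainerId): (String, String) = {
  val yarnClient = YarnClient.createYarnClient()
  yarnClient.init(new YarnConfiguration())
  yarnClient.start()
  try {
    val report = yarnClient.getContainerReport(containerId)
    (report.getAssignedNode.toString, report.getLogUrl)
  } finally {
    yarnClient.stop()
  }
}
{code}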
[jira] [Comment Edited] (SPARK-7535) Audit Pipeline APIs for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554629#comment-14554629 ] Xiangrui Meng edited comment on SPARK-7535 at 5/22/15 1:21 AM: --- Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc (SPARK-7808) 10. param and getParam should be final (SPARK-7816) 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. (SPARK-7794) 13. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. was (Author: mengxr): Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc (SPARK-7808) 10. param and getParam should be final 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. (SPARK-7794) 13. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. > Audit Pipeline APIs for 1.4 > --- > > Key: SPARK-7535 > URL: https://issues.apache.org/jira/browse/SPARK-7535 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > This is an umbrella for auditing the Pipeline (spark.ml) APIs. Items to > check: > * Public/protected/private access > * Consistency across spark.ml > * Classes, methods, and parameters in spark.mllib but missing in spark.ml > ** We should create JIRAs for each of these (under an umbrella) as to-do > items for future releases. > For each algorithm or API component, create a subtask under this umbrella. > Some major new items: > * new feature transformers > * tree models > * elastic-net > * ML attributes > * developer APIs (Predictor, Classifier, Regressor) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7816) Mark params, getters, and user-facing classes final
Xiangrui Meng created SPARK-7816: Summary: Mark params, getters, and user-facing classes final Key: SPARK-7816 URL: https://issues.apache.org/jira/browse/SPARK-7816 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng This is to tighten spark.ml APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7815) Move UTF8String into Unsafe java package, and have it work against memory address directly
Reynold Xin created SPARK-7815: -- Summary: Move UTF8String into Unsafe java package, and have it work against memory address directly Key: SPARK-7815 URL: https://issues.apache.org/jira/browse/SPARK-7815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu So we can avoid an extra copy of the data into a byte array. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
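The description is terse; the point of working "against memory address directly" is to read bytes at an off-heap address instead of first materializing a byte array. A minimal, illustrative use of {{sun.misc.Unsafe}} showing that kind of access (not the UTF8String design itself):
{code}
// Write and read bytes at a raw memory address -- the kind of access an
// address-based UTF8String comparison could build on, with no byte[] copy.
val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[sun.misc.Unsafe]

val bytes = "spark".getBytes("UTF-8")
val addr = unsafe.allocateMemory(bytes.length)
try {
  bytes.zipWithIndex.foreach { case (b, i) => unsafe.putByte(addr + i, b) }
  println(s"first byte at address $addr = ${unsafe.getByte(addr).toChar}")
} finally {
  unsafe.freeMemory(addr)
}
{code}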
[jira] [Created] (SPARK-7814) Turn code generation on by default
Reynold Xin created SPARK-7814: -- Summary: Turn code generation on by default Key: SPARK-7814 URL: https://issues.apache.org/jira/browse/SPARK-7814 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
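For reference, code generation in 1.4 is opt-in behind a SQL conf; this issue is about flipping that default. A small example of how it is enabled today, assuming a {{sqlContext}} is in scope (as in spark-shell):
{code}
// As of Spark 1.4, expression code generation is controlled by this conf
// (default false); SPARK-7814 proposes enabling it by default.
sqlContext.setConf("spark.sql.codegen", "true")
{code}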
[jira] [Updated] (SPARK-7813) Push code generation into expression definition
[ https://issues.apache.org/jira/browse/SPARK-7813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7813: --- Labels: codegen (was: ) > Push code generation into expression definition > --- > > Key: SPARK-7813 > URL: https://issues.apache.org/jira/browse/SPARK-7813 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu >Priority: Critical > Labels: codegen > > Right now we define all expression code generation in a single file. If we > want to do code generation for most default expressions, it'd only make sense > to push them into the expression definitions themselves (similar to "eval" > method). > We would need to design an updated version of the expression API for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7813) Push code generation into expression definition
[ https://issues.apache.org/jira/browse/SPARK-7813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7813: --- Summary: Push code generation into expression definition (was: Push code generation into expression definition themselves) > Push code generation into expression definition > --- > > Key: SPARK-7813 > URL: https://issues.apache.org/jira/browse/SPARK-7813 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu >Priority: Critical > > Right now we define all expression code generation in a single file. If we > want to do code generation for most default expressions, it'd only make sense > to push them into the expression definitions themselves (similar to "eval" > method). > We would need to design an updated version of the expression API for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7813) Push code generation into expression definition themselves
Reynold Xin created SPARK-7813: -- Summary: Push code generation into expression definition themselves Key: SPARK-7813 URL: https://issues.apache.org/jira/browse/SPARK-7813 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Priority: Critical Right now we define all expression code generation in a single file. If we want to do code generation for most default expressions, it'd only make sense to push them into the expression definitions themselves (similar to "eval" method). We would need to design an updated version of the expression API for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
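A self-contained toy sketch of the design direction described above: each expression defines its own code generation next to its interpreted {{eval}}, rather than all generation living in one central file. The names and shapes here are invented for illustration and are not Catalyst's actual API:
{code}
// Toy model: every expression knows how to evaluate itself and how to emit
// equivalent source code for a compiled evaluation path.
sealed trait Expr {
  def eval(row: Map[String, Int]): Int   // interpreted path
  def genCode(rowVar: String): String    // generated-source path
}

case class Attr(name: String) extends Expr {
  def eval(row: Map[String, Int]): Int = row(name)
  def genCode(rowVar: String): String = s"""$rowVar("$name")"""
}

case class Lit(value: Int) extends Expr {
  def eval(row: Map[String, Int]): Int = value
  def genCode(rowVar: String): String = value.toString
}

case class Add(left: Expr, right: Expr) extends Expr {
  def eval(row: Map[String, Int]): Int = left.eval(row) + right.eval(row)
  def genCode(rowVar: String): String =
    s"(${left.genCode(rowVar)} + ${right.genCode(rowVar)})"
}

val expr = Add(Add(Attr("a"), Attr("b")), Lit(1))
println(expr.eval(Map("a" -> 2, "b" -> 3)))  // 6
println(expr.genCode("row"))                 // ((row("a") + row("b")) + 1)
{code}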
[jira] [Resolved] (SPARK-7219) HashingTF should output ML attributes
[ https://issues.apache.org/jira/browse/SPARK-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7219. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6308 [https://github.com/apache/spark/pull/6308] > HashingTF should output ML attributes > - > > Key: SPARK-7219 > URL: https://issues.apache.org/jira/browse/SPARK-7219 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Trivial > Fix For: 1.4.0 > > > HashingTF knows the output feature dimension, which should be in the output > ML attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7812) Speed up SQL code generation
Reynold Xin created SPARK-7812: -- Summary: Speed up SQL code generation Key: SPARK-7812 URL: https://issues.apache.org/jira/browse/SPARK-7812 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Priority: Critical Explore other frameworks to speed up code generation for SQL expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7794) Update RegexTokenizer default settings.
[ https://issues.apache.org/jira/browse/SPARK-7794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7794. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6330 [https://github.com/apache/spark/pull/6330] > Update RegexTokenizer default settings. > --- > > Key: SPARK-7794 > URL: https://issues.apache.org/jira/browse/SPARK-7794 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.4.0 > > > Should use a simple default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7811) Fix typo on slf4j configuration on metrics.properties.template
Judy Nash created SPARK-7811: Summary: Fix typo on slf4j configuration on metrics.properties.template Key: SPARK-7811 URL: https://issues.apache.org/jira/browse/SPARK-7811 Project: Spark Issue Type: Bug Reporter: Judy Nash Priority: Minor There is a minor typo in the Slf4jSink configuration in metrics.properties.template: slf4j is misspelled as sl4j in two of the configuration entries. Correcting the typo ensures users' custom settings will be loaded correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7776) Add shutdown hook to stop StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-7776. -- Resolution: Fixed Fix Version/s: 1.4.0 > Add shutdown hook to stop StreamingContext > -- > > Key: SPARK-7776 > URL: https://issues.apache.org/jira/browse/SPARK-7776 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > Fix For: 1.4.0 > > > Shutdown hook to stop SparkContext was added recently. This results in ugly > errors when a streaming application is terminated by ctrl-C. > {code} > Exception in thread "Thread-27" org.apache.spark.SparkException: Job > cancelled because SparkContext was shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:736) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:735) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:735) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1468) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1403) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1642) > at > org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:559) > at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2266) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2236) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2236) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2236) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1764) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2236) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2236) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2236) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2236) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2218) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > {code} > This is because the Spark's shutdown hook stops the context, and the > streaming jobs fail in the middle. The correct solution is to stop the > streaming context before the spark context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
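A hedged, application-level sketch of the ordering the fix establishes (stop the StreamingContext, and with it the SparkContext, before the JVM's other shutdown hooks run); the merged change registers a prioritized hook inside Spark rather than relying on user code like this:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("shutdown-order-sketch")
val ssc = new StreamingContext(conf, Seconds(1))

// Stop streaming first so batches are not killed mid-flight by SparkContext's
// own shutdown hook when the application receives ctrl-C.
sys.addShutdownHook {
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}
{code}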
[jira] [Resolved] (SPARK-7783) Add rollup and cube support to DataFrame Python DSL
[ https://issues.apache.org/jira/browse/SPARK-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7783. Resolution: Fixed Fix Version/s: 1.4.0 > Add rollup and cube support to DataFrame Python DSL > --- > > Key: SPARK-7783 > URL: https://issues.apache.org/jira/browse/SPARK-7783 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > Fix For: 1.4.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ai He updated SPARK-7810: - External issue URL: (was: https://github.com/apache/spark/pull/6338) > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He > > Method "_load_from_socket" in rdd.py cannot load data from the JVM socket if ipv6 > is used. The current method only works with ipv4; the new modification > should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ai He updated SPARK-7810: - External issue URL: https://github.com/apache/spark/pull/6338 > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He > > Method "_load_from_socket" in rdd.py cannot load data from the JVM socket if ipv6 > is used. The current method only works with ipv4; the new modification > should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7810: --- Assignee: (was: Apache Spark) > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He > > Method "_load_from_socket" in rdd.py cannot load data from the JVM socket if ipv6 > is used. The current method only works with ipv4; the new modification > should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555361#comment-14555361 ] Apache Spark commented on SPARK-7810: - User 'AiHe' has created a pull request for this issue: https://github.com/apache/spark/pull/6338 > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He > > Method "_load_from_socket" in rdd.py cannot load data from the JVM socket if ipv6 > is used. The current method only works with ipv4; the new modification > should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7810: --- Assignee: Apache Spark > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He >Assignee: Apache Spark > > Method "_load_from_socket" in rdd.py cannot load data from the JVM socket if ipv6 > is used. The current method only works with ipv4; the new modification > should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
Ai He created SPARK-7810: Summary: rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used Key: SPARK-7810 URL: https://issues.apache.org/jira/browse/SPARK-7810 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Reporter: Ai He Method "_load_from_socket" in rdd.py cannot load data from the JVM socket if ipv6 is used. The current method only works with ipv4; the new modification should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6387) HTTP mode of HiveThriftServer2 doesn't work when built with Hive 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-6387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6387. - Resolution: Won't Fix We aren't building with Hive 12 anymore. > HTTP mode of HiveThriftServer2 doesn't work when built with Hive 0.12.0 > --- > > Key: SPARK-6387 > URL: https://issues.apache.org/jira/browse/SPARK-6387 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.1, 1.3.0 >Reporter: Cheng Lian > > Reproduction steps: > # Compile Spark against Hive 0.12.0 > {noformat}$ ./build/sbt > -Pyarn,hadoop-2.4,hive,hive-thriftserver,hive-0.12.0,scala-2.10 > -Dhadoop.version=2.4.1 clean assembly/assembly{noformat} > # Start the Thrift server in HTTP mode > Add the following stanza in {{hive-site.xml}}: > {noformat} > hive.server2.transport.mode > http > {noformat} > and > {noformat}$ ./bin/start-thriftserver.sh{noformat} > # Connect to the Thrift server via Beeline > {noformat}$ ./bin/beeline -u > "jdbc:hive2://localhost:10001/default?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice"{noformat} > # Execute any query and check the server log > We can see that no query execution related logs are output. > The reason is that, when running under HTTP mode, although we pass in a > {{SparkSQLCLIService}} instance > ([here|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L102]) > to {{ThriftHttpCLIService}}, Hive 0.12.0 just ignores it, and instantiate a > new {{CLIService}} > ([here|https://github.com/apache/hive/blob/release-0.12.0/service/src/java/org/apache/hive/service/cli/thrift/ThriftHttpCLIService.java#L91-L92] > and > [here|https://github.com/apache/hive/blob/release-0.12.0/service/src/java/org/apache/hive/service/cli/thrift/EmbeddedThriftBinaryCLIService.java#L32]). > Notice that while compiling against Hive 0.13.1, Spark SQL doesn't suffer > from this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7684) TestHive.reset complains Database does not exist: default
[ https://issues.apache.org/jira/browse/SPARK-7684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7684: Assignee: Cheng Lian > TestHive.reset complains Database does not exist: default > - > > Key: SPARK-7684 > URL: https://issues.apache.org/jira/browse/SPARK-7684 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Yin Huai >Assignee: Cheng Lian > > To see the error, try {{test-only > org.apache.spark.sql.hive.MetastoreDataSourcesSuite}}. You will see > {code} > 19:23:30.487 ERROR org.apache.spark.sql.hive.test.TestHive: FATAL ERROR: > Failed to reset TestDB state. > org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution > Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database > does not exist: default > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310) > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139) > at > org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310) > at > org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:425) > at > org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:94) > at > org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:433) > at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.afterEach(MetastoreDataSourcesSuite.scala:43) > at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) > at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.afterEach(MetastoreDataSourcesSuite.scala:40) > at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) > at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.afterEach(MetastoreDataSourcesSuite.scala:40) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264) > at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:40) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at org.scalatest.FunSuite.run(FunSuite.scala:1555) > at > 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) > at sbt.ForkMain$Run$2.call(ForkMain.java:294) > at sbt.ForkMain$Run$2.call(ForkMain.java:284) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2178) createSchemaRDD is not thread safe
[ https://issues.apache.org/jira/browse/SPARK-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2178. - Resolution: Later This has been a problem since Spark SQL 1.0 and we haven't heard a lot of complaints. Furthermore, Scala 2.10 (which is the only version that should have the problem) isn't making new releases anymore. Macros would be nice, but they aren't pressing enough to keep this open for now. > createSchemaRDD is not thread safe > -- > > Key: SPARK-2178 > URL: https://issues.apache.org/jira/browse/SPARK-2178 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust > > This is because implicit type tags are not thread safe. We could fix this > with compile time macros (which could also make the conversion a lot faster). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3184) Allow user to specify num tasks to use for a table
[ https://issues.apache.org/jira/browse/SPARK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3184. - Resolution: Won't Fix > Allow user to specify num tasks to use for a table > -- > > Key: SPARK-3184 > URL: https://issues.apache.org/jira/browse/SPARK-3184 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andy Konwinski >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5494) SparkSqlSerializer Ignores KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5494. - Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Michael Armbrust > SparkSqlSerializer Ignores KryoRegistrators > --- > > Key: SPARK-5494 > URL: https://issues.apache.org/jira/browse/SPARK-5494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Hamel Ajay Kothari >Assignee: Michael Armbrust > Fix For: 1.4.0 > > > We should make SparkSqlSerializer call {{super.newKryo}} before doing any of > its custom setup, in order to make sure it picks up custom > KryoRegistrators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
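To illustrate the pattern behind the SPARK-5494 fix, here is a hypothetical serializer subclass (not the actual SparkSqlSerializer code): building the Kryo instance via super.newKryo() first applies the user-configured spark.kryo.registrator classes, and only then are any extension-specific registrations layered on top.
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Hypothetical example of the pattern: delegate to super.newKryo() so that
// custom KryoRegistrators supplied by the user are honored, then add this
// serializer's own registrations on top of the returned instance.
class CustomKryoSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo()          // picks up spark.kryo.registrator classes
    kryo.register(classOf[Array[Byte]]) // example of an extra, serializer-specific registration
    kryo
  }
}
{code}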
[jira] [Commented] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555326#comment-14555326 ] Ram Sriharsha commented on SPARK-7404: -- Ah, perfect. I didn't notice RegressionMetrics in the codebase. That is great! > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7809) MultivariateOnlineSummarizer should allow users to configure what to compute
Xiangrui Meng created SPARK-7809: Summary: MultivariateOnlineSummarizer should allow users to configure what to compute Key: SPARK-7809 URL: https://issues.apache.org/jira/browse/SPARK-7809 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Currently MultivariateOnlineSummarizer computes every summary statistic it can provide, which is okay and convenient for a small number of features. If the feature dimension is large, this becomes expensive. So we should add setters to allow users to configure what to compute. {code} val summarizer = new MultivariateOnlineSummarizer() .withMean(false) .withMax(false) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7785) Add missing items to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555313#comment-14555313 ] Burak Yavuz commented on SPARK-7785: My belief about the Python linalg API so far has been that in Python you have two beautiful libraries, numpy and scipy. The serialization-deserialization overhead is not worth implementing wrappers, because numpy and scipy are backed by C and support vectorization. For linalg we have been leveraging Breeze for a long time, only adding things when the performance can be greatly improved. If we can obtain better performance from numpy and scipy, then let's just leverage them. Most of these methods were already named very similarly to their numpy and scipy counterparts anyway. > Add missing items to pyspark.mllib.linalg.Matrices > -- > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7807) High-Availability:: SparkHadoopUtil.scala should support hadoopConfiguration.addResource()
[ https://issues.apache.org/jira/browse/SPARK-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555312#comment-14555312 ] Norman He commented on SPARK-7807: -- When running Spark in local mode or on a Mesos cluster, the hadoopConfiguration needs to load core-site.xml and hdfs-site.xml from some http://url service. Because of a bundling issue (there are many copies of core-site.xml and hdfs-site.xml in all kinds of testing jars), the hadoopConfiguration that Spark instantiates cannot pick up the correct resources in an HDFS high-availability setup. Adding spark.hadoop.url support is one clean way to solve this issue. > High-Availability:: SparkHadoopUtil.scala should support > hadoopConfiguration.addResource() > -- > > Key: SPARK-7807 > URL: https://issues.apache.org/jira/browse/SPARK-7807 > Project: Spark > Issue Type: Improvement > Environment: running spark against a remote-hadoop HA cluster. Ease of > use with the spark.hadoop.url. prefix. > 1) user can supply sparkConf entries with the prefix spark.hadoop.url., like > spark.hadoop.url.core-site > and spark.hadoop.url.hdfs-site >Reporter: Norman He >Priority: Trivial > Labels: easyfix > > line 97 : the code below should be able to change from > conf.getAll.foreach { case (key, value) => > if (key.startsWith("spark.hadoop.")) { > hadoopConf.set(key.substring("spark.hadoop.".length), value) > } > } > to the new version--- > conf.getAll.foreach { case (key, value) => > if (key.startsWith("spark.hadoop.")) { > if( key.startsWith("spark.hadoop.url.")) >hadoopConf.addResource(new URL(value)) > else > hadoopConf.set(key.substring("spark.hadoop.".length), value) > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
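For readability, here is the SPARK-7807 proposal restated as a self-contained sketch (the wrapper object and method name are illustrative, not the actual code in SparkHadoopUtil): keys under spark.hadoop.url.* are treated as remote configuration resources and added via addResource, while the remaining spark.hadoop.* keys are copied into the Hadoop configuration as plain key/value settings.
{code}
import java.net.URL

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

object HadoopConfFromSparkConf {
  // Sketch of the behaviour proposed in this ticket.
  def appendSparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
    conf.getAll.foreach { case (key, value) =>
      if (key.startsWith("spark.hadoop.url.")) {
        // e.g. spark.hadoop.url.core-site=http://config-service/core-site.xml
        hadoopConf.addResource(new URL(value))
      } else if (key.startsWith("spark.hadoop.")) {
        hadoopConf.set(key.substring("spark.hadoop.".length), value)
      }
    }
  }
}
{code}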
[jira] [Commented] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555311#comment-14555311 ] Xiangrui Meng commented on SPARK-7404: -- I think we only need to wrap `RegressionMetrics` from the `spark.mllib` package, which provides R2, RMSE, and MAE. > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
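For reference, a minimal sketch of using the existing spark.mllib class that the proposed RegressionEvaluator would wrap, assuming an RDD of (prediction, label) pairs produced by some regression model:
{code}
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.rdd.RDD

// predictionsAndLabels: (prediction, observation) pairs from any regression model.
def printRegressionMetrics(predictionsAndLabels: RDD[(Double, Double)]): Unit = {
  val metrics = new RegressionMetrics(predictionsAndLabels)
  println(s"RMSE = ${metrics.rootMeanSquaredError}")
  println(s"MAE  = ${metrics.meanAbsoluteError}")
  println(s"R2   = ${metrics.r2}")
}
{code}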
[jira] [Commented] (SPARK-7785) Add missing items to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555310#comment-14555310 ] Xiangrui Meng commented on SPARK-7785: -- In PySpark, we should delegate all local linear algebra operations to numpy and scipy. For the factory methods, users should use numpy/scipy directly. > Add missing items to pyspark.mllib.linalg.Matrices > -- > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7535) Audit Pipeline APIs for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554629#comment-14554629 ] Xiangrui Meng edited comment on SPARK-7535 at 5/21/15 11:49 PM: Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc (SPARK-7808) 10. param and getParam should be final 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. (SPARK-7794) 13. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. was (Author: mengxr): Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc 10. param and getParam should be final 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. (SPARK-7794) 13. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. > Audit Pipeline APIs for 1.4 > --- > > Key: SPARK-7535 > URL: https://issues.apache.org/jira/browse/SPARK-7535 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > This is an umbrella for auditing the Pipeline (spark.ml) APIs. Items to > check: > * Public/protected/private access > * Consistency across spark.ml > * Classes, methods, and parameters in spark.mllib but missing in spark.ml > ** We should create JIRAs for each of these (under an umbrella) as to-do > items for future releases. > For each algorithm or API component, create a subtask under this umbrella. > Some major new items: > * new feature transformers > * tree models > * elastic-net > * ML attributes > * developer APIs (Predictor, Classifier, Regressor) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7808) Package doc for spark.ml.feature
Xiangrui Meng created SPARK-7808: Summary: Package doc for spark.ml.feature Key: SPARK-7808 URL: https://issues.apache.org/jira/browse/SPARK-7808 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng We added several feature transformers in Spark 1.4. It would be great to add package doc for `spark.ml.feature`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7535) Audit Pipeline APIs for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554629#comment-14554629 ] Xiangrui Meng edited comment on SPARK-7535 at 5/21/15 11:45 PM: Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc 10. param and getParam should be final 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. (SPARK-7794) 13. Mention `RegexTokenizer` in `Tokenizer`. (SPARK-7794) 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. was (Author: mengxr): Some notes: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 2. @varargs to setDefault (SPARK-7498) 3. Move Evaluator to ml.evaluation. 4. Mention larger metrics are better. 5. PipelineModel doc. “compiled” -> “fitted” 6. Remove Params.validateParams(paramMap)? 7. UnresolvedAttribute (Java compatibility?) 8. Missing RegressionEvaluator (SPARK-7404) 9. ml.feature missing package doc 10. param and getParam should be final 11. Hide PolynomialExpansion.expand 12. Update RegexTokenizer default setting. 13. Mention `RegexTokenizer` in `Tokenizer`. 14. Hide VectorAssembler. 15. Word2Vec.minCount -> @param 16. ParamValidators -> DeveloperApi 17. Params -> @DeveloperApi 18. ALS -> use dataframes to store user/item factors? Then we can hide ALS.Rating 19. ALSModel -> remove training parameters? 20. Hide MetadataUtils/SchemaUtils. > Audit Pipeline APIs for 1.4 > --- > > Key: SPARK-7535 > URL: https://issues.apache.org/jira/browse/SPARK-7535 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > This is an umbrella for auditing the Pipeline (spark.ml) APIs. Items to > check: > * Public/protected/private access > * Consistency across spark.ml > * Classes, methods, and parameters in spark.mllib but missing in spark.ml > ** We should create JIRAs for each of these (under an umbrella) as to-do > items for future releases. > For each algorithm or API component, create a subtask under this umbrella. > Some major new items: > * new feature transformers > * tree models > * elastic-net > * ML attributes > * developer APIs (Predictor, Classifier, Regressor) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7807) High-Availability:: SparkHadoopUtil.scala should support hadoopConfiguration.addResource()
[ https://issues.apache.org/jira/browse/SPARK-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7807: - Issue Type: Improvement (was: Bug) Can you explain this change more and why it would be useful? > High-Availability:: SparkHadoopUtil.scala should support > hadoopConfiguration.addResource() > -- > > Key: SPARK-7807 > URL: https://issues.apache.org/jira/browse/SPARK-7807 > Project: Spark > Issue Type: Improvement > Environment: running spark against a remote-hadoop HA cluster. Ease of > use with the spark.hadoop.url. prefix. > 1) user can supply sparkConf entries with the prefix spark.hadoop.url., like > spark.hadoop.url.core-site > and spark.hadoop.url.hdfs-site >Reporter: Norman He >Priority: Trivial > Labels: easyfix > > line 97 : the code below should be able to change from > conf.getAll.foreach { case (key, value) => > if (key.startsWith("spark.hadoop.")) { > hadoopConf.set(key.substring("spark.hadoop.".length), value) > } > } > to the new version--- > conf.getAll.foreach { case (key, value) => > if (key.startsWith("spark.hadoop.")) { > if( key.startsWith("spark.hadoop.url.")) >hadoopConf.addResource(new URL(value)) > else > hadoopConf.set(key.substring("spark.hadoop.".length), value) > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7758) Failed to start thrift server when metastore is postgre sql
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7758: Assignee: Cheng Lian > Failed to start thrift server when metastore is postgre sql > --- > > Key: SPARK-7758 > URL: https://issues.apache.org/jira/browse/SPARK-7758 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Tao Wang >Assignee: Cheng Lian >Priority: Blocker > Attachments: hive-site.xml, with error.log, with no error.log > > > I am using today's master branch to start thrift server with setting > metastore to postgre sql, and it shows error like: > {code} > 15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE > 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE > DELETEME1432125837197 CASCADE : Syntax error: Encountered "CASCADE" at line > 1, column 34. > java.sql.SQLSyntaxErrorException: Syntax error: Encountered "CASCADE" at line > 1, column 34. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown > Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at > org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) > at > org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)` > But it works well with earlier master branch (on 7th, April). > After printing their debug level log, I found current branch tries to connect > with derby but didn't know why, maybe the big reconstructure in sql module > cause this issue. 
> The Datastore shows in current branch: > 15/05/20 20:43:57 DEBUG Datastore: === Datastore > = > 15/05/20 20:43:57 DEBUG Datastore: StoreManager : "rdbms" > (org.datanucleus.store.rdbms.RDBMSStoreManager) > 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write > 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), > Validate(None) > 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, > STOREDPROC] > 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 > 15/05/20 20:43:57 DEBUG Datastore: > === > 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : > org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter > 15/05/20 20:43:57 DEBUG Datastore: Datastore : name="Apache Derby" > version="10.10.1.1 - (1458268)" > 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name="Apache Derby > Embedded JDBC Driver" version="10.10.1.1 - (1458268)" > 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : > URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] > 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : > URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] > 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : > factory="datanucleus1" case=UPPERCASE catalog= schema=SPARK > 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : "MixedCase" > UPPERCASE "MixedCase-Sensitive" > 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : > Table=128 Column=128 Constraint=128 Index=128 Delimiter=" > 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : > catalog=false schema=true > 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, > rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL > 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes > (max-batch-size=50) > 15/05/20 20:43:57 DEBUG Datastore: Queries : Results direction=forward, > type=forward-only, concurrency=read-only > 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 > 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, > DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, > VARBINARY, JAVA_OBJECT > 15/05/20 20:43:57 DEBUG Datastore: > === > The Datastore in earlier master branch: > 15/05/20 20:18:10 D
[jira] [Comment Edited] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555285#comment-14555285 ] Ram Sriharsha edited comment on SPARK-7404 at 5/21/15 11:36 PM: scikit learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural first metrics to make available via the Evaluator. was (Author: rams): scikit learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural metrics to make available via the Evaluator. > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555285#comment-14555285 ] Ram Sriharsha edited comment on SPARK-7404 at 5/21/15 11:35 PM: scikit learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural metrics to make available via the Evaluator. was (Author: rams): sickout learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural metrics to make available via the Evaluator. > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555285#comment-14555285 ] Ram Sriharsha commented on SPARK-7404: -- scikit learn and R provide a variety of regression metrics http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html R2 score and RMSE seem like natural metrics to make available via the Evaluator. > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6289) PySpark doesn't maintain SQL date Types
[ https://issues.apache.org/jira/browse/SPARK-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555286#comment-14555286 ] Davies Liu commented on SPARK-6289: --- This will be fixed by upgrading to Pyrolite 4.6, which will pickle java.sql.Date as datetime.date. > PySpark doesn't maintain SQL date Types > --- > > Key: SPARK-6289 > URL: https://issues.apache.org/jira/browse/SPARK-6289 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.2.1 >Reporter: Michael Nazario >Assignee: Davies Liu > > For the DateType, Spark SQL requires a datetime.date in Python. However, if > you collect a row based on that type, you'll end up with a returned value > which is of type datetime.datetime. > I have tried to reproduce this using the pyspark shell, but have been unable > to. This is definitely a problem coming from pyrolite though: > https://github.com/irmen/Pyrolite/ > Pyrolite is being used for datetime and date serialization, but it appears to > map to datetime objects rather than date objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7624) Task scheduler delay is increasing time over time in spark local mode
[ https://issues.apache.org/jira/browse/SPARK-7624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555284#comment-14555284 ] Apache Spark commented on SPARK-7624: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6337 > Task scheduler delay is increasing time over time in spark local mode > - > > Key: SPARK-7624 > URL: https://issues.apache.org/jira/browse/SPARK-7624 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1 >Reporter: Jack Hu >Assignee: Davies Liu > Labels: delay, schedule > Fix For: 1.4.0 > > > I am running a simple spark streaming program with spark 1.3.1 in local mode, > it receives json string from a socket with rate 50 events per second, it can > run well in first 6 hours (although the minor gc count per minute is > increasing all the time), after that, i can see that the scheduler delay in > every task is significant increased from 10 ms to 100 ms, after 10 hours > running, the task delay is about 800 ms and cpu is also increased from 2% to > 30%. This causes the steaming job can not finish in one batch interval (5 > seconds). I dumped the java memory after 16 hours and can see there are about > 20 {{org.apache.spark.scheduler.local.ReviveOffers}} objects in > {{akka.actor.LightArrayRevolverScheduler$TaskQueue[]}}. Then i checked the > code and see only one place may put the {{ReviveOffers}} to akka > {{LightArrayRevolverScheduler}}: the {{LocalActor::reviveOffers}} > {code} > def reviveOffers() { > val offers = Seq(new WorkerOffer(localExecutorId, localExecutorHostname, > freeCores)) > val tasks = scheduler.resourceOffers(offers).flatten > for (task <- tasks) { > freeCores -= scheduler.CPUS_PER_TASK > executor.launchTask(executorBackend, taskId = task.taskId, > attemptNumber = task.attemptNumber, > task.name, task.serializedTask) > } > if (tasks.isEmpty && scheduler.activeTaskSets.nonEmpty) { > // Try to reviveOffer after 1 second, because scheduler may wait for > locality timeout > context.system.scheduler.scheduleOnce(1000 millis, self, ReviveOffers) > } > } > {code} > I removed the last three lines in this method (the whole {{if}} block, which > is introduced from https://issues.apache.org/jira/browse/SPARK-4939), it > worked smooth after 20 hours running, the scheduler delay is about 10 ms all > the time. So there should have some conditions that the ReviveOffers will be > duplicate scheduled? I am not sure why this happens, but i feel that this is > the root cause of this issue. > My spark settings: > # Memor: 3G > # CPU: 8 cores > # Streaming Batch interval: 5 seconds. > Here are my streaming code: > {code} > val input = ssc.socketTextStream( > hostname, port, StorageLevel.MEMORY_ONLY_SER).mapPartitions( > /// parse the json to Order > Order(_), preservePartitioning = true) > val mresult = input.map( > v => (v.customer, UserSpending(v.customer, v.count * v.price, > v.timestamp.toLong))).cache() > val tempr = mresult.window( > Seconds(firstStageWindowSize), > Seconds(firstStageWindowSize) > ).transform( > rdd => rdd.union(rdd).union(rdd).union(rdd) > ) > tempr.count.print > tempr.cache().foreachRDD((rdd, t) => { > for (i <- 1 to 5) { > val c = rdd.filter(x=>scala.util.Random.nextInt(5) == i).count() > println("""T: """ + t + """: """ + c) > } > }) > {code} > > Updated at 2015-05-15 > I did print some detail schedule times of the suspect lines in > {{LocalActor::reviveOffers}}: {color:red}*1685343501*{color} times after 18 > hours running. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555282#comment-14555282 ] Apache Spark commented on SPARK-4939: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6337 > Python updateStateByKey example hang in local mode > -- > > Key: SPARK-4939 > URL: https://issues.apache.org/jira/browse/SPARK-4939 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, Streaming >Affects Versions: 1.2.1 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.2.2, 1.3.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7785) Add missing items to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555280#comment-14555280 ] Joseph K. Bradley commented on SPARK-7785: -- Ping [~brkyvz] [~mengxr] to weigh in on how many linalg methods we want to add to our PySpark classes > Add missing items to pyspark.mllib.linalg.Matrices > -- > > Key: SPARK-7785 > URL: https://issues.apache.org/jira/browse/SPARK-7785 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > For DenseMatrices. > Class Methods > __str__, transpose > Object Methods > zeros, ones, eye, rand, randn, diag > For SparseMatrices > Class Methods > __str__, transpose > Object Methods, > fromCoo, speye, sprand, sprandn, spdiag, > Matrices Methods, horzcat, vertcat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7807) High-Availability:: SparkHadoopUtil.scala should support hadoopConfiguration.addResource()
Norman He created SPARK-7807: Summary: High-Availability:: SparkHadoopUtil.scala should support hadoopConfiguration.addResource() Key: SPARK-7807 URL: https://issues.apache.org/jira/browse/SPARK-7807 Project: Spark Issue Type: Bug Environment: running spark against a remote-hadoop HA cluster. Ease of use with the spark.hadoop.url. prefix. 1) user can supply sparkConf entries with the prefix spark.hadoop.url., like spark.hadoop.url.core-site and spark.hadoop.url.hdfs-site Reporter: Norman He Priority: Trivial line 97 : the code below should be able to change from conf.getAll.foreach { case (key, value) => if (key.startsWith("spark.hadoop.")) { hadoopConf.set(key.substring("spark.hadoop.".length), value) } } to the new version--- conf.getAll.foreach { case (key, value) => if (key.startsWith("spark.hadoop.")) { if( key.startsWith("spark.hadoop.url.")) hadoopConf.addResource(new URL(value)) else hadoopConf.set(key.substring("spark.hadoop.".length), value) } } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha reassigned SPARK-7404: Assignee: Ram Sriharsha > Add RegressionEvaluator to spark.ml > --- > > Key: SPARK-7404 > URL: https://issues.apache.org/jira/browse/SPARK-7404 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Ram Sriharsha > > This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7806) spark-ec2 launch script fails for Python3
[ https://issues.apache.org/jira/browse/SPARK-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7806: --- Assignee: (was: Apache Spark) > spark-ec2 launch script fails for Python3 > - > > Key: SPARK-7806 > URL: https://issues.apache.org/jira/browse/SPARK-7806 > Project: Spark > Issue Type: Bug > Components: EC2, PySpark >Affects Versions: 1.3.1 > Environment: All platforms. >Reporter: Matthew Goodman >Priority: Minor > > Depending on the options used the spark-ec2 script will terminate > ungracefully. > Relevant buglets include: > - urlopen() returning bytes vs. string > - floor division change for partition calculation > - filter() iteration behavior change in module calculation > I have a fixed version that I wish to contribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org