[jira] [Updated] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-6101: Fix Version/s: (was: 1.5.0) 1.6.0 Description: similar to https://github.com/databricks/spark-avro and https://github.com/databricks/spark-csv Here's a good basis for a high-level, Java-based DynamoDB connector: https://github.com/sporcina/dynamodb-connector/ was:similar to https://github.com/databricks/spark-avro and https://github.com/databricks/spark-csv > Create a SparkSQL DataSource API implementation for DynamoDB > > > Key: SPARK-6101 > URL: https://issues.apache.org/jira/browse/SPARK-6101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > Fix For: 1.6.0 > > > similar to https://github.com/databricks/spark-avro and > https://github.com/databricks/spark-csv > Here's a good basis for a high-level, Java-based DynamoDB connector: > https://github.com/sporcina/dynamodb-connector/
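For context, a DataSource API package for DynamoDB would be loaded the same way spark-avro and spark-csv are. The sketch below is purely illustrative: the source name "com.databricks.spark.dynamodb" and its "table"/"region" options are hypothetical, since no such package exists yet.

{code}
# Hypothetical usage sketch only: the "com.databricks.spark.dynamodb" source
# name and its "table"/"region" options are assumptions, not a real package.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dynamodb-datasource-sketch")
sqlContext = SQLContext(sc)

# Spark 1.3-era DataSource API entry point: sqlContext.load(source=..., **options)
df = sqlContext.load(source="com.databricks.spark.dynamodb",
                     table="my_table", region="us-east-1")
df.registerTempTable("my_table")
sqlContext.sql("SELECT COUNT(*) FROM my_table").show()
{code}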
[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier
[ https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627045#comment-14627045 ] Chris Fregly commented on SPARK-4144: - @[~freeman-lab]: looks like this is still open. Any chance I can take it back? Did you make any progress that you'd like to share? Let me know. I'd love to help here. > Support incremental model training of Naive Bayes classifier > > > Key: SPARK-4144 > URL: https://issues.apache.org/jira/browse/SPARK-4144 > Project: Spark > Issue Type: Improvement > Components: MLlib, Streaming >Reporter: Chris Fregly >Assignee: Jeremy Freeman > > Per Xiangrui Meng from the following user list discussion: > http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E > > "For Naive Bayes, we need to update the priors and conditional > probabilities, which means we should also remember the number of > observations for the updates."
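To make the quoted point concrete ("remember the number of observations for the updates"), here is a toy Python sketch of incrementally maintained Naive Bayes counts. It is an illustration under stated assumptions (plain multinomial counts, no smoothing), not MLlib code.

{code}
from collections import defaultdict

# Toy sketch, not MLlib: keep raw counts so priors and conditional
# probabilities can be re-derived after every new batch of observations.
class IncrementalNaiveBayes(object):
    def __init__(self):
        self.class_counts = defaultdict(int)                  # observations per class
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.total = 0

    def update(self, batch):
        # batch: list of (label, {feature: count}) observations
        for label, features in batch:
            self.class_counts[label] += 1
            self.total += 1
            for f, c in features.items():
                self.feature_counts[label][f] += c

    def prior(self, label):
        return float(self.class_counts[label]) / self.total

nb = IncrementalNaiveBayes()
nb.update([("spam", {"free": 2}), ("ham", {"hello": 1})])
nb.update([("spam", {"winner": 1})])
print(nb.prior("spam"))  # 2/3
{code}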
[jira] [Created] (SPARK-8550) table() no longer supports specifying the database - i.e. table([database].[table]).
Chris Fregly created SPARK-8550: --- Summary: table() no longer supports specifying the database - i.e. table([database].[table]). Key: SPARK-8550 URL: https://issues.apache.org/jira/browse/SPARK-8550 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Chris Fregly This is a regression from 1.3. The workaround is to use sql("SELECT * FROM [database].[table]") for now.
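A minimal sketch of the regression and the workaround, assuming a hypothetical table mydb.mytable:

{code}
# Hypothetical database/table names, for illustration only.
df = sqlContext.table("mydb.mytable")   # worked in 1.3; errors out in 1.4.0

# Workaround: qualify the database inside a SQL statement instead.
df = sqlContext.sql("SELECT * FROM mydb.mytable")
{code}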
[jira] [Updated] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-6101: Description: similar to https://github.com/databricks/spark-avro and https://github.com/databricks/spark-csv (was: similar to https://github.com/databricks/spark-avro) > Create a SparkSQL DataSource API implementation for DynamoDB > > > Key: SPARK-6101 > URL: https://issues.apache.org/jira/browse/SPARK-6101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > Fix For: 1.5.0 > > > similar to https://github.com/databricks/spark-avro and > https://github.com/databricks/spark-csv
[jira] [Resolved] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library
[ https://issues.apache.org/jira/browse/SPARK-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly resolved SPARK-6654. - Resolution: Duplicate Fix Version/s: 1.4.0 duplicate of SPARK-7679 > Update Kinesis Streaming impls (both KCL-based and Direct) to use latest > aws-java-sdk and kinesis-client-library > > > Key: SPARK-6654 > URL: https://issues.apache.org/jira/browse/SPARK-6654 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly > Fix For: 1.4.0 > >
[jira] [Closed] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library
[ https://issues.apache.org/jira/browse/SPARK-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly closed SPARK-6654. --- > Update Kinesis Streaming impls (both KCL-based and Direct) to use latest > aws-java-sdk and kinesis-client-library > > > Key: SPARK-6654 > URL: https://issues.apache.org/jira/browse/SPARK-6654 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly > Fix For: 1.4.0 > >
[jira] [Updated] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library
[ https://issues.apache.org/jira/browse/SPARK-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-6654: Target Version/s: 1.4.0 (was: 1.5.0) > Update Kinesis Streaming impls (both KCL-based and Direct) to use latest > aws-java-sdk and kinesis-client-library > > > Key: SPARK-6654 > URL: https://issues.apache.org/jira/browse/SPARK-6654 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >
[jira] [Updated] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-6101: Fix Version/s: (was: 1.4.0) 1.5.0 > Create a SparkSQL DataSource API implementation for DynamoDB > > > Key: SPARK-6101 > URL: https://issues.apache.org/jira/browse/SPARK-6101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > Fix For: 1.5.0 > > > similar to https://github.com/databricks/spark-avro
[jira] [Closed] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions
[ https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly closed SPARK-4184. --- Resolution: Duplicate We'll incorporate these changes incrementally. > Improve Spark Streaming documentation to address commonly-asked questions > -- > > Key: SPARK-4184 > URL: https://issues.apache.org/jira/browse/SPARK-4184 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: Chris Fregly > Labels: documentation, streaming > > Improve Streaming documentation including API descriptions, > concurrency/thread safety, fault tolerance, replication, checkpointing, > scalability, resource allocation and utilization, back pressure, and > monitoring. > Also, add a section to the Kinesis streaming guide describing how to use IAM > roles with the Spark Kinesis Receiver.
[jira] [Updated] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library
[ https://issues.apache.org/jira/browse/SPARK-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-6654: Priority: Major (was: Blocker) Target Version/s: 1.5.0 (was: 1.4.0) > Update Kinesis Streaming impls (both KCL-based and Direct) to use latest > aws-java-sdk and kinesis-client-library > > > Key: SPARK-6654 > URL: https://issues.apache.org/jira/browse/SPARK-6654 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >
[jira] [Commented] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522351#comment-14522351 ] Chris Fregly commented on SPARK-7178: - fillna() is also commonly used: https://forums.databricks.com/questions/790/how-do-i-replace-nulls-with-0s-in-a-dataframe.html > Improve DataFrame documentation and code samples > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > dominating. > also, working with StructTypes is a bit confusing. the following link: > https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema > (Python tab) implies that you can work with tuples directly when creating a > DataFrame. > however, the following code errors out unless we explicitly use Row objects: > {code} > from pyspark.sql import Row > from pyspark.sql.types import * > # The schema is encoded in a string. > schemaString = "a" > fields = [StructField(field_name, MapType(StringType(),IntegerType())) for > field_name in schemaString.split()] > schema = StructType(fields) > df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) > {code}
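A minimal sketch of the fillna() pattern from the linked forum post (replacing nulls with 0s); the toy data is hypothetical:

{code}
from pyspark.sql import Row

# Toy data: the second row has a null (None) value in column b.
df = sqlContext.createDataFrame([Row(a=1, b=2), Row(a=2, b=None)])

# Replace nulls with 0 in all compatible columns ...
df.fillna(0).show()

# ... or only in specific columns.
df.fillna(0, subset=["b"]).show()
{code}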
[jira] [Comment Edited] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517858#comment-14517858 ] Chris Fregly edited comment on SPARK-7178 at 4/28/15 8:07 PM: -- added these to the forums: AND and OR: https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html Nested Map Columns in DataFrames: https://forums.databricks.com/questions/764/how-do-i-create-a-dataframe-with-nested-map-column.html Casting columns of DataFrames: https://forums.databricks.com/questions/767/how-do-i-cast-within-a-dataframe.html was (Author: cfregly): added this to the forums to address the AND and OR issue: https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html > Improve DataFrame documentation and code samples > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > dominating. > also, working with StructTypes is a bit confusing. the following link: > https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema > (Python tab) implies that you can work with tuples directly when creating a > DataFrame. > however, the following code errors out unless we explicitly use Row objects: > {code} > from pyspark.sql import Row > from pyspark.sql.types import * > # The schema is encoded in a string. > schemaString = "a" > fields = [StructField(field_name, MapType(StringType(),IntegerType())) for > field_name in schemaString.split()] > schema = StructType(fields) > df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) > {code}
[jira] [Commented] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517858#comment-14517858 ] Chris Fregly commented on SPARK-7178: - added this to the forums to address the AND and OR issue: https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html > Improve DataFrame documentation and code samples > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > dominating. > also, working with StructTypes is a bit confusing. the following link: > https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema > (Python tab) implies that you can work with tuples directly when creating a > DataFrame. > however, the following code errors out unless we explicitly use Row objects: > {code} > from pyspark.sql import Row > from pyspark.sql.types import * > # The schema is encoded in a string. > schemaString = "a" > fields = [StructField(field_name, MapType(StringType(),IntegerType())) for > field_name in schemaString.split()] > schema = StructType(fields) > df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) > {code}
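For reference, a minimal sketch of the parenthesized & / | convention the forum answer describes; the toy DataFrame is hypothetical:

{code}
from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(age=25, name='Alice'),
                                 Row(age=30, name='Bob')])

# AND / OR are spelled & / |, and each comparison needs its own parentheses:
# in Python, & binds tighter than > and ==, so an unparenthesized expression
# like df.age > 21 & df.name == 'Alice' is evaluated in the wrong order.
df.filter((df.age > 21) & (df.name == 'Alice')).show()
df.filter((df.age > 28) | (df.name == 'Alice')).show()
{code}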
[jira] [Commented] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516095#comment-14516095 ] Chris Fregly commented on SPARK-7178: - {code} from pyspark.sql import Row from pyspark.sql.types import * # The schema is encoded in a string. schemaString = "a" fields = [StructField(field_name, MapType(StringType(),IntegerType())) for field_name in schemaString.split()] schema = StructType(fields) df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) {code} is equivalent to (without schema) {code} df2 = sqlContext.createDataFrame([{'a':{'b': 1}}]) {code} but this isn't clear in the docs as far as I can tell > Improve DataFrame documentation and code samples > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > dominating. > also, working with StructTypes is a bit confusing. the following link: > https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema > (Python tab) implies that you can work with tuples directly when creating a > DataFrame. > however, the following code errors out unless we explicitly use Row objects: > {code} > from pyspark.sql import Row > from pyspark.sql.types import * > # The schema is encoded in a string. > schemaString = "a" > fields = [StructField(field_name, MapType(StringType(),IntegerType())) for > field_name in schemaString.split()] > schema = StructType(fields) > df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) > {code}
[jira] [Updated] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-7178: Description: AND and OR are not straightforward when using the new DataFrame API. the current convention - accepted by Pandas users - is to use the bitwise & and | instead of AND and OR. when using these, however, you need to wrap each expression in parentheses to prevent the bitwise operator from dominating. also, working with StructTypes is a bit confusing. the following link: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema (Python tab) implies that you can work with tuples directly when creating a DataFrame. however, the following code errors out unless we explicitly use Row objects: {code} from pyspark.sql import Row from pyspark.sql.types import * # The schema is encoded in a string. schemaString = "a" fields = [StructField(field_name, MapType(StringType(),IntegerType())) for field_name in schemaString.split()] schema = StructType(fields) df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) {code} was: AND and OR are not straightforward when using the new DataFrame API. the current convention - accepted by Pandas users - is to use the bitwise & and | instead of AND and OR. when using these, however, you need to wrap each expression in parentheses to prevent the bitwise operator from dominating. also, working with StructTypes is a bit confusing. the following link: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema (Python tab) implies that you can work with tuples directly when creating a DataFrame. however, the following code errors out unless we explicitly use Row objects: {code} fields = [StructField(field_name, MapType(StringType(),IntegerType())) for field_name in schemaString.split()] schema = StructType(fields) df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) {code} > Improve DataFrame documentation and code samples > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > dominating. > also, working with StructTypes is a bit confusing. the following link: > https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema > (Python tab) implies that you can work with tuples directly when creating a > DataFrame. > however, the following code errors out unless we explicitly use Row objects: > {code} > from pyspark.sql import Row > from pyspark.sql.types import * > # The schema is encoded in a string. > schemaString = "a" > fields = [StructField(field_name, MapType(StringType(),IntegerType())) for > field_name in schemaString.split()] > schema = StructType(fields) > df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) > {code}
[jira] [Updated] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-7178: Description: AND and OR are not straightforward when using the new DataFrame API. the current convention - accepted by Pandas users - is to use the bitwise & and | instead of AND and OR. when using these, however, you need to wrap each expression in parentheses to prevent the bitwise operator from dominating. also, working with StructTypes is a bit confusing. the following link: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema (Python tab) implies that you can work with tuples directly when creating a DataFrame. however, the following code errors out unless we explicitly use Row objects: {code} fields = [StructField(field_name, MapType(StringType(),IntegerType())) for field_name in schemaString.split()] schema = StructType(fields) df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) {code} was: AND and OR are not straightforward when using the new DataFrame API. the current convention - accepted by Pandas users - is to use the bitwise & and | instead of AND and OR. when using these, however, you need to wrap each expression in parentheses to prevent the bitwise operator from dominating. also, it's a bit confusing when creating a > Improve DataFrame documentation and code samples > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > dominating. > also, working with StructTypes is a bit confusing. the following link: > https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema > (Python tab) implies that you can work with tuples directly when creating a > DataFrame. > however, the following code errors out unless we explicitly use Row objects: > {code} > fields = [StructField(field_name, MapType(StringType(),IntegerType())) for > field_name in schemaString.split()] > schema = StructType(fields) > df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) > {code}
[jira] [Commented] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516070#comment-14516070 ] Chris Fregly commented on SPARK-7178: - cc'ing [~joshrosen]. Just talked to him about this as well. > Improve DataFrame documentation and code samples > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > dominating. > also, working with StructTypes is a bit confusing. the following link: > https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema > (Python tab) implies that you can work with tuples directly when creating a > DataFrame. > however, the following code errors out unless we explicitly use Row objects: > {code} > fields = [StructField(field_name, MapType(StringType(),IntegerType())) for > field_name in schemaString.split()] > schema = StructType(fields) > df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) > {code}
[jira] [Comment Edited] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516021#comment-14516021 ] Chris Fregly edited comment on SPARK-7178 at 4/28/15 12:46 AM: --- I recommend updating all of the following: 1) Scala/Python/PySpark docs (i.e. https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame) 2) SQL Programming Guide (i.e. https://spark.apache.org/docs/latest/sql-programming-guide.html) was (Author: cfregly): I recommend updating both the Scala docs and the SQL Programming Guide. > Improve DataFrame documentation and code samples > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > dominating. > also, it's a bit confusing when creating a
[jira] [Updated] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-7178: Description: AND and OR are not straightforward when using the new DataFrame API. the current convention - accepted by Pandas users - is to use the bitwise & and | instead of AND and OR. when using these, however, you need to wrap each expression in parentheses to prevent the bitwise operator from dominating. also, it's a bit confusing when creating a was: AND and OR are not straightforward when using the new DataFrame API. the current convention - accepted by Pandas users - is to use the bitwise & and | instead of AND and OR. when using these, however, you need to wrap each expression in parentheses to prevent the bitwise operator from dominating. > Improve DataFrame documentation and code samples > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > dominating. > also, it's a bit confusing when creating a
[jira] [Updated] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-7178: Summary: Improve DataFrame documentation and code samples (was: Improve DataFrame documentation to include common uses like AND and OR semantics within filters, etc) > Improve DataFrame documentation and code samples > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > dominating.
[jira] [Commented] (SPARK-7178) Improve DataFrame documentation to include common uses like AND and OR semantics within filters, etc
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516021#comment-14516021 ] Chris Fregly commented on SPARK-7178: - I recommend updating both the Scala docs and the SQL Programming Guide. > Improve DataFrame documentation to include common uses like AND and OR > semantics within filters, etc > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > dominating.
[jira] [Commented] (SPARK-7178) Improve DataFrame documentation to include common uses like AND and OR semantics within filters, etc
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516024#comment-14516024 ] Chris Fregly commented on SPARK-7178: - cc'ing [~rxin] > Improve DataFrame documentation to include common uses like AND and OR > semantics within filters, etc > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > dominating.
[jira] [Created] (SPARK-7178) Improve DataFrame documentation to include common uses like AND and OR semantics within filters, etc
Chris Fregly created SPARK-7178: --- Summary: Improve DataFrame documentation to include common uses like AND and OR semantics within filters, etc Key: SPARK-7178 URL: https://issues.apache.org/jira/browse/SPARK-7178 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Chris Fregly AND and OR are not straightforward when using the new DataFrame API. the current convention - accepted by Pandas users - is to use the bitwise & and | instead of AND and OR. when using these, however, you need to wrap each expression in parentheses to prevent the bitwise operator from dominating.
[jira] [Commented] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
[ https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495613#comment-14495613 ] Chris Fregly commented on SPARK-6514: - [~tdas] can you take a look at this PR when you get a chance? > For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as > the Kinesis stream itself > > > Key: SPARK-6514 > URL: https://issues.apache.org/jira/browse/SPARK-6514 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Chris Fregly > > Context: I started the original Kinesis impl with KCL 1.0 (not supported), > then finished on KCL 1.1 (supported) without realizing that it was supported. > Also, we should upgrade to the latest Kinesis Client Library (KCL), which is > currently v1.2, I believe.
[jira] [Commented] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
[ https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495607#comment-14495607 ] Chris Fregly commented on SPARK-6514: - https://github.com/apache/spark/pull/5375 > For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as > the Kinesis stream itself > > > Key: SPARK-6514 > URL: https://issues.apache.org/jira/browse/SPARK-6514 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Chris Fregly > > Context: I started the original Kinesis impl with KCL 1.0 (not supported), > then finished on KCL 1.1 (supported) without realizing that it was supported. > Also, we should upgrade to the latest Kinesis Client Library (KCL), which is > currently v1.2, I believe.
[jira] [Commented] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
[ https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495605#comment-14495605 ] Chris Fregly commented on SPARK-6514: - Hey Pawel! I think keeping it regionName is fine. I just reviewed the KCL code and docs again - and realized that they also use this regionName for CloudWatch. And by exposing the implementation, I meant that calling something "Dynamo Region" would be awkward if AWS ever changes the implementation to be something other than Dynamo. That's the level of implementation that I was referring to. Also, if we ever move off of KCL - or no longer need to set this region for whatever reason - it may bite us later. Just something to think about. Lastly, I think it's OK not to extract the region from the stream URL - especially if we make regionName a required field on the API. If it's optional, however, I would use the one in the stream location. This may be overly complicated. Your call. > For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as > the Kinesis stream itself > > > Key: SPARK-6514 > URL: https://issues.apache.org/jira/browse/SPARK-6514 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Chris Fregly > > Context: I started the original Kinesis impl with KCL 1.0 (not supported), > then finished on KCL 1.1 (supported) without realizing that it was supported. > Also, we should upgrade to the latest Kinesis Client Library (KCL), which is > currently v1.2, I believe.
[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495599#comment-14495599 ] Chris Fregly commented on SPARK-5960: - [~tdas] can you take a look at this PR real quick? > Allow AWS credentials to be passed to KinesisUtils.createStream() > - > > Key: SPARK-5960 > URL: https://issues.apache.org/jira/browse/SPARK-5960 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > > While IAM roles are preferable, we're seeing a lot of cases where we need to > pass AWS credentials when creating the KinesisReceiver. > Notes: > * Make sure we don't log the credentials anywhere > * Maintain compatibility with existing KinesisReceiver-based code. >
[jira] [Commented] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
[ https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482455#comment-14482455 ] Chris Fregly commented on SPARK-6514: - We may want to inspect the stream URL for the region. Otherwise, we would need to make the new regionName param more explicit about its meaning, i.e. dynamoRegion, but this exposes the implementation, which is not good. > For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as > the Kinesis stream itself > > > Key: SPARK-6514 > URL: https://issues.apache.org/jira/browse/SPARK-6514 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Chris Fregly > > Context: I started the original Kinesis impl with KCL 1.0 (not supported), > then finished on KCL 1.1 (supported) without realizing that it was supported. > Also, we should upgrade to the latest Kinesis Client Library (KCL), which is > currently v1.2, I believe.
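For illustration of the "inspect the stream URL" idea: Kinesis endpoints follow the pattern https://kinesis.<region>.amazonaws.com, so the region could plausibly be parsed out as below. This is a sketch only, not the proposed implementation.

{code}
# Sketch only: parse the region out of a Kinesis endpoint URL of the usual
# https://kinesis.<region>.amazonaws.com form (Python 2, matching the era).
from urlparse import urlparse

def region_from_endpoint(endpoint_url):
    host = urlparse(endpoint_url).netloc      # e.g. kinesis.us-east-1.amazonaws.com
    parts = host.split(".")
    if len(parts) >= 4 and parts[0] == "kinesis":
        return parts[1]                       # e.g. us-east-1
    raise ValueError("cannot infer region from %s" % endpoint_url)

print(region_from_endpoint("https://kinesis.us-east-1.amazonaws.com"))  # us-east-1
{code}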
[jira] [Updated] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
[ https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-6514: Description: Context: I started the original Kinesis impl with KCL 1.0 (not supported), then finished on KCL 1.1 (supported) without realizing that it was supported. Also, we should upgrade to the latest Kinesis Client Library (KCL), which is currently v1.2, I believe. was: Context: I started the original Kinesis impl with KCL 1.0 (not supported), then finished on KCL 1.1 (supported). Also, we should upgrade to the latest Kinesis Client Library (KCL), which is currently v1.2, I believe. > For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as > the Kinesis stream itself > > > Key: SPARK-6514 > URL: https://issues.apache.org/jira/browse/SPARK-6514 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Chris Fregly > > Context: I started the original Kinesis impl with KCL 1.0 (not supported), > then finished on KCL 1.1 (supported) without realizing that it was supported. > Also, we should upgrade to the latest Kinesis Client Library (KCL), which is > currently v1.2, I believe.
[jira] [Updated] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
[ https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-6514: Description: Context: I started the original Kinesis impl with KCL 1.0 (not supported), then finished on KCL 1.1 (supported). Also, we should upgrade to the latest Kinesis Client Library (KCL), which is currently v1.2, I believe. was: This was not supported when I originally wrote this receiver. This is now supported. Also, upgrade to the latest Kinesis Client Library (KCL), which is 1.2, I believe. > For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as > the Kinesis stream itself > > > Key: SPARK-6514 > URL: https://issues.apache.org/jira/browse/SPARK-6514 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Chris Fregly > > Context: I started the original Kinesis impl with KCL 1.0 (not supported), > then finished on KCL 1.1 (supported). > Also, we should upgrade to the latest Kinesis Client Library (KCL), which is > currently v1.2, I believe.
[jira] [Updated] (SPARK-6599) Improve reliability and usability of Kinesis-based Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-6599: Summary: Improve reliability and usability of Kinesis-based Spark Streaming (was: Add Kinesis Direct API) > Improve reliability and usability of Kinesis-based Spark Streaming > -- > > Key: SPARK-6599 > URL: https://issues.apache.org/jira/browse/SPARK-6599 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >
[jira] [Updated] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
[ https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-6514: Target Version/s: 1.4.0 (was: 1.3.1) > For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as > the Kinesis stream itself > > > Key: SPARK-6514 > URL: https://issues.apache.org/jira/browse/SPARK-6514 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Chris Fregly > > This was not supported when I originally wrote this receiver. > This is now supported. Also, upgrade to the latest Kinesis Client Library > (KCL), which is 1.2, I believe.
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392646#comment-14392646 ] Chris Fregly commented on SPARK-6407: - from [~mengxr] "The online update should be implemented with GraphX or indexedrdd, which may take some time. There is no open-source solution. Try doing a survey on existing algorithms for online matrix factorization updates." > Streaming ALS for Collaborative Filtering > - > > Key: SPARK-6407 > URL: https://issues.apache.org/jira/browse/SPARK-6407 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Felix Cheung >Priority: Minor > > Like MLlib's ALS implementation for recommendation, but applied to streaming. > Similar to streaming linear regression, logistic regression, could we apply > gradient updates to batches of data and reuse the existing MLlib implementation?
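To give a flavor of what an "online matrix factorization update" means here, below is a toy single-observation SGD update on user/item latent factors with L2 regularization. It is a sketch of one textbook approach, not the GraphX/IndexedRDD design the comment refers to.

{code}
# Toy sketch: one SGD step on user/item latent factors for a single new
# (user, item, rating) observation, with L2 regularization.
def sgd_update(u, v, rating, lr=0.01, reg=0.1):
    pred = sum(ui * vi for ui, vi in zip(u, v))
    err = rating - pred
    u_new = [ui + lr * (err * vi - reg * ui) for ui, vi in zip(u, v)]
    v_new = [vi + lr * (err * ui - reg * vi) for ui, vi in zip(u, v)]
    return u_new, v_new

u, v = [0.1, 0.2], [0.3, 0.4]     # initial latent factors
u, v = sgd_update(u, v, rating=4.0)
print(u, v)
{code}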
[jira] [Updated] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-5960: Target Version/s: 1.4.0 (was: 1.3.1) > Allow AWS credentials to be passed to KinesisUtils.createStream() > - > > Key: SPARK-5960 > URL: https://issues.apache.org/jira/browse/SPARK-5960 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > > While IAM roles are preferable, we're seeing a lot of cases where we need to > pass AWS credentials when creating the KinesisReceiver. > Notes: > * Make sure we don't log the credentials anywhere > * Maintain compatibility with existing KinesisReceiver-based code. >
[jira] [Updated] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions
[ https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-4184: Target Version/s: 1.4.0 (was: 1.3.1) > Improve Spark Streaming documentation to address commonly-asked questions > -- > > Key: SPARK-4184 > URL: https://issues.apache.org/jira/browse/SPARK-4184 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: Chris Fregly > Labels: documentation, streaming > > Improve Streaming documentation including API descriptions, > concurrency/thread safety, fault tolerance, replication, checkpointing, > scalability, resource allocation and utilization, back pressure, and > monitoring. > Also, add a section to the Kinesis streaming guide describing how to use IAM > roles with the Spark Kinesis Receiver.
[jira] [Created] (SPARK-6656) Allow the application name to be passed in versus pulling from SparkContext.getAppName()
Chris Fregly created SPARK-6656: --- Summary: Allow the application name to be passed in versus pulling from SparkContext.getAppName() Key: SPARK-6656 URL: https://issues.apache.org/jira/browse/SPARK-6656 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly This is useful for the scenario where Kinesis Spark Streaming is being invoked from the Spark Shell. In this case, the application name in the SparkContext is pre-set to "Spark Shell". This isn't a common or recommended use case, but it's best to make this configurable outside of SparkContext.
[jira] [Created] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library
Chris Fregly created SPARK-6654: --- Summary: Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library Key: SPARK-6654 URL: https://issues.apache.org/jira/browse/SPARK-6654 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly
[jira] [Created] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
Chris Fregly created SPARK-6514: --- Summary: For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself Key: SPARK-6514 URL: https://issues.apache.org/jira/browse/SPARK-6514 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Chris Fregly This was not supported when I originally wrote this receiver. This is now supported. Also, upgrade to the latest Kinesis Client Library (KCL), which is 1.2, I believe.
[jira] [Commented] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions
[ https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348241#comment-14348241 ] Chris Fregly commented on SPARK-4184: - Add a reference to the Kinesis docs re: the following Kinesis feature: https://aws.amazon.com/blogs/aws/amazon-kinesis-update-reduced-prop-delay/ > Improve Spark Streaming documentation to address commonly-asked questions > -- > > Key: SPARK-4184 > URL: https://issues.apache.org/jira/browse/SPARK-4184 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: Chris Fregly > Labels: documentation, streaming > > Improve Streaming documentation including API descriptions, > concurrency/thread safety, fault tolerance, replication, checkpointing, > scalability, resource allocation and utilization, back pressure, and > monitoring. > Also, add a section to the Kinesis streaming guide describing how to use IAM > roles with the Spark Kinesis Receiver.
[jira] [Commented] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions
[ https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346371#comment-14346371 ] Chris Fregly commented on SPARK-4184: - Hey [~sowen]! I'm gonna move this to 1.3.1. This will complement my work on https://issues.apache.org/jira/browse/SPARK-5960, although the documentation improvements will extend beyond Kinesis. Does that sound reasonable? I'll close this out once and for all! :) Thanks! -Chris > Improve Spark Streaming documentation to address commonly-asked questions > -- > > Key: SPARK-4184 > URL: https://issues.apache.org/jira/browse/SPARK-4184 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: Chris Fregly > Labels: documentation, streaming > > Improve Streaming documentation including API descriptions, > concurrency/thread safety, fault tolerance, replication, checkpointing, > scalability, resource allocation and utilization, back pressure, and > monitoring. > Also, add a section to the Kinesis streaming guide describing how to use IAM > roles with the Spark Kinesis Receiver.
[jira] [Updated] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions
[ https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-4184: Summary: Improve Spark Streaming documentation to address commonly-asked questions (was: Improve Spark Streaming documentation) > Improve Spark Streaming documentation to address commonly-asked questions > -- > > Key: SPARK-4184 > URL: https://issues.apache.org/jira/browse/SPARK-4184 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: Chris Fregly > Labels: documentation, streaming > > Improve Streaming documentation including API descriptions, > concurrency/thread safety, fault tolerance, replication, checkpointing, > scalability, resource allocation and utilization, back pressure, and > monitoring. > Also, add a section to the Kinesis streaming guide describing how to use IAM > roles with the Spark Kinesis Receiver.
[jira] [Updated] (SPARK-4184) Improve Spark Streaming documentation
[ https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-4184: Target Version/s: 1.3.1 (was: 1.2.0) > Improve Spark Streaming documentation > - > > Key: SPARK-4184 > URL: https://issues.apache.org/jira/browse/SPARK-4184 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: Chris Fregly > Labels: documentation, streaming > > Improve Streaming documentation including API descriptions, > concurrency/thread safety, fault tolerance, replication, checkpointing, > scalability, resource allocation and utilization, back pressure, and > monitoring. > Also, add a section to the Kinesis streaming guide describing how to use IAM > roles with the Spark Kinesis Receiver.
[jira] [Created] (SPARK-6102) Create a SparkSQL DataSource API implementation for Redshift
Chris Fregly created SPARK-6102: --- Summary: Create a SparkSQL DataSource API implementation for Redshift Key: SPARK-6102 URL: https://issues.apache.org/jira/browse/SPARK-6102 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Chris Fregly
[jira] [Created] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB
Chris Fregly created SPARK-6101: --- Summary: Create a SparkSQL DataSource API implementation for DynamoDB Key: SPARK-6101 URL: https://issues.apache.org/jira/browse/SPARK-6101 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Chris Fregly Fix For: 1.4.0 similar to https://github.com/databricks/spark-avro
[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342359#comment-14342359 ] Chris Fregly commented on SPARK-5960: - Linking to an old JIRA where this was originally brought up. Decided to add support for AWS credentials for non-IAM environments - of which we still see a fair amount. This will open up Kinesis to more environments. > Allow AWS credentials to be passed to KinesisUtils.createStream() > - > > Key: SPARK-5960 > URL: https://issues.apache.org/jira/browse/SPARK-5960 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > > While IAM roles are preferable, we're seeing a lot of cases where we need to > pass AWS credentials when creating the KinesisReceiver. > Notes: > * Make sure we don't log the credentials anywhere > * Maintain compatibility with existing KinesisReceiver-based code. >
[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier
[ https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341901#comment-14341901 ] Chris Fregly commented on SPARK-4144: - Hey [~freeman-lab]! I was literally just talking to [~josephkb] in the office last week about picking this up. Great timing! Let's coordinate offline. I'll shoot you an email. -Chris > Support incremental model training of Naive Bayes classifier > > > Key: SPARK-4144 > URL: https://issues.apache.org/jira/browse/SPARK-4144 > Project: Spark > Issue Type: Improvement > Components: MLlib, Streaming >Reporter: Chris Fregly >Assignee: Jeremy Freeman > > Per Xiangrui Meng from the following user list discussion: > http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E > > "For Naive Bayes, we need to update the priors and conditional > probabilities, which means we should also remember the number of > observations for the updates." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335846#comment-14335846 ] Chris Fregly commented on SPARK-5960: - pushing this up to 1.3.1 > Allow AWS credentials to be passed to KinesisUtils.createStream() > - > > Key: SPARK-5960 > URL: https://issues.apache.org/jira/browse/SPARK-5960 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > > While IAM roles are preferable, we're seeing a lot of cases where we need to > pass AWS credentials when creating the KinesisReceiver. > Notes: > * Make sure we don't log the credentials anywhere > * Maintain compatibility with existing KinesisReceiver-based code. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-5960: Target Version/s: 1.3.1 (was: 1.4.0) > Allow AWS credentials to be passed to KinesisUtils.createStream() > - > > Key: SPARK-5960 > URL: https://issues.apache.org/jira/browse/SPARK-5960 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > > While IAM roles are preferable, we're seeing a lot of cases where we need to > pass AWS credentials when creating the KinesisReceiver. > Notes: > * Make sure we don't log the credentials anywhere > * Maintain compatibility with existing KinesisReceiver-based code. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver
[ https://issues.apache.org/jira/browse/SPARK-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334256#comment-14334256 ] Chris Fregly commented on SPARK-5959: - Checkpointing at a specific sequence number is supported by the IRecordProcessorCheckpointer interface. > Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver > - > > Key: SPARK-5959 > URL: https://issues.apache.org/jira/browse/SPARK-5959 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly > > After each block is stored reliably in the WAL (after the store() call > returns), ACK back to Kinesis. > There is still the issue of the ReliableKinesisReceiver dying before the ACK > back to Kinesis, however no data will be lost. Duplicate data is still > possible. > Notes: > * Make sure we're not overloading the checkpoint control plane which uses > DynamoDB. > * May need to disable auto-checkpointing and remove the checkpoint interval. > * Maintain compatibility with existing KinesisReceiver-based code. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
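Concretely, the KCL call in question is IRecordProcessorCheckpointer.checkpoint(sequenceNumber), which ACKs everything up to and including that record. Below is a rough sketch of the store-then-ACK ordering described in this issue; the receiver wiring is simplified and hypothetical, not the actual Spark implementation.

{code:title=ReliableRecordProcessor.scala|borderStyle=solid}
import java.util.{List => JList}
import scala.collection.JavaConverters._
import com.amazonaws.services.kinesis.clientlibrary.interfaces.{IRecordProcessor, IRecordProcessorCheckpointer}
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownReason
import com.amazonaws.services.kinesis.model.Record
import org.apache.spark.streaming.receiver.Receiver

class ReliableRecordProcessor(receiver: Receiver[Array[Byte]]) extends IRecordProcessor {

  override def initialize(shardId: String): Unit = {}

  override def processRecords(records: JList[Record], checkpointer: IRecordProcessorCheckpointer): Unit = {
    val batch = records.asScala
    if (batch.nonEmpty) {
      // store() returns only after the block is stored reliably (in the WAL,
      // when the WAL is enabled)...
      receiver.store(batch.map(_.getData.array()).toIterator)
      // ...so we can now ACK back to Kinesis at the batch's last sequence number.
      // Dying between store() and checkpoint() re-delivers the batch:
      // possible duplicates, but no data loss.
      checkpointer.checkpoint(batch.last.getSequenceNumber)
    }
  }

  override def shutdown(checkpointer: IRecordProcessorCheckpointer, reason: ShutdownReason): Unit = {}
}
{code}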
[jira] [Created] (SPARK-5961) Allow specific nodes in a Spark Streaming cluster to be dedicated/preferred as Receiver Worker Nodes versus regular Spark Worker Nodes
Chris Fregly created SPARK-5961: --- Summary: Allow specific nodes in a Spark Streaming cluster to be dedicated/preferred as Receiver Worker Nodes versus regular Spark Worker Nodes Key: SPARK-5961 URL: https://issues.apache.org/jira/browse/SPARK-5961 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly This type of configuration has come up a lot: certain nodes should be dedicated as Spark Streaming Receivers, with the rest serving as regular Spark Workers. The reasons include the following: 1) Different instance types/sizes for Receivers vs. regular Workers 2) Different OS tuning params for Receivers vs. regular Workers ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
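The closest existing hook is that a custom receiver can state a node preference, as in the sketch below (the hostname is hypothetical). This only expresses a preference to the scheduler; actually dedicating nodes, i.e. keeping ordinary tasks off them, is what this issue asks for.

{code:title=PinnedReceiver.scala|borderStyle=solid}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A custom receiver that prefers (but is not guaranteed) to run on one node.
class PinnedReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // "receiver-node-1" is a hypothetical hostname in the cluster.
  override def preferredLocation: Option[String] = Some("receiver-node-1")

  override def onStart(): Unit = {
    // start background threads here that call store(...)
  }

  override def onStop(): Unit = {}
}
{code}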
[jira] [Created] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
Chris Fregly created SPARK-5960: --- Summary: Allow AWS credentials to be passed to KinesisUtils.createStream() Key: SPARK-5960 URL: https://issues.apache.org/jira/browse/SPARK-5960 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly While IAM roles are preferable, we're seeing a lot of cases where we need to pass AWS credentials when creating the KinesisReceiver. Notes: * Make sure we don't log the credentials anywhere * Maintain compatibility with existing KinesisReceiver-based code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver
[ https://issues.apache.org/jira/browse/SPARK-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-5959: Component/s: Streaming Description: After each block is stored reliably in the WAL (after the store() call returns), ACK back to Kinesis. There is still the issue of the ReliableKinesisReceiver dying before the ACK back to Kinesis, however no data will be lost. Duplicate data is still possible. Notes: * Make sure we're not overloading the checkpoint control plane which uses DynamoDB. * May need to disable auto-checkpointing and remove the checkpoint interval. * Maintain compatibility with existing KinesisReceiver-based code. was: After each block is stored reliably in the WAL (after the store() call returns), ACK back to Kinesis. There is still the issue of the ReliableKinesisReceiver dying before the ACK back to Kinesis, however no data will be lost. Duplicate data is still possible. Notes * Make sure we're not overloading the checkpoint control plane which uses DynamoDB. * May need to disable auto-checkpointing and remove the checkpoint interval. Target Version/s: 1.4.0 Affects Version/s: 1.1.0 Fix Version/s: (was: 1.4.0) > Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver > - > > Key: SPARK-5959 > URL: https://issues.apache.org/jira/browse/SPARK-5959 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly > > After each block is stored reliably in the WAL (after the store() call > returns), ACK back to Kinesis. > There is still the issue of the ReliableKinesisReceiver dying before the ACK > back to Kinesis, however no data will be lost. Duplicate data is still > possible. > Notes: > * Make sure we're not overloading the checkpoint control plane which uses > DynamoDB. > * May need to disable auto-checkpointing and remove the checkpoint interval. > * Maintain compatibility with existing KinesisReceiver-based code. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver
Chris Fregly created SPARK-5959: --- Summary: Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver Key: SPARK-5959 URL: https://issues.apache.org/jira/browse/SPARK-5959 Project: Spark Issue Type: Improvement Reporter: Chris Fregly Fix For: 1.4.0 After each block is stored reliably in the WAL (after the store() call returns), ACK back to Kinesis. There is still the issue of the ReliableKinesisReceiver dying before the ACK back to Kinesis, however no data will be lost. Duplicate data is still possible. Notes * Make sure we're not overloading the checkpoint control plane which uses DynamoDB. * May need to disable auto-checkpointing and remove the checkpoint interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier
[ https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305351#comment-14305351 ] Chris Fregly commented on SPARK-4144: - Hi there! Any update on this? I was thinking of working on this as it's been idle for the last few months. Lemme know. Thanks! -Chris > Support incremental model training of Naive Bayes classifier > > > Key: SPARK-4144 > URL: https://issues.apache.org/jira/browse/SPARK-4144 > Project: Spark > Issue Type: Improvement > Components: MLlib, Streaming >Reporter: Chris Fregly >Assignee: Liquan Pei > > Per Xiangrui Meng from the following user list discussion: > http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E > > "For Naive Bayes, we need to update the priors and conditional > probabilities, which means we should also remember the number of > observations for the updates." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4184) Improve Spark Streaming documentation
[ https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262377#comment-14262377 ] Chris Fregly commented on SPARK-4184: - hey josh- lemme go through my notes and figure out if everything got into TD's latest iteration of the docs. i'll get back to you in the next few days. good catch. > Improve Spark Streaming documentation > - > > Key: SPARK-4184 > URL: https://issues.apache.org/jira/browse/SPARK-4184 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: Chris Fregly > Labels: documentation, streaming > > Improve Streaming documentation including API descriptions, > concurrency/thread safety, fault tolerance, replication, checkpointing, > scalability, resource allocation and utilization, back pressure, and > monitoring. > also, add a section to the kinesis streaming guide describing how to use IAM > roles with the Spark Kinesis Receiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4689) Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java
Chris Fregly created SPARK-4689: --- Summary: Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java Key: SPARK-4689 URL: https://issues.apache.org/jira/browse/SPARK-4689 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Chris Fregly Priority: Minor Currently, you need to use unionAll() in Scala. Python does not expose this functionality at the moment. The current workaround is to use the UNION ALL HiveQL functionality detailed here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
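Both routes side by side, as a sketch; the table names are made up, and `sc` is assumed to be an existing SparkContext.

{code:title=UnionAllWorkaround.scala|borderStyle=solid}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // assumes an existing SparkContext `sc`

val a = hiveContext.sql("SELECT * FROM events_2014") // hypothetical tables
val b = hiveContext.sql("SELECT * FROM events_2015")

// Scala: unionAll() on SchemaRDD works today.
val unioned = a.unionAll(b)

// HiveQL workaround (also usable from Python, which doesn't expose unionAll yet):
val unionedViaSql = hiveContext.sql(
  "SELECT * FROM events_2014 UNION ALL SELECT * FROM events_2015")
{code}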
[jira] [Commented] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider
[ https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204334#comment-14204334 ] Chris Fregly commented on SPARK-3640: - quick update: Aniket and I spoke off-line about using AWS IAM Instance Profiles for EC2 instances. These work similarly to IAM User Profiles - you can apply fine-grained IAM Policies to EC2 instances. The DefaultCredentialsProvider handles all of these sources of AWS credentials. I am adding all of this to the Kinesis Spark Streaming Guide, btw. Summary: we may be able to close this jira without a change. just waiting for Aniket to confirm that this AWS Instance Profile approach satisfies his need. it seems to be a safer approach than passing credentials between Spark Driver and Worker nodes. > KinesisUtils should accept a credentials object instead of forcing > DefaultCredentialsProvider > - > > Key: SPARK-3640 > URL: https://issues.apache.org/jira/browse/SPARK-3640 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Aniket Bhatnagar > Labels: kinesis > > KinesisUtils should accept AWS Credentials as a parameter and should default > to DefaultCredentialsProvider if no credentials are provided. Currently, the > implementation forces usage of DefaultCredentialsProvider which can be a pain > especially when jobs are run by multiple unix users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
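To make the "sources of AWS credentials" point concrete: the SDK's default chain resolves credentials without any being passed in code. A small sketch (the stream name is hypothetical):

{code:title=DefaultChain.scala|borderStyle=solid}
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.AmazonKinesisClient

// The default chain checks, in SDK-defined order: environment variables,
// Java system properties, and the EC2 instance profile (the IAM role
// attached to the instance). No keys appear in code or config.
val kinesisClient = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain())

val result = kinesisClient.describeStream("myStream") // hypothetical stream name
println(result.getStreamDescription.getStreamStatus)
{code}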
[jira] [Created] (SPARK-4184) Improve Spark Streaming documentation
Chris Fregly created SPARK-4184: --- Summary: Improve Spark Streaming documentation Key: SPARK-4184 URL: https://issues.apache.org/jira/browse/SPARK-4184 Project: Spark Issue Type: Documentation Components: Streaming Reporter: Chris Fregly Fix For: 1.2.0 Improve Streaming documentation including API descriptions, concurrency/thread safety, fault tolerance, replication, checkpointing, scalability, resource allocation and utilization, back pressure, and monitoring. also, add a section to the kinesis streaming guide describing how to use IAM roles with the Spark Kinesis Receiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3639) Kinesis examples set master as local
[ https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193285#comment-14193285 ] Chris Fregly commented on SPARK-3639: - great catch, aniket! for some background context, i was trying to make the sample easier to run out of the box. i overlooked the spark-submit scenario, unfortunately. thanks for fixing this. few things: 1) does the Streaming Kinesis Guide (docs/streaming-kinesis-integration.md) need updating with your change? specifically, the Running the Example section? i don't think so, but something to double-check. 2) i noticed you put a comment in the scaladoc about needing one more worker/thread than there are receivers. perhaps we should reword this to say that (number of kinesis shards + 1) workers/threads are needed, because the number of receivers is determined by the number of shards in the kinesis stream. might tighten up the message a bit. 3) should we throw an error if the number of workers/threads is not sufficient? nobody likes an error message, but it might be helpful here. this is the basis of https://issues.apache.org/jira/browse/SPARK-2475, btw. might want to keep an eye on that jira. thanks again, man. great catch. -chris > Kinesis examples set master as local > > > Key: SPARK-3639 > URL: https://issues.apache.org/jira/browse/SPARK-3639 > Project: Spark > Issue Type: Bug > Components: Examples, Streaming >Affects Versions: 1.0.2, 1.1.0 >Reporter: Aniket Bhatnagar >Assignee: Aniket Bhatnagar >Priority: Minor > Labels: examples > Fix For: 1.1.1, 1.2.0 > > > Kinesis examples set master as local, thus not allowing the example to be > tested on a cluster -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider
[ https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192411#comment-14192411 ] Chris Fregly commented on SPARK-3640: - Agreed that this was not ideal when i first chose this implementation. And as you mentioned, the NotSerializableException is exactly why I went with the DefaultCredentialsProvider. So I spent some time trying to solve this using AWS IAM Roles on separate users under your root AWS account. This appears to work well with the existing DefaultCredentialsProvider. Is this a viable option for you? Basically, every user would get their own ACCESS_KEY_ID and SECRET_KEY. This would be used in place of the root credentials. For thoroughness, I've included links to the instructions as well as an example IAM Policy JSON. (I'll also add this to the Spark Kinesis Developer Guide: http://spark.apache.org/docs/latest/streaming-kinesis-integration.html)

Creating IAM users:
http://docs.aws.amazon.com/IAM/latest/UserGuide/Using_SettingUpUser.html
https://console.aws.amazon.com/iam/home?#security_credential

Setting up the Kinesis, DynamoDB, and CloudWatch IAM Policy for the new users:
http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-using-iam.html

IAM Policy Generator:
http://awspolicygen.s3.amazonaws.com/policygen.html

Attaching the Custom Policy:
https://console.aws.amazon.com/iam/home?#users
Select the user, then Attach Policy, then Custom Policy.

IAM Policy JSON - this is already generated using the Policy Generator above... just fill in the placeholders specific to your environment:

{code}
{
  "Statement": [
    {
      "Sid": "Stmt1414784467497",
      "Action": "kinesis:*",
      "Effect": "Allow",
      "Resource": "arn:aws:kinesis:<region>:<account-id>:stream/<stream-name>"
    },
    {
      "Sid": "Stmt1414784693732",
      "Action": "dynamodb:*",
      "Effect": "Allow",
      "Resource": "arn:aws:dynamodb:us-east-1:<account-id>:table/<table-name>"
    },
    {
      "Sid": "Stmt1414785131046",
      "Action": "cloudwatch:*",
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
{code}

Notes:
* The region of the DynamoDB table is intentionally hard-coded to us-east-1 as this is how Kinesis currently works.
* The DynamoDB table name is the same as the application name of the Kinesis Streaming Application. The sample included with the Spark distribution uses KinesisWordCount for the application/table name.

Is this a sufficient workaround? Using IAM Policies is an AWS best practice, but I'm not sure if this aligns with your existing environment. If not, I can continue to investigate exposing that CredentialsProvider. Lemme know, Aniket! > KinesisUtils should accept a credentials object instead of forcing > DefaultCredentialsProvider > - > > Key: SPARK-3640 > URL: https://issues.apache.org/jira/browse/SPARK-3640 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Aniket Bhatnagar > Labels: kinesis > > KinesisUtils should accept AWS Credentials as a parameter and should default > to DefaultCredentialsProvider if no credentials are provided. Currently, the > implementation forces usage of DefaultCredentialsProvider which can be a pain > especially when jobs are run by multiple unix users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4144) Support incremental model training of Naive Bayes classifier
Chris Fregly created SPARK-4144: --- Summary: Support incremental model training of Naive Bayes classifier Key: SPARK-4144 URL: https://issues.apache.org/jira/browse/SPARK-4144 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Chris Fregly Per Xiangrui Meng from the following user list discussion: http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E "For Naive Bayes, we need to update the priors and conditional probabilities, which means we should also remember the number of observations for the updates." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
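As a sketch of the bookkeeping in that quote, independent of MLlib's actual internals: keep per-class observation counts and per-class feature sums, fold each new batch into them, and recompute priors and (smoothed) conditionals from the running totals at any point. All names and the smoothing choice here are illustrative only, not the eventual MLlib API.

{code:title=IncrementalNBSketch.scala|borderStyle=solid}
// Running sufficient statistics for a multinomial Naive Bayes model.
case class NBStats(
    classCounts: Map[Int, Long],           // number of observations per class
    featureSums: Map[Int, Array[Double]])  // per-class sums of feature vectors

object NBStats {

  val empty = NBStats(Map.empty, Map.empty)

  // Fold one labeled example into the running totals (called per record/batch).
  def update(s: NBStats, label: Int, features: Array[Double]): NBStats = {
    val counts = s.classCounts.updated(label, s.classCounts.getOrElse(label, 0L) + 1L)
    val oldSums = s.featureSums.getOrElse(label, Array.fill(features.length)(0.0))
    val newSums = oldSums.zip(features).map { case (a, b) => a + b }
    NBStats(counts, s.featureSums.updated(label, newSums))
  }

  // Prior P(c): this is why the observation counts must be remembered.
  def prior(s: NBStats, label: Int): Double =
    s.classCounts.getOrElse(label, 0L).toDouble / s.classCounts.values.sum

  // Conditional P(feature | c) with additive smoothing lambda.
  def conditional(s: NBStats, label: Int, feature: Int, lambda: Double = 1.0): Double = {
    val sums = s.featureSums(label)
    (sums(feature) + lambda) / (sums.sum + lambda * sums.length)
  }
}
{code}

Merging two such stats objects (e.g., computed per partition or per micro-batch) is the same element-wise fold, which is what makes the streaming update cheap.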
[jira] [Commented] (SPARK-2579) Reading from S3 returns an inconsistent number of items with Spark 0.9.1
[ https://issues.apache.org/jira/browse/SPARK-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116643#comment-14116643 ] Chris Fregly commented on SPARK-2579: - interesting and possibly-related blog post from netflix earlier this year: http://techblog.netflix.com/2014/01/s3mper-consistency-in-cloud.html
> Reading from S3 returns an inconsistent number of items with Spark 0.9.1
>
> Key: SPARK-2579
> URL: https://issues.apache.org/jira/browse/SPARK-2579
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 0.9.1
> Reporter: Eemil Lagerspetz
> Priority: Critical
> Labels: hdfs, read, s3, skipping
>
> I have created a random matrix of 1M rows with 10K items on each row,
> semicolon-separated. While reading it with Spark 0.9.1 and doing a count, I
> consistently get less than 1M rows, and a different number every time at that
> ( !! ). Example below:
> head -n 1 tool-generate-random-matrix*log
> ==> tool-generate-random-matrix-999158.log <==
> Row item counts: 999158
> ==> tool-generate-random-matrix.log <==
> Row item counts: 997163
> The data is split into 1000 partitions. When I download it using s3cmd sync,
> and run the following AWK on it, I get the correct number of rows in each
> partition (1000x1000 = 1M). What is up?
> {code:title=checkrows.sh|borderStyle=solid}
> for k in part-0*
> do
>   echo $k
>   awk -F ";" '
>   NF != 1 {
>     print "Wrong number of items:",NF
>   }
>   END {
>     if (NR != 1000) {
>       print "Wrong number of rows:",NR
>     }
>   }' "$k"
> done
> {code}
> The matrix generation and counting code is below:
> {code:title=Matrix.scala|borderStyle=solid}
> package fi.helsinki.cs.nodes.matrix
>
> import java.util.Random
> import org.apache.spark._
> import org.apache.spark.SparkContext._
> import scala.collection.mutable.ListBuffer
> import org.apache.spark.rdd.RDD
> import org.apache.spark.storage.StorageLevel._
>
> object GenerateRandomMatrix {
>   def NewGeMatrix(rSeed: Int, rdd: RDD[Int], features: Int) = {
>     rdd.mapPartitions(part => part.map(xarr => {
>       val rdm = new Random(rSeed + xarr)
>       val arr = new Array[Double](features)
>       for (i <- 0 until features)
>         arr(i) = rdm.nextDouble()
>       new Row(xarr, arr)
>     }))
>   }
>
>   case class Row(id: Int, elements: Array[Double]) {}
>
>   def rowFromText(line: String) = {
>     val idarr = line.split(" ")
>     val arr = idarr(1).split(";")
>     // -1 to fix saved matrix indexing error
>     new Row(idarr(0).toInt-1, arr.map(_.toDouble))
>   }
>
>   def main(args: Array[String]) {
>     val master = args(0)
>     val tasks = args(1).toInt
>     val savePath = args(2)
>     val read = args.contains("read")
>
>     val datapoints = 1000000
>     val features = 10000
>     val sc = new SparkContext(master, "RandomMatrix")
>     if (read) {
>       val randomMatrix: RDD[Row] = sc.textFile(savePath, tasks).map(rowFromText).persist(MEMORY_AND_DISK)
>       println("Row item counts: " + randomMatrix.count)
>     } else {
>       val rdd = sc.parallelize(0 until datapoints, tasks)
>       val bcSeed = sc.broadcast(128)
>       /* Generating a matrix of random Doubles */
>       val randomMatrix = NewGeMatrix(bcSeed.value, rdd, features).persist(MEMORY_AND_DISK)
>       randomMatrix.map(row => row.id + " " + row.elements.mkString(";")).saveAsTextFile(savePath)
>     }
>
>     sc.stop
>   }
> }
> {code}
> I run this with:
> appassembler/bin/tool-generate-random-matrix master 1000 s3n://keys@path/to/data 1>matrix.log 2>matrix.err
> Reading from HDFS gives the right count and right number of items on each
> row. However, I had to run with the full path with the server name; just
> /matrix does not work (it thinks I want file://):
> p="hdfs://ec2-54-188-6-77.us-west-2.compute.amazonaws.com:9000/matrix"
> appassembler/bin/tool-generate-random-matrix $( cat /root/spark-ec2/cluster-url ) 1000 "$p" read 1>readmatrix.log 2>readmatrix.err
-- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2475) Check whether #cores > #receivers in local mode
[ https://issues.apache.org/jira/browse/SPARK-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114204#comment-14114204 ] Chris Fregly commented on SPARK-2475: - another option for the examples, specifically, is to default the number of local threads similar to how the Kinesis example does it: https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L104 i get the number of shards in the given Kinesis stream and add 1. the goal was to make this example work out of the box with little friction - even an error message can be discouraging. for the other examples, we could just default to 2. the advanced user can override if they want. though i don't think i support an override in my kinesis example. whoops! :) > Check whether #cores > #receivers in local mode > --- > > Key: SPARK-2475 > URL: https://issues.apache.org/jira/browse/SPARK-2475 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tathagata Das > > When the number of slots in local mode is not more than the number of > receivers, then the system should throw an error. Otherwise the system just > keeps waiting for resources to process the received data. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
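The shards + 1 arithmetic from the linked example, sketched below (the stream name is hypothetical): one local thread per shard-backed receiver, plus at least one left over to actually process the received batches.

{code:title=LocalThreads.scala|borderStyle=solid}
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.AmazonKinesisClient
import org.apache.spark.SparkConf

val client = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain())

// One receiver per shard; +1 so at least one thread can process the data.
val numShards = client.describeStream("myStream")
  .getStreamDescription.getShards.size

val sparkConf = new SparkConf()
  .setAppName("KinesisWordCount")
  .setMaster(s"local[${numShards + 1}]")
{code}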
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083836#comment-14083836 ] Chris Fregly commented on SPARK-1981: - hey nick- due to the Kinesis Client Library's ASL license restriction, we ended up isolating all kinesis-related code to the extras/kinesis-asl module. this module can be activated at build time by including -Pkinesis-asl in either sbt or maven. this is all documented here, btw: https://github.com/apache/spark/blob/master/docs/streaming-kinesis.md looks like i messed up the markdown a bit. whoops! but the details are all there. i'll try to clean that up. > Add AWS Kinesis streaming support > - > > Key: SPARK-1981 > URL: https://issues.apache.org/jira/browse/SPARK-1981 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Chris Fregly >Assignee: Chris Fregly > Fix For: 1.1.0 > > > Add AWS Kinesis support to Spark Streaming. > Initial discussion occurred here: https://github.com/apache/spark/pull/223 > I discussed this with Parviz from AWS recently and we agreed that I would > take this over. > Look for a new PR that takes into account all the feedback from the earlier > PR including spark-1.0-compliant implementation, AWS-license-aware build > support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2770) Rename spark-ganglia-lgpl to ganglia-lgpl
Chris Fregly created SPARK-2770: --- Summary: Rename spark-ganglia-lgpl to ganglia-lgpl Key: SPARK-2770 URL: https://issues.apache.org/jira/browse/SPARK-2770 Project: Spark Issue Type: Improvement Components: Build Reporter: Chris Fregly Priority: Minor Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078422#comment-14078422 ] Chris Fregly commented on SPARK-1981: - [~matei] the ec2 scripts allow you to specify a github repo and commit hash, so i assumed they can build from source. if this is the case, i need the ability to pass the list of -P build profiles such as -Pspark-kinesis-asl, which i don't think exists currently. how about the audit and release process? have i covered everything there? thanks! -chris > Add AWS Kinesis streaming support > - > > Key: SPARK-1981 > URL: https://issues.apache.org/jira/browse/SPARK-1981 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Chris Fregly >Assignee: Chris Fregly > > Add AWS Kinesis support to Spark Streaming. > Initial discussion occurred here: https://github.com/apache/spark/pull/223 > I discussed this with Parviz from AWS recently and we agreed that I would > take this over. > Look for a new PR that takes into account all the feedback from the earlier > PR including spark-1.0-compliant implementation, AWS-license-aware build > support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072761#comment-14072761 ] Chris Fregly commented on SPARK-1981: - in addition to the ec2 scripts, can someone verify that all other build-and-release-related use cases have been covered? i mimic'd the ganglia extras project, but this project doesn't seem to be covered by either the ec2 scripts or the audit-release process. perhaps we should add it, as well? any advice from someone closer to the build and release process would be appreciated. thanks! -chris > Add AWS Kinesis streaming support > - > > Key: SPARK-1981 > URL: https://issues.apache.org/jira/browse/SPARK-1981 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Chris Fregly >Assignee: Chris Fregly > > Add AWS Kinesis support to Spark Streaming. > Initial discussion occurred here: https://github.com/apache/spark/pull/223 > I discussed this with Parviz from AWS recently and we agreed that I would > take this over. > Look for a new PR that takes into account all the feedback from the earlier > PR including spark-1.0-compliant implementation, AWS-license-aware build > support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067696#comment-14067696 ] Chris Fregly commented on SPARK-1981: - [~pwendell] is there anything i need to do within the spark_ec2 scripts to make sure kinesis is built and/or enabled when EC2 instances are created? i want to make sure i'm covering all the bases. > Add AWS Kinesis streaming support > - > > Key: SPARK-1981 > URL: https://issues.apache.org/jira/browse/SPARK-1981 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Chris Fregly >Assignee: Chris Fregly > > Add AWS Kinesis support to Spark Streaming. > Initial discussion occurred here: https://github.com/apache/spark/pull/223 > I discussed this with Parviz from AWS recently and we agreed that I would > take this over. > Look for a new PR that takes into account all the feedback from the earlier > PR including spark-1.0-compliant implementation, AWS-license-aware build > support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063247#comment-14063247 ] Chris Fregly commented on SPARK-1981: - PR: https://github.com/apache/spark/pull/1434 > Add AWS Kinesis streaming support > - > > Key: SPARK-1981 > URL: https://issues.apache.org/jira/browse/SPARK-1981 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Chris Fregly >Assignee: Chris Fregly > > Add AWS Kinesis support to Spark Streaming. > Initial discussion occurred here: https://github.com/apache/spark/pull/223 > I discussed this with Parviz from AWS recently and we agreed that I would > take this over. > Look for a new PR that takes into account all the feedback from the earlier > PR including spark-1.0-compliant implementation, AWS-license-aware build > support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061007#comment-14061007 ] Chris Fregly commented on SPARK-1981: - quick update: i completed all code, examples, tests, build, and documentation changes this weekend. everything looks good. however, when i went to merge last night, i noticed this PR: https://github.com/apache/spark/pull/772 this changes the underlying maven and sbt builds a bit - for the better, of course! reverting my build changes and adapting to the new build structure are the last steps, which i plan to tackle today. almost there! > Add AWS Kinesis streaming support > - > > Key: SPARK-1981 > URL: https://issues.apache.org/jira/browse/SPARK-1981 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Chris Fregly >Assignee: Chris Fregly > > Add AWS Kinesis support to Spark Streaming. > Initial discussion occurred here: https://github.com/apache/spark/pull/223 > I discussed this with Parviz from AWS recently and we agreed that I would > take this over. > Look for a new PR that takes into account all the feedback from the earlier > PR including spark-1.0-compliant implementation, AWS-license-aware build > support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057847#comment-14057847 ] Chris Fregly commented on SPARK-1981: - hey guys- i'm in the final phases of cleanup. i refactored quite a bit of the original code to make things more testable - and easier to understand. oh, and i did, indeed, choose the optional-module route. we'll address the additional complexity through documentation. that's what i'm working on right now, actually. hoping to submit the PR by tomorrow or this weekend at the very latest. the goal is to get this into the 1.1 release, which has a timeline outlined here: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage thanks! -chris > Add AWS Kinesis streaming support > - > > Key: SPARK-1981 > URL: https://issues.apache.org/jira/browse/SPARK-1981 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Chris Fregly >Assignee: Chris Fregly > > Add AWS Kinesis support to Spark Streaming. > Initial discussion occurred here: https://github.com/apache/spark/pull/223 > I discussed this with Parviz from AWS recently and we agreed that I would > take this over. > Look for a new PR that takes into account all the feedback from the earlier > PR including spark-1.0-compliant implementation, AWS-license-aware build > support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053037#comment-14053037 ] Chris Fregly commented on SPARK-1981: - [~matei] [~pwendell] i'm in the process of making the Kinesis Streaming component an optional module similar to ganglia per https://issues.apache.org/jira/browse/LEGAL-198 unlike the ganglia component, however, this component has tests and examples similar to the other streaming implementations such as Kafka and Flume. these other implementations have their tests and examples in external/ and examples/, respectively. if i understand correctly, i need to put the kinesis streaming code, tests, and examples all within extras/, correct? this will cause a bit of confusion for people searching the examples/ source, but - unless i'm missing something - this is the best we can do given the current build scripts and directory structure. is this the correct approach? the other option is to stick with the base AWS Java SDK, which is under the Apache 2.0 license (https://github.com/aws/aws-sdk-java/blob/master/LICENSE.txt). we'd lose some of the convenience goodies that the Kinesis Client Library gives us like worker load balancing, shard autoscaling, checkpointing, etc., but it would simplify the build. definitely not optimal, but throwing it out as an option. thoughts? > Add AWS Kinesis streaming support > - > > Key: SPARK-1981 > URL: https://issues.apache.org/jira/browse/SPARK-1981 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Chris Fregly >Assignee: Chris Fregly > > Add AWS Kinesis support to Spark Streaming. > Initial discussion occurred here: https://github.com/apache/spark/pull/223 > I discussed this with Parviz from AWS recently and we agreed that I would > take this over. > Look for a new PR that takes into account all the feedback from the earlier > PR including spark-1.0-compliant implementation, AWS-license-aware build > support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049484#comment-14049484 ] Chris Fregly commented on SPARK-1981: - hey jonathan! i was just talking to the databricks guys about this at the spark summit yesterday. it's my top priority after the summit ends (tomorrow). my goal is to get a PR submitted by this weekend. the code is written. i just need to do some cleanup. [~pwendell] can you assign this jira to me? i don't have permission, it appears. thanks! -chris > Add AWS Kinesis streaming support > - > > Key: SPARK-1981 > URL: https://issues.apache.org/jira/browse/SPARK-1981 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Chris Fregly > > Add AWS Kinesis support to Spark Streaming. > Initial discussion occurred here: https://github.com/apache/spark/pull/223 > I discussed this with Parviz from AWS recently and we agreed that I would > take this over. > Look for a new PR that takes into account all the feedback from the earlier > PR including spark-1.0-compliant implementation, AWS-license-aware build > support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1981) Add AWS Kinesis streaming support
Chris Fregly created SPARK-1981: --- Summary: Add AWS Kinesis streaming support Key: SPARK-1981 URL: https://issues.apache.org/jira/browse/SPARK-1981 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Chris Fregly Add AWS Kinesis support to Spark Streaming. Initial discussion occurred here: https://github.com/apache/spark/pull/223 I discussed this with Parviz from AWS recently and we agreed that I would take this over. Look for a new PR that takes into account all the feedback from the earlier PR including spark-1.0-compliant implementation, AWS-license-aware build support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)