[jira] [Commented] (SPARK-9492) LogisticRegression in R should provide model statistics
[ https://issues.apache.org/jira/browse/SPARK-9492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977134#comment-14977134 ]

Bilind Hajer commented on SPARK-9492:
--------------------------------------

So is there no other way to get coefficients from a PipelineModel of family binomial?

> LogisticRegression in R should provide model statistics
> --------------------------------------------------------
>
>                 Key: SPARK-9492
>                 URL: https://issues.apache.org/jira/browse/SPARK-9492
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, R
>            Reporter: Eric Liang
>
> Like ml LinearRegression, LogisticRegression should provide a training
> summary including feature names and their coefficients.
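For reference, a minimal sketch of the workflow this sub-task targets, assuming the SparkR glm() API of the 1.5.x/1.6.x line (on 1.5.x, summary() only covered gaussian-family models; exposing statistics for binomial models is what this issue tracks). The DataFrame `training` and its column names are hypothetical:

# Fit a binomial-family model on a SparkR DataFrame `training`
# that has a 0/1 label column and numeric feature columns.
model <- glm(label ~ feature1 + feature2, data = training, family = "binomial")

# Once this sub-task landed, summary() exposes the training statistics,
# including feature names and their fitted coefficients:
stats <- summary(model)
stats$coefficients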
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968596#comment-14968596 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

> df.recipes <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra",
+                       keyspace = "datahub", table = "person_recipes")
> someRecords <- head(df.recipes)
> mapField <- someRecords$recipes[[1]]
> ls(mapField)
 [1] "1000"   "1100"   "12000"  "18000"  "2000"   "22000"  "22074"  "24000"
 [9] "28000"  "3000"   "33000"  "44000"  "45000"  "47000"  "48000"  "49000"
[17] "5000"   "51000"  "53000"  "55000"  "56000"  "57000"  "57076"  "6"
[25] "63000"  "64000"  "65000"  "66000"  "67000"  "73000"  "75000"  "79000"
[33] "8"      "82000"  "83000"  "84000"  "87000"  "89000"  "9"      "999000"
> mapField[["1000"]]
[1] 0

Success! Thanks, guys. R environment data types are pretty cool; I didn't know they existed.
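Since each map cell comes back as an R environment, here is a minimal sketch for turning one into an ordinary named list, using the mapField from the transcript above:

# Environments are unordered key/value stores; mget() pulls every key at once.
asList <- mget(ls(mapField), envir = mapField)
# as.list() on an environment performs the same conversion.
asList <- as.list(mapField)
asList[["1000"]]   # same lookup as mapField[["1000"]]; returns 0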
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967979#comment-14967979 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

OK, I got the master version of Spark built, and I am no longer getting the "Error in as.data.frame.default(x[[i]], optional = TRUE)" when reading a data frame from a Cassandra column family that contains collection data types. But, for example, for a map it is reading the field as something like an R environment.
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967903#comment-14967903 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

I built Spark using mvn from the current master on GitHub. The Scala build succeeds, but when I open a sparkR shell it seems I just have access to plain R, and not SparkR. Any ideas?
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968079#comment-14968079 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

Well, this would be a map datatype coming from Cassandra, and it should convert to the corresponding R datatype when read in SparkR. I do not think R has a map datatype?
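For what it's worth, base R has no dedicated map type; named lists and environments are the usual stand-ins, and an environment is what SparkR hands back here. A minimal illustration in plain R:

m <- new.env()     # environments give hash-map-like key/value semantics
m[["1000"]] <- 0   # insert a key/value pair
m[["2000"]] <- 5
m[["1000"]]        # lookup; returns 0
ls(m)              # enumerate the keys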
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965391#comment-14965391 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

Never mind, I answered my own question. I will test on master and let you guys know. Thanks.
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965320#comment-14965320 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

Sorry about that, Sean Owen. So I'm assuming this fix is in the master branch, for Spark 1.5.2?
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965363#comment-14965363 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

How can I get access to the master branch? Would I be able to clone the current repo and test on my local machine?
[jira] [Created] (SPARK-11190) SparkR support for cassandra collection types.
Bilind Hajer created SPARK-11190:
------------------------------------

             Summary: SparkR support for cassandra collection types.
                 Key: SPARK-11190
                 URL: https://issues.apache.org/jira/browse/SPARK-11190
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.5.1
         Environment: SparkR Version: 1.5.1
                      Cassandra Version: 2.1.6
                      R Version: 3.2.2
                      Cassandra Connector version: 1.5.0-M2
            Reporter: Bilind Hajer
             Fix For: 1.5.2

I want to create a data frame from a Cassandra keyspace and column family in sparkR. I am able to create data frames from tables which do not include any Cassandra collection datatypes, such as Map, Set and List. But many of the schemas that I need data from do include these collection data types.

Here is my local environment:

SparkR Version: 1.5.1
Cassandra Version: 2.1.6
R Version: 3.2.2
Cassandra Connector version: 1.5.0-M2

To test this issue, I did the following iterative process.

sudo ./sparkR --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf spark.cassandra.connection.host=127.0.0.1

Running sparkR with this command gives me access to the spark-cassandra-connector package I need, and connects me to my local Cassandra server (which is up and running while I run this code in the sparkR shell).

CREATE TABLE test_table (
  column_1  int,
  column_2  text,
  column_3  float,
  column_4  uuid,
  column_5  timestamp,
  column_6  boolean,
  column_7  timeuuid,
  column_8  bigint,
  column_9  blob,
  column_10 ascii,
  column_11 decimal,
  column_12 double,
  column_13 inet,
  column_14 varchar,
  column_15 varint,
  PRIMARY KEY( ( column_1, column_2 ) )
);

All of the above data types are supported. I insert dummy data after creating this test schema. For example, now in my sparkR shell, I run the following code.

df.test <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", keyspace = "datahub", table = "test_table")

assigns with no errors, then,

> schema(df.test)
StructType
|-name = "column_1", type = "IntegerType", nullable = TRUE
|-name = "column_2", type = "StringType", nullable = TRUE
|-name = "column_10", type = "StringType", nullable = TRUE
|-name = "column_11", type = "DecimalType(38,18)", nullable = TRUE
|-name = "column_12", type = "DoubleType", nullable = TRUE
|-name = "column_13", type = "InetAddressType", nullable = TRUE
|-name = "column_14", type = "StringType", nullable = TRUE
|-name = "column_15", type = "DecimalType(38,0)", nullable = TRUE
|-name = "column_3", type = "FloatType", nullable = TRUE
|-name = "column_4", type = "UUIDType", nullable = TRUE
|-name = "column_5", type = "TimestampType", nullable = TRUE
|-name = "column_6", type = "BooleanType", nullable = TRUE
|-name = "column_7", type = "UUIDType", nullable = TRUE
|-name = "column_8", type = "LongType", nullable = TRUE
|-name = "column_9", type = "BinaryType", nullable = TRUE

Schema is correct.

> class(df.test)
[1] "DataFrame"
attr(,"package")
[1] "SparkR"

df.test is clearly defined to be a DataFrame object.

> head(df.test)
  column_1 column_2 column_10 column_11 column_12 column_13 column_14 column_15
1        1    hello        NA        NA        NA        NA        NA        NA
  column_3 column_4 column_5 column_6 column_7 column_8 column_9
1      3.4       NA       NA       NA       NA       NA       NA

sparkR is reading from the column family correctly, but now let's add a collection data type to the schema. Now I will drop that test_table, and recreate the table with an extra column of a map data type:

CREATE TABLE test_table (
  column_1  int,
  column_2  text,
  column_3  float,
  column_4  uuid,
  column_5  timestamp,
  column_6  boolean,
  column_7  timeuuid,
  column_8  bigint,
  column_9  blob,
  column_10 ascii,
  column_11 decimal,
  column_12 double,
  column_13 inet,
  column_14 varchar,
  column_15 varint,
  column_16 map<...>,
  PRIMARY KEY( ( column_1, column_2 ) )
);

After inserting dummy data into the new test schema,

> df.test <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", keyspace = "datahub", table = "test_table")
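For completeness, a hedged sketch of what the fixed behavior looks like, assuming the test_table above and a build where the fix has landed: each map cell is read back as an R environment, following the same pattern the person_recipes transcript in this thread shows.

df.test <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra",
                   keyspace = "datahub", table = "test_table")
someRecords <- head(df.test)           # pull the first rows to the driver
mapCell <- someRecords$column_16[[1]]  # one map cell, an R environment
ls(mapCell)                            # the map's keys
mapCell[[ls(mapCell)[1]]]              # value stored under the first key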