Re: [Structured Spark Streaming] How does the Cassandra connector readStream deal with deleted records
The connector issues CQL requests through the Java driver under the hood, which means it responds to a changing database the way any normal application would: each read reflects the state of the table at the moment the request runs. It also means a retry may return a different set of data than the original request if the underlying database changed in between.
Re: [Structured Spark Streaming] How does the Cassandra connector readStream deal with deleted records
I'm not sure how it is implemented, but in general I wouldn't expect such behavior from connectors that read from non-streaming storage. The query result may depend on "when" the records are fetched.

If you need your query to reflect those changes, you'll probably want to find a way to retrieve "change logs" from your external storage (or have your system/product produce change logs if the storage doesn't support them) and adapt your query to consume them. The keyword to search for further reading is "Change Data Capture".

Otherwise, you can apply the traditional approach: run a batch query periodically and replace the entire output.
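A minimal Scala sketch of that periodic-batch fallback, assuming the connector's DataFrame source is on the classpath; the table, keyspace, and output path are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("periodic-snapshot").getOrCreate()

// Re-read the current state of the Cassandra table as a plain batch query.
val snapshot = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "hello", "keyspace" -> "db"))
  .load()

// Overwrite the previous output in full, so deletes and updates in
// Cassandra are reflected on the next scheduled run.
snapshot.write.mode("overwrite").parquet("/tmp/hello_snapshot")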
[Structured Spark Streaming] How does the Cassandra connector readStream deal with deleted records
Hello everyone,

I was wondering how the Cassandra Spark connector deals with deleted/updated records during a readStream operation. If a record was already fetched into Spark memory and it then gets updated or deleted in the database, does that get reflected in a streaming join?

Thanks,
Rahul
Re: Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameReader.html#json(org.apache.spark.rdd.RDD)

You can pass an RDD to spark.read.json (spark here is a SparkSession). Also, it works completely fine with a smaller dataset in the same table, but with 1B records it takes forever, and more importantly the network throughput is only 2.2 KB/s, which is far too low; it should be somewhere in MB/s.
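Worth noting: spark.read.json on an RDD has to make a first pass over the data just to infer the schema, which effectively doubles the work. A hedged Scala sketch of an alternative on Spark 2.1+, supplying the schema up front with from_json so the JSON is parsed in a single pass; the field names and types below are assumptions:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._  // already in scope in spark-shell

// Hypothetical schema for the JSON documents stored in the "test" column.
val jsonSchema = new StructType()
  .add("id", StringType)
  .add("value", DoubleType)

val df = spark.sql("SELECT test FROM hello")
// Parse each JSON string in place; no separate schema-inference scan.
val parsed = df.select(from_json($"test", jsonSchema).as("data")).select("data.*")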
Re: Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.
Hi there,

spark.read.json usually takes a filesystem path (usually HDFS) pointing at a file with one JSON document per line. See also http://spark.apache.org/docs/latest/sql-programming-guide.html

Hence, in your case

val df4 = spark.read.json(rdd) // This line takes forever

seems wrong. I guess you might want to first store the rdd as a text file on HDFS and then read it using spark.read.json.

Cheers,
Anastasios
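A literal two-line version of that suggestion, continuing from the rdd in the original post; the HDFS path is an assumption:

// Materialize the JSON strings once, then let read.json work from files.
rdd.saveAsTextFile("hdfs:///tmp/hello_json")
val df4 = spark.read.json("hdfs:///tmp/hello_json")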
Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.
(Cross-posted on Stack Overflow: http://stackoverflow.com/questions/40797231/apache-spark-or-spark-cassandra-connector-doesnt-look-like-it-is-reading-multipl)

Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.

Here is my code using spark-shell:

import org.apache.spark.sql._
import org.apache.spark.sql.types.StringType
spark.sql("""CREATE TEMPORARY VIEW hello USING org.apache.spark.sql.cassandra OPTIONS (table "hello", keyspace "db", cluster "Test Cluster", pushdown "true")""")
val df = spark.sql("SELECT test from hello")
val df2 = df.select(df("test").cast(StringType).as("test"))
val rdd = df2.rdd.map { case Row(j: String) => j }
val df4 = spark.read.json(rdd) // This line takes forever

I have about 700 million rows, each row about 1KB, and the line

val df4 = spark.read.json(rdd)

takes forever. I get the following output after 1hr 30 mins:

Stage 1:==> (4866 + 2) / 25256]

so at this rate it will probably take days.

I measured the network throughput of the Spark worker nodes using iftop and it is about 2.2 KB/s (kilobytes per second), which is too low. That tells me it is not reading partitions in parallel, or at the very least it is not reading a good chunk of data, else it would be in MB/s. Any ideas on how to fix it?
Re: Python - Spark Cassandra Connector on DC/OS
Sorry, a correction to the versions in my original message: Spark 2.0.0, not 2.0.1.
Python - Spark Cassandra Connector on DC/OS
Hello,

I've been getting pretty serious with DC/OS, which I guess could be described as a somewhat polished distribution of Mesos. I'm not sure how relevant DC/OS is to this problem.

I am using this pyspark program to test the Cassandra connection: http://bit.ly/2eWAfxm (github)

I can see that the df.printSchema() method is working OK, but the df.show() method is breaking with this error:

Traceback (most recent call last):
  File "/mnt/mesos/sandbox/squeeze.py", line 28, in <module>
    df.show()
  File "/opt/spark/dist/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 287, in show
  File "/opt/spark/dist/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/opt/spark/dist/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/spark/dist/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o33.showString.

Full output to stdout and stderr: http://bit.ly/2f80f9e (gist)

Versions:

Spark 2.0.1
Python 3.4.3 (default, Sep 14 2016, 12:36:27)
[cqlsh 5.0.1 | Cassandra 2.2.8 | CQL spec 3.3.1 | Native protocol v4]
DC/OS v1.8.4

Cheers,

Andrew

--
Otter Networks UG
http://otternetworks.de
Gotenstraße 17
10829 Berlin
Re: unresolved dependency: datastax#spark-cassandra-connector;2.0.0-s_2.11-M3-20-g75719df: not found
The "unresolved dependency" error is stating that the datastax dependency could not be located in the Maven repository. I believe that this should work if you change that portion of your command to the following. --packages com.datastax.spark:spark-cassandra-connector_2.10:2.0.0-M3 You can verify the available versions by searching Maven at http://search.maven.org. Thanks, Kevin On Wed, Sep 21, 2016 at 3:38 AM, muhammet pakyürek <mpa...@hotmail.com> wrote: > while i run the spark-shell as below > > spark-shell --jars '/home/ktuser/spark-cassandra- > connector/target/scala-2.11/root_2.11-2.0.0-M3-20-g75719df.jar' > --packages datastax:spark-cassandra-connector:2.0.0-s_2.11-M3-20-g75719df > --conf spark.cassandra.connection.host=localhost > > i get the error > unresolved dependency: datastax#spark-cassandra- > connector;2.0.0-s_2.11-M3-20-g75719df. > > > the second question even if i added > > libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % > "2.0.0-M3" > > to spark-cassandra-connector/sbt/sbt file jar files are > root_2.11-2.0.0-M3-20-g75719df > > > teh third question after build of connectpr scala 2.11 how do i integrate > it with pyspark? > >
unresolved dependency: datastax#spark-cassandra-connector;2.0.0-s_2.11-M3-20-g75719df: not found
While I run spark-shell as below

spark-shell --jars '/home/ktuser/spark-cassandra-connector/target/scala-2.11/root_2.11-2.0.0-M3-20-g75719df.jar' --packages datastax:spark-cassandra-connector:2.0.0-s_2.11-M3-20-g75719df --conf spark.cassandra.connection.host=localhost

I get the error

unresolved dependency: datastax#spark-cassandra-connector;2.0.0-s_2.11-M3-20-g75719df.

The second question: even though I added

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-M3"

to the spark-cassandra-connector/sbt/sbt file, the jar files are root_2.11-2.0.0-M3-20-g75719df.

The third question: after building the connector for Scala 2.11, how do I integrate it with pyspark?
Is Cassandra 3.7 compatible with the DataStax Spark Cassandra Connector 2.0?
Re: clear steps for installation of spark, cassandra and cassandra connector to run on spyder 2.3.7 using python 3.5 and anaconda 2.4 ipython 4.0
Spark has pretty extensive documentation; that should be your starting point. I do not use Cassandra much, but the Cassandra connector should be a Spark package, so look for the Spark Packages website. If I may say so, all the docs should be one or two Google searches away :)
clear steps for installation of spark, cassandra and cassandra connector to run on spyder 2.3.7 using python 3.5 and anaconda 2.4 ipython 4.0
Could you send me documents and links that satisfy all the above requirements: installation of Spark, Cassandra, and the Cassandra connector, to run on Spyder 2.3.7 using Python 3.5, Anaconda 2.4, and IPython 4.0.
Spark-Cassandra connector
Hi List,

I am trying to install the Spark-Cassandra connector through Maven or sbt, but neither works. Both of them try to connect to the Internet (which I do not have) to download certain files. Is there a way to install the files manually?

I downloaded spark-cassandra-connector_2.10-1.6.0.jar from the Maven repository, which matches the versions of Scala and Spark that I have ... but where do I put it?

BR
Joaquin
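Two offline options, as a hedged sketch. With Maven, mvn install:install-file -Dfile=spark-cassandra-connector_2.10-1.6.0.jar -DgroupId=com.datastax.spark -DartifactId=spark-cassandra-connector_2.10 -Dversion=1.6.0 -Dpackaging=jar should place the jar in the local repository so normal dependency resolution finds it without going online. With sbt, jars dropped into the project's lib/ directory are treated as unmanaged dependencies:

// build.sbt sketch: sbt picks up anything under <project>/lib with no
// network access; this line just makes the default location explicit.
unmanagedBase := baseDirectory.value / "lib"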
Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042
Hi Ted and Saurabh,

If I use --conf arguments with pyspark I am able to connect. Any idea how I can set these values programmatically? (I work on a notebook server and cannot easily reconfigure the server.)

This works:

extraPkgs="--packages com.databricks:spark-csv_2.11:1.3.0 \
--packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.10"

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $extraPkgs --conf spark.cassandra.connection.host=localhost --conf spark.cassandra.connection.port=9043 $*

df = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="json_timeseries", keyspace="notification")\
    .load()
df.printSchema()
df.show(truncate=False)

I have tried using sqlContext.setConf(), but it does not seem to have any effect:

#sqlContext.setConf("spark.cassandra.connection.host","localhost")
#sqlContext.setConf("spark.cassandra.connection.port","9043")
#sqlContext.setConf("connection.host","localhost")
#sqlContext.setConf("connection.port","9043")
sqlContext.setConf("host","localhost")
sqlContext.setConf("port","9043")

Thanks

Andy
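For what it's worth, these are ordinary Spark properties, so one approach that reliably takes effect is setting them on the SparkConf before the context exists, rather than on an already-created SQLContext. A hedged Scala sketch of the idea; from a pyspark notebook the equivalent is passing the same two keys at launch with --conf, exactly as in the working example above:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumption: an SSH tunnel exposes Cassandra's native port locally on 9043.
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "localhost")
  .set("spark.cassandra.connection.port", "9043")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)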
Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042
From cassandra.yaml:

native_transport_port: 9042

FYI
Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042
Hi Andy,

I believe you need to set the host and port settings separately:

spark.cassandra.connection.host
spark.cassandra.connection.port

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-connection-parameters

Looking at the logs, it seems your port config is not being set and it's falling back to the default. Let me know if that helps.

Saurabh Bajaj
Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042
Hi Ted,

I believe by default Cassandra listens on 9042.

Andy
Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042
Have you contacted the spark-cassandra-connector related mailing list?

I wonder where the port 9042 came from.

Cheers
pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042
I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a Python notebook that reads a data frame from Cassandra.

I connect to Cassandra using an SSH tunnel running on port 9043. CQLSH works, however I cannot figure out how to configure my notebook. I have tried various hacks; any idea what I am doing wrong?

: java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

Thanks in advance

Andy

$ extraPkgs="--packages com.databricks:spark-csv_2.11:1.3.0 \
--packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.11"

$ export PYSPARK_PYTHON=python3
$ export PYSPARK_DRIVER_PYTHON=python3
$ IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $extraPkgs $*

In [15]:

sqlContext.setConf("spark.cassandra.connection.host","127.0.0.1:9043")
df = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="time_series", keyspace="notification")\
    .load()

df.printSchema()
df.show()

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input> in <module>()
      1 sqlContext.setConf("spark.cassandra.connection.host","localhost:9043")
----> 2 df = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="notification").load()
      3
      4 df.printSchema()
      5 df.show()

/Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    137             return self._df(self._jreader.load(path))
    138         else:
--> 139             return self._df(self._jreader.load())
    140
    141     @since(1.4)

/Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814
    815         for temp_arg in temp_args:

/Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/utils.py in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling o280.load.
: java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:162)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at com.datastax.spark.connector.rdd.partitioner.CassandraRDDPartitioner$.getTokenFactory(CassandraRDDPartitioner.scala:184)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:267)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:57)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java
Re: metrics not reported by spark-cassandra-connector
Hi Yin,

Thanks for your reply. I didn't realize there is a specific mailing list for the spark-cassandra-connector. I will ask there.

Thanks!
-Sa
Re: metrics not reported by spark-cassandra-connector
Hi Sa,

Have you asked on the spark-cassandra-connector mailing list? It seems you would get a better response there.

Cheers
metrics not reported by spark-cassandra-connector
Hi there,

I am trying to enable the metrics collected by the spark-cassandra-connector, following the instructions here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/11_metrics.md

However, I am not able to see any metrics reported. I'm using spark-cassandra-connector_2.10:1.5.0 and Spark 1.5.1, and I am trying to send the metrics to StatsD. My metrics.properties is the following:

*.sink.statsd.class=org.apache.spark.metrics.sink.StatsDSink
*.sink.statsd.host=localhost
*.sink.statsd.port=18125
executor.source.cassandra-connector.class=org.apache.spark.metrics.CassandraConnectorSource
driver.source.cassandra-connector.class=org.apache.spark.metrics.CassandraConnectorSource

I'm able to see other metrics, e.g. DAGScheduler, but none from the CassandraConnectorSource. For example, I searched for "write-byte-meter" but didn't find it. I didn't see the metrics on the Spark UI either, and I didn't find any relevant error or info in the logs indicating that the CassandraConnectorSource was actually registered with the Spark metrics system.

Any pointers would be very much appreciated!

Thanks,
Sa
Re: [Cassandra-Connector] No Such Method Error despite correct versions
Doh - minutes after my question I saw the same report from a couple of days ago.

Indeed, using C* driver 3.0.0-rc1 seems to solve the issue.

Jan
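A hedged build.sbt sketch of that fix: pinning the driver to the version the connector expects, so the TableMetadata signature at runtime matches what Schema.getIndexMap was compiled against.

// Force a single cassandra-driver-core version across all transitives.
dependencyOverrides += "com.datastax.cassandra" % "cassandra-driver-core" % "3.0.0-rc1"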
[Cassandra-Connector] No Such Method Error despite correct versions
Hi,

I am using

Cassandra 2.1.5
Spark 1.5.2
Cassandra java-driver 3.0.0
Cassandra-Connector 1.5.0-RC1

all with Scala 2.11.7.

Nevertheless, I get the following error from my Spark job:

java.lang.NoSuchMethodError: com.datastax.driver.core.TableMetadata.getIndexes()Ljava/util/List;
at com.datastax.spark.connector.cql.Schema$.getIndexMap(Schema.scala:198)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchPartitionKey(Schema.scala:202)

I checked that the method is present in the version used - it is.

I also decompiled the class TableMetadata directly from the fat jar I am deploying. The method is present. The only thing I noticed is that the signature the decompiler shows is

public Collection getIndexes() {

and not List - but I guess that is just the recompilation.

I am pretty lost regarding what steps to take to get rid of this error. Can anyone provide a clue?

Jan
spark-cassandra-connector BulkOutputWriter
Hello all,

I looked through the Cassandra Spark integration (https://github.com/datastax/spark-cassandra-connector) and couldn't find any usages of the BulkOutputWriter (http://www.datastax.com/dev/blog/bulk-loading) - an awesome tool for creating local sstables, which can later be uploaded to a Cassandra cluster. It seems (sorry if I'm wrong) that it just uses batched insert statements. So, my question is: does anybody know if there are any plans to support bulk loading?

Cheers,
Alex.
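For reference, a hedged Scala sketch of the approach that blog post describes, driving Cassandra's CQLSSTableWriter directly (from the org.apache.cassandra:cassandra-all artifact) outside the connector; the keyspace, table, and directory are hypothetical, and the output directory must already exist:

import org.apache.cassandra.io.sstable.CQLSSTableWriter

val schema = "CREATE TABLE ks.events (id text PRIMARY KEY, payload text)"
val insert = "INSERT INTO ks.events (id, payload) VALUES (?, ?)"

// Writes sstables locally; they can then be streamed in with sstableloader.
val writer = CQLSSTableWriter.builder()
  .inDirectory("/tmp/sstables/ks/events")
  .forTable(schema)
  .using(insert)
  .build()

writer.addRow("42", "hello")
writer.close()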
RE: spark-cassandra-connector BulkOutputWriter
Alex - I suggest posting this question on the Spark Cassandra Connector mailing list. The SCC developers are pretty responsive.

Mohammed
Author: Big Data Analytics with Spark <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
Re: Spark 1.5.2 compatible spark-cassandra-connector
Hi Vivek,

I have tried the 1.5.x spark-cassandra-connector and indeed encountered some classpath issues, mainly around the Guava dependency. I believe that can be solved with some Maven configuration, but I have not tried that yet.

Best,
Sun.

fightf...@163.com
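A hedged sketch of the kind of build tweak being alluded to, shown for sbt; the exact Guava version is an assumption and would need to match what the Cassandra Java driver in use expects:

// Force one Guava version across the assembly so the Cassandra driver does
// not pick up an older Guava from Spark/Hadoop transitive dependencies.
dependencyOverrides += "com.google.guava" % "guava" % "16.0.1"  // hypothetical version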
Re: Spark 1.5.2 compatible spark-cassandra-connector
2.10-1.5.0-M3 & Spark 1.5.2 work for me. The jar is built by sbt-assembly. Just for reference.
RE: Spark 1.5.2 compatible spark-cassandra-connector
Thank you mwy and Sun for your responses. Yes, the basic things are working for me using this connector (the Guava issue was encountered earlier, but with proper exclusion of the old version we resolved it). The current issue is a strange one: we have a kafka-spark-cassandra streaming job. All the jobs work fine in our local cluster (local lab) using Spark 1.3.0. We are trying the same setup on Google Cloud Platform; on 1.3.0 the jobs come up but do not seem to be processing the Kafka messages. Using Spark 1.5.2 with the 1.5.0-M3 connector and Kafka 0.8.2.2, I am able to make most of the jobs work, but one of them processes only 1 out of 50 messages from Kafka.

We wrote a test program (in Scala) to parse data from the Kafka stream; it works fine and always receives the messages. It connects to Cassandra, gets some data, and prints the input data up to that point, but when I enable the join/map/reduce logic it starts to miss messages (not even printing the lines that worked before). Please find the lines below; once I add the commented block to the job, it does not print any incoming data, while I am able to print everything up to that point. Any ideas how to debug this?

val cqlOfferSummaryRdd = ssc.sc.cql3Cassandra[OfferSummary](casOfferSummary)
  .map(summary => (summary.listingId, (summary.offerId, summary.redeem, summary.viewed, summary.reserved)))

/**
//RDD -> ((listingId, searchId), Iterable(offerId, redeemCount, viewCount))
val directListingOfferSummary = directViews.transform(Rdd => { Rdd.join(cqlOfferSummaryRdd,10)})
  //RDD -> ((listingId), (Direct, (offerId, redeemCount, viewCount)))
  .map(rdd => ((rdd._1, rdd._2._1.searchId, rdd._2._2._1), (rdd._2._2._2, rdd._2._2._3, rdd._2._2._4)))
  //RDD -> ((listingId, searchId, offerId), (redeemCount, viewCount))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
  .map(rdd => ((rdd._1._1, rdd._1._2), (rdd._1._3, rdd._2._1, rdd._2._2, rdd._2._3))).groupByKey(10)
//RDD -> ((listingId, searchId), Iterable(offerId, redeemCount, viewCount))
directListingOfferSummary.print()
**/

Regards,
Vivek
Spark 1.5.2 compatible spark-cassandra-connector
All,

What is the compatible spark-cassandra-connector for Spark 1.5.2? I can only find the latest connector version, spark-cassandra-connector_2.10-1.5.0-M3, which has a dependency on Spark 1.5.1. Can we use the same for 1.5.2? Do any classpath issues need to be handled, or any jars excluded, while packaging the application jar?

http://central.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.10/1.5.0-M3/spark-cassandra-connector_2.10-1.5.0-M3.pom

Regards,
Vivek M
Re: error in spark cassandra connector
Mind providing a bit more detail?

- release of Spark
- version of the Cassandra connector
- how the job was submitted
- complete stack trace

Thanks
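For what it's worth, a NoClassDefFoundError on a connector class usually just means the connector jar is not on the runtime classpath. A hedged build.sbt sketch of one common remedy, bundling the connector into the application jar with the sbt-assembly plugin; the version shown is an assumption:

// Ship the connector inside the application fat jar (`sbt assembly`), so
// spark-submit needs no extra --jars/--packages flags for it.
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M3"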
error in spark cassandra connector
java.lang.NoClassDefFoundError: com/datastax/spark/connector/rdd/CassandraTableScanRDD
Spark-Cassandra Connector Error
Hello,

I receive the following warning when I attempt to connect to a Cassandra keyspace and table:

WARN NettyUtil: "Found Netty's native epoll transport, but not running on linux-based operating system. Using NIO instead"

The full details and log can be viewed here: http://stackoverflow.com/questions/33896937/spark-connector-error-warn-nettyutil-found-nettys-native-epoll-transport-but

Thank you for your help and for your support.
Re: Spark-Cassandra-connector
Have you considered asking this question on https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user ?

Cheers
Spark-Cassandra-connector
Hi All, I need to write an RDD to Cassandra using the Spark Cassandra Connector from DataStax. My application is using Yarn.

Some basic questions:
1. Will a call to saveToCassandra(...) use the same connection object across all tasks in a given executor? I mean, is there one connection object per executor that is shared between tasks?
2. If the answer to the above is yes, is there a way to create a connection pool for each executor, so that multiple tasks can dump data to Cassandra in parallel?

Regards, Samya -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Cassandra-connector-tp24378.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
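For reference, a sketch of the pattern usually suggested for this: the connector's CassandraConnector is serializable, so it can be closed over, and borrowing one session per partition (rather than per row or per task) amortizes connection setup. The table and RDD shapes here are hypothetical:

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// a sketch, assuming an RDD of (k, v) pairs and an existing table ks.kv
def writeViaSessions(sc: SparkContext, rdd: RDD[(String, String)]): Unit = {
  val connector = CassandraConnector(sc.getConf) // serializable handle
  rdd.foreachPartition { partition =>
    // one session borrowed per partition, not per element
    connector.withSessionDo { session =>
      partition.foreach { case (k, v) =>
        session.execute("INSERT INTO ks.kv (k, v) VALUES (?, ?)", k, v)
      }
    }
  }
}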
Re: Spark Cassandra Connector issue
HI All, I have tried the commands as mentioned below, but it still errors:

dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars /home/missingmerch/postgresql-9.4-1201.jdbc41.jar,/home/missingmerch/dse.jar,/home/missingmerch/spark-cassandra-connector-java_2.10-1.1.1.jar /home/missingmerch/etl-0.0.1-SNAPSHOT.jar

dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars /home/missingmerch/postgresql-9.4-1201.jdbc41.jar,/home/missingmerch/dse.jar,/home/missingmerch/spark-cassandra-connector-java_2.10-1.1.1.jar,/home/missingmerch/etl-0.0.1-SNAPSHOT.jar

I understand the only problem is with the way I provide the list of jar files in the command. If anybody using DataStax Enterprise could please provide their inputs to get this issue resolved. Thanks for your support. Satish Chandra

On Mon, Aug 10, 2015 at 7:16 PM, Dean Wampler deanwamp...@gmail.com wrote: I don't know if DSE changed spark-submit, but you have to use a comma-separated list of jars to --jars. It probably looked for HelloWorld in the second one, the dse.jar file. I also removed the extra //. Or put file: in front of them so they are proper URLs. Note the snapshot jar isn't in the --jars list. I assume that's where HelloWorld is found. Confusing, yes it is... dean ...
Spark Cassandra Connector issue
HI All, Please help me to fix the Spark Cassandra Connector issue; find the details below.

Command:

dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar ///home/missingmerch/etl-0.0.1-SNAPSHOT.jar

Error:

WARN 2015-08-10 06:33:35 org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Exception in thread "main" java.lang.NoSuchMethodError: com.datastax.spark.connector.package$.toRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;)Lcom/datastax/spark/connector/RDDFunctions;
at HelloWorld$.main(HelloWorld.scala:29)
at HelloWorld.main(HelloWorld.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Code:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.JdbcRDD
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.bdp.spark.DseSparkConfHelper._
import java.sql.{Connection, DriverManager, ResultSet, PreparedStatement, SQLException, Statement}

object HelloWorld {
  def main(args: Array[String]) {
    def createSparkContext() = {
      val myJar = getClass.getProtectionDomain.getCodeSource.getLocation.getPath
      val conf = new SparkConf().set("spark.cassandra.connection.host", "10.246.43.15")
        .setAppName("First Spark App")
        .setMaster("local")
        .setJars(Array(myJar))
        .set("cassandra.username", "username")
        .set("cassandra.password", "password")
        .forDse
      new SparkContext(conf)
    }

    val sc = createSparkContext()
    val user = "hkonak0"
    val pass = "Winter18"
    Class.forName("org.postgresql.Driver").newInstance
    val url = "jdbc:postgresql://gptester:5432/db_test"
    val myRDD27 = new JdbcRDD(sc, () => DriverManager.getConnection(url, user, pass),
      "select * from wmax_vmax.arm_typ_txt LIMIT ? OFFSET ?", 5, 0, 1,
      (r: ResultSet) => (r.getInt("alarm_type_code"), r.getString("language_code"), r.getString("alrm_type_cd_desc")))
    myRDD27.saveToCassandra("keyspace", "arm_typ_txt", SomeColumns("alarm_type_code", "language_code", "alrm_type_cd_desc"))
    println(myRDD27.count())
    println(myRDD27.first)
    sc.stop()
    sys.exit()
  }
}

POM XML:

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.5</version>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>3.8.1</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>com.datastax.dse</groupId>
    <artifactId>dse</artifactId>
    <version>4.7.2</version>
    <scope>system</scope>
    <systemPath>C:\workspace\etl\lib\dse.jar</systemPath>
  </dependency>
  <dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector-java_2.10</artifactId>
    <version>1.1.1</version>
  </dependency>
</dependencies>

Please let me know if any further details are required to analyze the issue. Regards, Satish Chandra
Re: Spark Cassandra Connector issue
Add the other Cassandra dependencies (dse.jar, spark-cassandra-connector-java_2.10) to your --jars argument on the command line.

Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com

On Mon, Aug 10, 2015 at 7:44 AM, satish chandra j jsatishchan...@gmail.com wrote: HI All, Please help me to fix the Spark Cassandra Connector issue; find the details below. ...
Re: Spark Cassandra Connector issue
I don't know if DSE changed spark-submit, but you have to use a comma-separated list of jars to --jars. It probably looked for HelloWorld in the second one, the dse.jar file. Do this:

dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars /home/missingmerch/postgresql-9.4-1201.jdbc41.jar,/home/missingmerch/dse.jar,/home/missingmerch/spark-cassandra-connector-java_2.10-1.1.1.jar /home/missingmerch/etl-0.0.1-SNAPSHOT.jar

I also removed the extra //. Or put file: in front of them so they are proper URLs. Note the snapshot jar isn't in the --jars list. I assume that's where HelloWorld is found. Confusing, yes it is... dean

Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com

On Mon, Aug 10, 2015 at 8:23 AM, satish chandra j jsatishchan...@gmail.com wrote: Hi, Thanks for the quick input; now I am getting a class not found error. ...
Re: Spark Cassandra Connector issue
Hi, Thanks for the quick input; now I am getting a class not found error.

Command:

dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar ///home/missingmerch/dse.jar ///home/missingmerch/spark-cassandra-connector-java_2.10-1.1.1.jar ///home/missingmerch/etl-0.0.1-SNAPSHOT.jar

Error:

java.lang.ClassNotFoundException: HelloWorld
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:342)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Previously I could fix the issue by changing the order of the arguments passed on the DSE command line, but now I am not sure why the issue is back. Please let me know if I am still missing anything in my command as mentioned above (as insisted, I have added dse.jar and spark-cassandra-connector-java_2.10-1.1.1.jar). Thanks for support, Satish Chandra

On Mon, Aug 10, 2015 at 6:19 PM, Dean Wampler deanwamp...@gmail.com wrote: Add the other Cassandra dependencies (dse.jar, spark-cassandra-connector-java_2.10) to your --jars argument on the command line. ...
Spark-Cassandra connector DataFrame
Hi, I would like to get recommendations on using the Spark-Cassandra connector DataFrame feature. I was trying to save a DataFrame containing 8 million rows to Cassandra through the Spark-Cassandra connector. Based on the Spark log, this single action took about 60 minutes to complete; I think that is a very slow process. Are there some configurations I need to check when using this DataFrame feature? From the Spark log, I can see that saving the DataFrame to Cassandra was performed in 200 small steps, and the Cassandra database was connected and disconnected 4 times during the 60 minutes; this number matches the number of nodes the Cassandra cluster has. I understand this DataFrame feature is experimental, and I am new to both Spark and Cassandra. Any suggestions are much appreciated. Thanks, Simon Wang
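For what it's worth, the connector exposes write-tuning settings in its reference configuration; a hedged Scala sketch follows (the option names are from connector documentation of roughly that era and should be verified against the installed version; host, keyspace, and table names are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.sql.DataFrame

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
  // assumed knob names from the connector's reference config; verify per version
  .set("spark.cassandra.output.batch.size.rows", "auto")
  .set("spark.cassandra.output.concurrent.writes", "8")

val df: DataFrame = ??? // the 8-million-row DataFrame from the question
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "my_table", "keyspace" -> "my_ks"))
  .mode("append")
  .save()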
Re: Spark Cassandra connector number of Tasks
Looking for help with this. Thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Cassandra-connector-number-of-Tasks-tp22820p22839.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark Cassandra connector number of Tasks
I am using the Spark Cassandra connector to work with a table with 3 million records, using the .where() API to work with only certain rows in this table. The where clause filters the data down to 10,000 rows:

CassandraJavaUtil.javaFunctions(sparkContext)
  .cassandraTable(KEY_SPACE, MY_TABLE, CassandraJavaUtil.mapRowTo(MyClass.class))
  .where(cqlDataFilter, cqlFilterParams)

I am also using the parameter spark.cassandra.input.split.size=1000. As this job is processed by the Spark cluster, it creates 3,000 partitions instead of 10, so 3,000 tasks are executed on the Spark cluster. As the data in our table grows to 30 million rows, this will create 30,000 tasks instead of 10. Is there a better way to process these 10,000 records with 10 tasks? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Cassandra-connector-number-of-Tasks-tp22820.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
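One workaround worth sketching: the split size drives the partition count before the where() filter is applied, so the filtered RDD can be collapsed to the desired parallelism with coalesce. A hedged spark-shell sketch (keyspace, table, and filter are placeholders for the names in the question):

import com.datastax.spark.connector._

val filtered = sc.cassandraTable("key_space", "my_table")
  .where("postal_code = ?", "94110") // placeholder for cqlDataFilter
// merge the many small connector partitions into 10 downstream tasks
val tenPartitions = filtered.coalesce(10)
println(tenPartitions.count())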
Spark Cassandra Connector
Hello, I am facing some difficulties installing the Cassandra Spark connector. Specifically, I am working with Cassandra 2.0.13 and Spark 1.2.1. I am trying to build (create the JAR for) the connection, but unfortunately I cannot see in which file I have to declare the dependency

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0-rc3"

in order to create the previous JAR version of the connector and not the default one (i.e. spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar). I am really new to working with sbt. Any guidance/help would be really appreciated. Thank you very much. DS -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Cassandra-Connector-tp22558.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
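The dependency line goes into build.sbt at the root of the project (or into a .scala build definition under project/). A minimal build.sbt sketch with the versions from the message; the Scala version is an assumption that must match the _2.10 artifacts:

name := "spark-cassandra-example"

scalaVersion := "2.10.4" // assumed; must line up with the _2.10 connector build

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0-rc3"
)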
Re: Using the DataStax Cassandra Connector from PySpark
Did you receive any response on this? I am trying to load HBase classes and getting the same error: py4j.protocol.Py4JError: Trying to call a package. This happens even though $HBASE_HOME/lib/* had already been added to compute-classpath.sh.

2014-10-21 16:02 GMT-07:00 Mike Sukmanowsky mike.sukmanow...@gmail.com: Hi there, I'm using Spark 1.1.0 and experimenting with trying to use the DataStax Cassandra Connector (https://github.com/datastax/spark-cassandra-connector) from within PySpark. As a baby step, I'm simply trying to validate that I have access to classes that I'd need via Py4J. ...
Spark Cassandra Connector proper usage
I'm looking to use Spark for some ETL, which will mostly consist of update statements (a column is a set that'll be appended to, so a simple insert is likely not going to work). As such, it seems like issuing CQL queries to import the data is the best option. Using the Spark Cassandra Connector, I see I can do this: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md#connecting-manually-to-cassandra

Now I don't want to open a session and close it for every row in the source (am I right in not wanting this? Usually, I have one session for the entire process, and keep using that in normal apps). However, it says that the connector is serializable, but the session is obviously not. So, wrapping the whole import inside a single withSessionDo seems like it'll cause problems. I was thinking of using something like this:

class CassandraStorage(conf: SparkConf) {
  val session = CassandraConnector(conf).openSession()
  def store(t: Thingy): Unit = {
    // session.execute cql goes here
  }
}

Is this a good approach? Do I need to worry about closing the session? Where / how best would I do that? Any pointers are appreciated. Thanks, Ashic.
RE: Spark Cassandra Connector proper usage
Hi Gerard, Thanks for the response. Here's the scenario:

The target Cassandra schema looks like this:

create table foo (
  id text primary key,
  bar int,
  things set<text>
)

The source in question is a SQL Server source providing the necessary data. The source goes over the same id multiple times, adding things to the things set each time. With inserts, it'll replace things with a new set of one element, instead of appending that item. As such, the query

update foo set things = things + ? where id=?

solves the problem. If I had to stick with saveToCassandra, I'd have to read in the existing row for each row, and then write it back. Since this is happening in parallel on multiple machines, that would likely cause discrepancies where a node will read and update to older values. Hence my question about session management in order to issue custom update queries. Thanks, Ashic.

Date: Thu, 23 Oct 2014 14:27:47 +0200 Subject: Re: Spark Cassandra Connector proper usage From: gerard.m...@gmail.com To: as...@live.com

Ashic, With the Spark-cassandra connector you would typically create an RDD from the source table, update what you need, filter out what you don't update and write it back to Cassandra. Kr, Gerard

On Oct 23, 2014 1:21 PM, Ashic Mahtab as...@live.com wrote: I'm looking to use Spark for some ETL, which will mostly consist of update statements (a column is a set that'll be appended to, so a simple insert is likely not going to work). As such, it seems like issuing CQL queries to import the data is the best option. ...
Re: Spark Cassandra Connector proper usage
Hi Ashic, At the moment I see two options:

1) You could use the CassandraConnector object to execute your specialized query. The recommended pattern is to do that within an rdd.foreachPartition(...) in order to amortize DB connection setup over the number of elements in one partition. Something like this:

val sparkContext = ???
val cassandraConnector = CassandraConnector(conf)
val dataRdd = ??? // I assume this is the source of data
val rddThingById = dataRdd.map(elem => transformToIdByThing(elem))
rddThingById.foreachPartition(partition => {
  cassandraConnector.withSessionDo { session =>
    partition.foreach(record =>
      session.execute("update foo set things = things + ? where id=?", record.id, record.thing))
  }
})

2) You could change your data model slightly in order to avoid the update operation. Actually, the Cassandra representation of a set is nothing more than a column -> timestamp mapping, where the column name is an element of the set. So Set(a, b, c) => Column(a) -> ts, Column(b) -> ts, Column(c) -> ts. So, if you desugarize your data model, you could use something like:

create table foo (
  id text,
  bar int,
  things text,
  ts timestamp,
  primary key ((id, bar), things)
)

And your Spark process would be reduced to:

val sparkContext = ???
val dataRdd = ??? // I assume this is the source of data
dataRdd.map(elem => transformToIdBarThingByTimeStamp(elem)).saveToCassandra("ks", "foo", SomeColumns("id", "bar", "thing", "ts"))

Hope this helps. -kr, Gerard.

On Thu, Oct 23, 2014 at 2:48 PM, Ashic Mahtab as...@live.com wrote: Hi Gerard, Thanks for the response. Here's the scenario: ...
RE: Spark Cassandra Connector proper usage
Hi Gerard, I've gone with option 1, and it seems to be working well. Option 2 is also quite interesting. Thanks for your help in this. Regards, Ashic.

From: gerard.m...@gmail.com Date: Thu, 23 Oct 2014 17:07:56 +0200 Subject: Re: Spark Cassandra Connector proper usage To: as...@live.com CC: user@spark.apache.org

Hi Ashic, At the moment I see two options: 1) You could use the CassandraConnector object to execute your specialized query. The recommended pattern is to do that within an rdd.foreachPartition(...) in order to amortize DB connection setup over the number of elements in one partition. ...
Spark Cassandra connector issue
Hi, I am creating a Cassandra Java RDD and transforming it using the where clause. It works fine when I run it outside the mapValues, but when I put the code in mapValues I get an error while creating the transformation. Below is my sample code:

CassandraJavaRDD<ReferenceData> cassandraRefTable = javaFunctions(sc)
    .cassandraTable("reference_data", "dept_reference_data", ReferenceData.class);

JavaPairRDD<String, Employee> joinedRdd = rdd.mapValues(new Function<IPLocation, IPLocation>() {
  public Employee call(Employee employee) throws Exception {
    ReferenceData data = null;
    if (employee.getDepartment() != null) {
      data = referenceTable.where("postal_plus=?", location.getPostalPlus()).first();
      System.out.println(data.toCSV());
    }
    if (data != null) {
      // call setters on employee
    }
    return employee;
  }
}

I get this error:

java.lang.NullPointerException
at org.apache.spark.rdd.RDD.<init>(RDD.scala:125)
at com.datastax.spark.connector.rdd.CassandraRDD.<init>(CassandraRDD.scala:47)
at com.datastax.spark.connector.rdd.CassandraRDD.copy(CassandraRDD.scala:70)
at com.datastax.spark.connector.rdd.CassandraRDD.where(CassandraRDD.scala:77)
at com.datastax.spark.connector.rdd.CassandraJavaRDD.where(CassandraJavaRDD.java:54)

Thanks for help!! Regards Ankur
Using the DataStax Cassandra Connector from PySpark
Hi there, I'm using Spark 1.1.0 and experimenting with trying to use the DataStax Cassandra Connector (https://github.com/datastax/spark-cassandra-connector) from within PySpark. As a baby step, I'm simply trying to validate that I have access to classes that I'd need via Py4J. Sample python program:

from py4j.java_gateway import java_import
from pyspark.conf import SparkConf
from pyspark import SparkContext

conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
sc = SparkContext(appName="Spark + Cassandra Example", conf=conf)
java_import(sc._gateway.jvm, "com.datastax.spark.connector.*")
print sc._jvm.CassandraRow()

CassandraRow corresponds to https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/CassandraRow.scala which is included in the JAR I submit. Feel free to download the JAR here: https://dl.dropboxusercontent.com/u/4385786/pyspark-cassandra-0.1.0-SNAPSHOT-standalone.jar

I'm currently running this Python example with:

spark-submit --driver-class-path=/path/to/pyspark-cassandra-0.1.0-SNAPSHOT-standalone.jar --verbose src/python/cassandara_example.py

But I continually get the following error indicating that the classes aren't in fact on the classpath of the GatewayServer:

Traceback (most recent call last):
  File "/Users/mikesukmanowsky/Development/parsely/pyspark-cassandra/src/python/cassandara_example.py", line 37, in <module>
    main()
  File "/Users/mikesukmanowsky/Development/parsely/pyspark-cassandra/src/python/cassandara_example.py", line 25, in main
    print sc._jvm.CassandraRow()
  File "/Users/mikesukmanowsky/.opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 726, in __getattr__
py4j.protocol.Py4JError: Trying to call a package.

The correct response from the GatewayServer should be:

In [22]: gateway.jvm.CassandraRow()
Out[22]: <py4j.java_gateway.JavaObject id=o0>

Also tried using the --jars option instead and that doesn't seem to work either. Is there something I'm missing as to why the classes aren't available? -- Mike Sukmanowsky Aspiring Digital Carpenter p: +1 (416) 953-4248 e: mike.sukmanow...@gmail.com facebook http://facebook.com/mike.sukmanowsky | twitter http://twitter.com/msukmanowsky | LinkedIn http://www.linkedin.com/profile/view?id=10897143 | github https://github.com/msukmanowsky
Re: Spark Cassandra connector issue
Is this because I am calling a transformation function on an RDD from inside another transformation function? Is it not allowed? Thanks Ankur

On Oct 21, 2014 1:59 PM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Hi Gerard, this is the code that may be helpful.

public class ReferenceDataJoin implements Serializable {

  private static final long serialVersionUID = 1039084794799135747L;
  JavaPairRDD<String, Employee> rdd;
  CassandraJavaRDD<ReferenceData> referenceTable;

  public ReferenceDataJoin(List<Employee> employees) {
    JavaSparkContext sc = SparkContextFactory.getSparkContextFactory().getSparkContext();
    this.rdd = sc.parallelizePairs(employees);
    this.referenceTable = javaFunctions(sc).cassandraTable("reference_data", "dept_reference_data", ReferenceData.class);
  }

  public JavaPairRDD<String, Employee> execute() {
    JavaPairRDD<String, Employee> joinedRdd = rdd.mapValues(new Function<Employee, Employee>() {
      private static final long serialVersionUID = -226016490083377260L;

      @Override
      public Employee call(Employee employee) throws Exception {
        ReferenceData data = null;
        if (employee.getDepartment() != null) {
          data = referenceTable.where("dept=?", employee.getDepartment()).first();
          System.out.println(employee.getDepartment() + " " + data);
        }
        if (data != null) {
          // setters on employee
        }
        return employee;
      }
    });
    return joinedRdd;
  }
}

Thanks Ankur

On Tue, Oct 21, 2014 at 11:11 AM, Gerard Maas gerard.m...@gmail.com wrote: Looks like that code does not correspond to the problem you're facing. I doubt it would even compile. Could you post the actual code? -kr, Gerard

On Oct 21, 2014 7:27 PM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Hi, I am creating a Cassandra Java RDD and transforming it using the where clause. ...
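Nested RDD access is indeed not allowed: an RDD such as referenceTable cannot be dereferenced inside another RDD's transformation, which is consistent with the NullPointerException above. A hedged Scala sketch of one common workaround, collecting the (presumably small) reference table on the driver and broadcasting it; the ReferenceData fields here are assumptions based on the where clause in the thread:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD
import com.datastax.spark.connector._

case class ReferenceData(dept: String, desc: String) // assumed row shape

def enrich(sc: SparkContext, employeesByDept: RDD[(String, String)]) = {
  // pull the small reference table to the driver once...
  val refByDept: Map[String, ReferenceData] =
    sc.cassandraTable[ReferenceData]("reference_data", "dept_reference_data")
      .map(r => r.dept -> r)
      .collect().toMap
  // ...and ship it to every executor as a read-only broadcast variable
  val refB = sc.broadcast(refByDept)
  employeesByDept.mapValues(dept => refB.value.get(dept)) // plain map lookup
}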
Spark Cassandra Connector Issue and performance
Hey all, I tried the Spark connector with Cassandra and ran into a problem that I was blocked on for a couple of weeks. I managed to find a solution to the problem, but I am not sure whether it was a bug in the connector/Spark or not. I had three tables in Cassandra (running Cassandra on a 5 node cluster) and a large Spark cluster (5 worker nodes, each having 32 cores and 240G memory). When I ran my job, which extracts data from S3 and writes to 3 tables in Cassandra using around 1TB of memory and 160 cores, sometimes my job would get stuck at the last few tasks of a stage. After playing around for a while I realised that reducing the number of cores to 2 per machine (10 total) made the job stable. I gradually increased the number of cores and it hung again once I had about 50 cores in total. I would like to know if anyone else has experienced this and if it is explainable. On another note, I would like to know if people are seeing good performance reading from Cassandra using Spark as opposed to reading data from HDFS. Kind of an open question, but I would like to see how others are using it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Cassandra-Connector-Issue-and-performance-tp15005.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Anyone have successful recipe for spark cassandra connector?
I'm running out of options trying to integrate Cassandra, Spark, and the spark-cassandra-connector. I quickly found out that just grabbing the latest versions of everything (drivers, etc.) doesn't work--binary incompatibilities, it would seem. So last I tried using the versions of the drivers from the spark-cassandra-connector's build. Better, but still no dice. Any successes out there? I'd really love to use the stack. If curious, my ridiculously trivial example is here: https://github.com/gzoller/doesntwork If you run 'sbt test' you'll get a NoHostAvailableException complaining that it tried /10.0.0.194:9042. I have no idea where that addr came from. I was trying to connect to local. Any ideas appreciated! Greg -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Anyone-have-successful-recipe-for-spark-cassandra-connector-tp14681.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
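A guess on the mystery address: the driver learns peer addresses from the contact node's system tables, so /10.0.0.194:9042 is likely whatever rpc_address that node advertises rather than anything in the Spark config. A minimal sketch pinning the contact point explicitly (the host value is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("connector-smoke-test")
  // explicit contact point; the node's advertised rpc_address must
  // also be reachable from wherever the driver runs
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)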
problem in using Spark-Cassandra connector
Hi, I am new to Spark. I encountered an issue when trying to connect to Cassandra using the Spark Cassandra connector. Can anyone help me? Following are the details.

1) Spark and Cassandra versions I am using on Lubuntu 12.0: i) spark-1.0.2-bin-hadoop2 ii) apache-cassandra-2.0.10
2) In Cassandra, I created a key space and table, and inserted some data.
3) The following libs are specified when starting the spark-shell:

antworks@INHN1I-DW1804:$ spark-shell --jars /home/antworks/lib-spark-cassandra/apache-cassandra-clientutil-2.0.10.jar,/home/antworks/lib-spark-cassandra/apache-cassandra-thrift-2.0.10.jar,/home/antworks/lib-spark-cassandra/cassandra-driver-core-2.0.2.jar,/home/antworks/lib-spark-cassandra/guava-15.0.jar,/home/antworks/lib-spark-cassandra/joda-convert-1.6.jar,/home/antworks/lib-spark-cassandra/joda-time-2.3.jar,/home/antworks/lib-spark-cassandra/libthrift-0.9.1.jar,/home/antworks/lib-spark-cassandra/spark-cassandra-connector_2.10-1.0.0-rc3.jar

4) When running the stmt val rdd = sc.cassandraTable("EmailKeySpace", "Emails") I encountered the following issue. My application connects to Cassandra and immediately disconnects, throwing java.io.IOException: Table not found: EmailKeySpace.Emails. Here is the stack trace.

scala> import com.datastax.spark.connector._
import com.datastax.spark.connector._

scala> val rdd = sc.cassandraTable("EmailKeySpace", "Emails")
14/09/11 23:06:51 WARN FrameCompressor: Cannot find LZ4 class, you should make sure the LZ4 library is in the classpath if you intend to use it. LZ4 compression will not be available for the protocol.
14/09/11 23:06:51 INFO Cluster: New Cassandra host /172.23.1.68:9042 added
14/09/11 23:06:51 INFO CassandraConnector: Connected to Cassandra cluster: AWCluster
14/09/11 23:06:52 INFO CassandraConnector: Disconnected from Cassandra cluster: AWCluster
java.io.IOException: Table not found: EmailKeySpace.Emails
at com.datastax.spark.connector.rdd.CassandraRDD.tableDef$lzycompute(CassandraRDD.scala:208)
at com.datastax.spark.connector.rdd.CassandraRDD.tableDef(CassandraRDD.scala:205)
at com.datastax.spark.connector.rdd.CassandraRDD.<init>(CassandraRDD.scala:212)
at com.datastax.spark.connector.SparkContextFunctions.cassandraTable(SparkContextFunctions.scala:48)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
at $iwC$$iwC$$iwC.<init>(<console>:24)
at $iwC$$iwC.<init>(<console>:26)
at $iwC.<init>(<console>:28)
at <init>(<console>:30)
at .<init>(<console>:34)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
at
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303
Re: problem in using Spark-Cassandra connector
You will have to create the KeySpace and Table. See the message: Table not found: EmailKeySpace.Emails. Looks like you have not created the Emails table.

On Thu, Sep 11, 2014 at 6:04 PM, Karunya Padala karunya.pad...@infotech-enterprises.com wrote: Hi, I am new to Spark. I encountered an issue when trying to connect to Cassandra using the Spark Cassandra connector. Can anyone help me? ...
RE: problem in using Spark-Cassandra connector
I have created a keyspace called 'EmailKeySpace' and a table called 'Emails' and inserted some data in Cassandra. See my Cassandra console screenshot.

[screenshot attachment: image001.png]

Regards,
Karunya.

From: Reddy Raja [mailto:areddyr...@gmail.com]
Sent: 11 September 2014 18:07
To: Karunya Padala
Cc: u...@spark.incubator.apache.org
Subject: Re: problem in using Spark-Cassandra connector
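Since Karunya reports that the keyspace and table do exist, one plausible culprit to check (an editorial note, not confirmed in the thread) is identifier case: Cassandra lower-cases unquoted identifiers, so a mixed-case name like EmailKeySpace.Emails is only visible to the connector's metadata lookup if it was created with quoted identifiers. A minimal sketch of creating matching names from the spark-shell; the column definitions and replication settings are hypothetical:

import com.datastax.spark.connector.cql.CassandraConnector

// Quoted identifiers preserve case, so the names match
// sc.cassandraTable("EmailKeySpace", "Emails"). The schema below is a placeholder.
CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS "EmailKeySpace"
      |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)
  session.execute(
    """CREATE TABLE IF NOT EXISTS "EmailKeySpace"."Emails" (id uuid PRIMARY KEY, sender text, body text)""")
}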
Cassandra connector
Hi,

I am having difficulty getting the Cassandra connector running within the spark shell. My jars look like:

[wwilkins@phel-spark-001 logs]$ ls -altr /opt/connector/
total 14588
drwxr-xr-x. 5 root root    4096 Sep  9 22:15 ..
-rw-r--r--  1 root root  242809 Sep  9 22:20 spark-cassandra-connector-master.zip
-rw-r--r--  1 root root  541332 Sep  9 22:20 cassandra-driver-core-2.0.3.jar
-rw-r--r--  1 root root 1855685 Sep  9 22:20 cassandra-thrift-2.0.9.jar
-rw-r--r--  1 root root   30085 Sep  9 22:20 commons-codec-1.2.jar
-rw-r--r--  1 root root  315805 Sep  9 22:20 commons-lang3-3.1.jar
-rw-r--r--  1 root root   60686 Sep  9 22:20 commons-logging-1.1.1.jar
-rw-r--r--  1 root root 2228009 Sep  9 22:20 guava-16.0.1.jar
-rw-r--r--  1 root root  433368 Sep  9 22:20 httpclient-4.2.5.jar
-rw-r--r--  1 root root  227275 Sep  9 22:20 httpcore-4.2.4.jar
-rw-r--r--  1 root root 1222059 Sep  9 22:20 ivy-2.3.0.jar.bak
-rw-r--r--  1 root root   38460 Sep  9 22:20 joda-convert-1.2.jar.bak
-rw-r--r--  1 root root   98818 Sep  9 22:20 joda-convert-1.6.jar
-rw-r--r--  1 root root  581571 Sep  9 22:20 joda-time-2.3.jar
-rw-r--r--  1 root root  217053 Sep  9 22:20 libthrift-0.9.1.jar
-rw-r--r--  1 root root     618 Sep  9 22:20 log4j.properties
-rw-r--r--  1 root root  165505 Sep  9 22:20 lz4-1.2.0.jar
-rw-r--r--  1 root root   85448 Sep  9 22:20 metrics-core-3.0.2.jar
-rw-r--r--  1 root root 1231993 Sep  9 22:20 netty-3.9.0.Final.jar
-rw-r--r--  1 root root   26083 Sep  9 22:20 slf4j-api-1.7.2.jar.bak
-rw-r--r--  1 root root   26084 Sep  9 22:20 slf4j-api-1.7.5.jar
-rw-r--r--  1 root root 1251514 Sep  9 22:20 snappy-java-1.0.5.jar
-rw-r--r--  1 root root  776782 Sep  9 22:20 spark-cassandra-connector_2.10-1.0.0-beta1.jar
-rw-r--r--  1 root root  997458 Sep  9 22:20 spark-cassandra-connector_2.10-1.0.0-SNAPSHOT.jar.bak3
-rwxr--r--  1 root root 1113208 Sep  9 22:20 spark-cassandra-connector_2.10-1.1.0-SNAPSHOT.jar.bak
-rw-r--r--  1 root root     804 Sep  9 22:20 spark-cassandra-connector_2.10-1.1.0-SNAPSHOT.jar.bak2

I launch the shell with this command:

/data/spark/bin/spark-shell --driver-class-path $(echo /opt/connector/*.jar | sed 's/ /:/g')

I run these commands:

sc.stop
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
val conf = new SparkConf()
conf.set("spark.cassandra.connection.host", "10.208.59.164")
val sc = new SparkContext(conf)
val table = sc.cassandraTable("retail", "ordercf")

And I get this problem:

scala> val table = sc.cassandraTable("retail", "ordercf")
java.lang.AbstractMethodError
at org.apache.spark.Logging$class.log(Logging.scala:52)
at com.datastax.spark.connector.cql.CassandraConnector$.log(CassandraConnector.scala:145)
at org.apache.spark.Logging$class.logDebug(Logging.scala:63)
at com.datastax.spark.connector.cql.CassandraConnector$.logDebug(CassandraConnector.scala:145)
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createCluster(CassandraConnector.scala:155)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$4.apply(CassandraConnector.scala:152)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$4.apply(CassandraConnector.scala:152)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:36)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:61)
at com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:72)
at com.datastax.spark.connector.cql.Schema.keyspaces$lzycompute(Schema.scala:111)
at com.datastax.spark.connector.cql.Schema.keyspaces(Schema.scala:110)
at com.datastax.spark.connector.cql.Schema.tables$lzycompute(Schema.scala:123)
at com.datastax.spark.connector.cql.Schema.tables(Schema.scala:122)
at com.datastax.spark.connector.rdd.CassandraRDD.tableDef$lzycompute(CassandraRDD.scala:196)
at com.datastax.spark.connector.rdd.CassandraRDD.tableDef(CassandraRDD.scala:195)
at com.datastax.spark.connector.rdd.CassandraRDD.init(CassandraRDD.scala:202)
at com.datastax.spark.connector.package$SparkContextFunctions.cassandraTable(package.scala:94)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:27)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:29)
at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
at $iwC$$iwC$$iwC$$iwC.init(console:33)
at $iwC$$iwC$$iwC.init(console:35)
at $iwC$$iwC.init(console:37)
at $iwC.init(console:39)
at init(console:41)
at .init(console:45)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console
Re: Cassandra connector
Are you using Spark 1.1? If yes, you would have to update the DataStax Cassandra connector code and remove the references to the log methods from CassandraConnector.scala.

Regards,
Gaurav

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-connector-tp13896p13897.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
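The AbstractMethodError above is a binary-compatibility symptom: a connector jar compiled against one Spark release calling into the Logging trait of another. An alternative to patching the connector source is to depend on a connector release built for the Spark version you run; a hedged build.sbt sketch, where the exact version pairing is an assumption to verify against the connector's compatibility notes:

// build.sbt -- versions are illustrative assumptions; pick the connector line
// that was compiled against the Spark release you actually run.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"
)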
RE: Cassandra connector
Thank you! I am running Spark 1.0, but your suggestion worked for me. I commented out all of the logDebug calls in both CassandraConnector.scala and Schema.scala, and I am moving again.

Regards,
Wade

-----Original Message-----
From: gtinside [mailto:gtins...@gmail.com]
Sent: Wednesday, September 10, 2014 8:49 AM
To: u...@spark.incubator.apache.org
Subject: Re: Cassandra connector

Are you using Spark 1.1? If yes, you would have to update the DataStax Cassandra connector code and remove the references to the log methods from CassandraConnector.scala.

Regards,
Gaurav

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-connector-tp13896p13897.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark-cassandra-connector 1.0.0-rc5: java.io.NotSerializableException
Hi,

My version of Spark is 1.0.2. I am trying to use spark-cassandra-connector to execute a CQL update statement inside a CassandraConnector(conf).withSessionDo block:

CassandraConnector(conf).withSessionDo { session =>
  myRdd.foreach { case (ip, values) =>
    session.execute( /* a CQL update statement */ )
  }
}

The above code is in the main method of the driver object. When I run it (in local mode), I get an exception: java.io.NotSerializableException: com.datastax.spark.connector.cql.SessionProxy

Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: com.datastax.spark.connector.cql.SessionProxy
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:772)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:906)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:903)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:903)

Is there a way to use an RDD inside a CassandraConnector(conf).withSessionDo block?

Thanks in advance for any assistance!

Shing

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
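One way around this, sketched under stated assumptions (myRdd and conf as in the message above; keyspace, table, and column names are hypothetical): invert the nesting so the closure shipped to the executors captures only the serializable CassandraConnector, and each partition acquires its own session via withSessionDo.

import com.datastax.spark.connector.cql.CassandraConnector

// Capture the connector (serializable), not a session (the SessionProxy is not);
// each partition then opens a session on the executor side.
val connector = CassandraConnector(conf)
myRdd.foreachPartition { rows =>
  connector.withSessionDo { session =>
    rows.foreach { case (ip, values) =>
      // Placeholder statement; a prepared statement would be preferable in practice.
      session.execute(s"UPDATE mykeyspace.mytable SET vals = '$values' WHERE ip = '$ip'")
    }
  }
}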
Re: How to use spark-cassandra-connector in spark-shell?
try to add the following jars to the classpath:

xxx/cassandra-all-2.0.6.jar:xxx/cassandra-thrift-2.0.6.jar:xxx/libthrift-0.9.1.jar:xxx/cassandra-driver-spark_2.10-1.0.0-SNAPSHOT.jar:xxx/cassandra-java-driver-2.0.2/cassandra-driver-core-2.0.2.jar:xxx/cassandra-java-driver-2.0.2/cassandra-driver-dse-2.0.2.jar

then in spark-shell:

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf(true).set("cassandra.connection.host", "your-cassandra-host")
val sc = new SparkContext("local[1]", "cassandra-driver", conf)
import com.datastax.driver.spark._
sc.cassandraTable("db1", "table1").select("key").count

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-spark-cassandra-connector-in-spark-shell-tp11757p11781.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to use spark-cassandra-connector in spark-shell?
near bottom: http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html

--
Thomas Nieborowski
510-207-7049 mobile
510-339-1716 home
How to use spark-cassandra-connector in spark-shell?
Hello

Is it possible to use spark-cassandra-connector in spark-shell?

Thanks
Gary
Re: How to use spark-cassandra-connector in spark-shell?
Yes, I've done it before. On Thu, Aug 7, 2014 at 10:18 PM, Gary Zhao garyz...@gmail.com wrote: Hello Is it possible to use spark-cassandra-connector in spark-shell? Thanks Gary
Re: How to use spark-cassandra-connector in spark-shell?
Thanks Andrew. How did you do it? On Thu, Aug 7, 2014 at 10:20 PM, Andrew Ash and...@andrewash.com wrote: Yes, I've done it before. On Thu, Aug 7, 2014 at 10:18 PM, Gary Zhao garyz...@gmail.com wrote: Hello Is it possible to use spark-cassandra-connector in spark-shell? Thanks Gary
Re: How to use spark-cassandra-connector in spark-shell?
I don't remember the details, but I think it just took adding the spark-cassandra-connector jar to the spark shell's classpath with --jars or maybe ADD_JARS and then it worked. On Thu, Aug 7, 2014 at 10:24 PM, Gary Zhao garyz...@gmail.com wrote: Thanks Andrew. How did you do it? On Thu, Aug 7, 2014 at 10:20 PM, Andrew Ash and...@andrewash.com wrote: Yes, I've done it before. On Thu, Aug 7, 2014 at 10:18 PM, Gary Zhao garyz...@gmail.com wrote: Hello Is it possible to use spark-cassandra-connector in spark-shell? Thanks Gary
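Putting the thread's answer together, a minimal shell session under stated assumptions: the jar path, host, keyspace, and table names below are hypothetical, and this sketch uses the spark.cassandra.connection.host property of the renamed spark-cassandra-connector rather than the older cassandra-driver-spark package shown earlier in the thread.

// Launched with a hypothetical path:
//   spark-shell --jars /opt/connector/spark-cassandra-connector_2.10-1.0.0-rc3.jar
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

sc.stop // replace the shell's default context with one that knows the Cassandra host
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
val sc2 = new SparkContext("local[2]", "cassandra-shell", conf)
sc2.cassandraTable("mykeyspace", "mytable").count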
spark-cassandra-connector issue
Hello

I'm trying to modify the Spark sample app to integrate with Cassandra; however, I see an exception when submitting the app. Does anyone know why it happens?

Exception in thread main java.lang.NoClassDefFoundError: com/datastax/spark/connector/rdd/reader/RowReaderFactory
at SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.reader.RowReaderFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 8 more

Source code:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector._

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "10.20.132.44")
      .setAppName("Simple Application")
    val logFile = "/home/gzhao/spark/spark-1.0.2-bin-hadoop1/README.md" // Should be some file on your system
    val sc = new SparkContext("spark://mcs-spark-slave1-staging:7077", "idfa_map", conf)
    val rdd = sc.cassandraTable("idfa_map", "bcookie_idfa")
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
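The NoClassDefFoundError means the connector classes are missing from the application's runtime classpath even if the app compiled fine. A hedged sketch of one common remedy, with hypothetical paths: pass the connector jar to spark-submit with --jars so the driver can load RowReaderFactory at startup, and ship it to the executors from the app itself via SparkConf.setJars.

// The jar path is an assumption; the driver additionally needs the jar on its
// own classpath at launch, e.g. spark-submit --jars <path-to-connector-jar>.
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "10.20.132.44")
  .setAppName("Simple Application")
  .setJars(Seq("/home/gzhao/lib/spark-cassandra-connector_2.10-1.0.0-rc3.jar"))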