How to see Cassandra List / Set / Map values from Spark Hive Thrift JDBC?

2016-02-08 Thread Matthew Johnson
Hi all,



I have asked this question here on StackOverflow:



http://stackoverflow.com/questions/35222365/spark-sql-hivethriftserver2-get-liststring-from-cassandra-in-squirrelsql



But hoping I get more luck from this group. When I write a Java SparkSQL
application to query a CassandraSQLContext and print out a List column, it
prints out the values, e.g.:



WrappedArray(mj485, pd153, ws019)



But when I query this same column from SquirrelSQL, connected via JDBC to the
Spark HiveThriftServer2, I get:


Does anyone have any suggestions as to how I can get HiveThriftServer2 to
return the values the way the Java code does?



Many thanks,

Matthew





PS I am running Spark 1.5.1, Cassandra 2.1.10


Re: Problem to persist Hibernate entity from Spark job

2015-09-06 Thread Matthew Johnson
I agree with Igor - I would either make sure session is ThreadLocal or,
more simply, why not create the session at the start of the saveInBatch
method and close it at the end? Creating a SessionFactory is an expensive
operation but creating a Session is a relatively cheap one.
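
A minimal sketch of that per-call pattern (names are illustrative, and it
assumes a shared SessionFactory field; each partition thread can then call
this safely without sharing a Session):

    import java.util.List;
    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.Transaction;

    private void saveInBatch(List<Object> entities) {
        // Open a short-lived Session per call: Sessions are cheap, and
        // this way nothing is shared across partition threads.
        Session session = sessionFactory.openSession();
        Transaction tx = null;
        try {
            tx = session.beginTransaction();
            for (Object entity : entities) {
                session.save(entity);
            }
            tx.commit();
        } catch (Exception ex) {
            if (tx != null) {
                tx.rollback();
            }
        } finally {
            session.close();
        }
    }
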
On 6 Sep 2015 07:27, "Igor Berman"  wrote:

> how do you create your session? do you reuse it across threads? how do you
> create/close session manager?
> look for the problem in session creation, probably something deadlocked,
> as far as I remember hib.session should be created per thread
>
> On 6 September 2015 at 07:11, Zoran Jeremic 
> wrote:
>
>> Hi,
>>
>> I'm developing a long-running process that should find the RSS feeds that
>> all users in the system have registered to follow, parse those feeds,
>> extract new entries and store them back to the database as Hibernate
>> entities, so users can retrieve them. I want to use Apache Spark to enable
>> parallel processing, since this process might take several hours depending
>> on the number of users.
>>
>> The approach I thought should work was to use
>> *useridsRDD.foreachPartition*, so I can have a separate Hibernate session
>> for each partition. I created a database session manager, initialized for
>> each partition, which keeps the Hibernate session alive until the process
>> is over.
>>
>> Once all RSS feeds from one source are parsed and Feed entities are
>> created, I send the whole list to a database manager method that saves
>> it in batch:
>>
>>> public void saveInBatch(List entities) {
>>>   try {
>>>     boolean isActive = session.getTransaction().isActive();
>>>     if (!isActive) {
>>>       session.beginTransaction();
>>>     }
>>>     for (Object entity : entities) {
>>>       session.save(entity);
>>>     }
>>>     session.getTransaction().commit();
>>>   } catch (Exception ex) {
>>>     if (session.getTransaction() != null) {
>>>       session.getTransaction().rollback();
>>>       ex.printStackTrace();
>>>     }
>>>   }
>>> }
>> However, this works only if I have one Spark partition. If there are two
>> or more partitions, the whole process blocks as soon as I try to save the
>> first entity. To make things simpler, I tried simplifying the Feed entity
>> so that it neither references nor is referenced by any other entity, and
>> has no collections.
>>
>> I hope that some of you have already tried something similar and could
>> give me an idea of how to solve this problem.
>>
>> Thanks,
>> Zoran
>>
>>
>


RE: Code review - Spark SQL command-line client for Cassandra

2015-06-23 Thread Matthew Johnson
Awesome, thanks Pawan – for now I’ll give spark-notebook a go until
Zeppelin catches up to Spark 1.4 (and when Zeppelin has a binary release –
my PC doesn’t seem too happy about building a Node.js app from source).
Thanks for the detailed instructions!!





*From:* pawan kumar [mailto:pkv...@gmail.com]
*Sent:* 22 June 2015 18:53
*To:* Matthew Johnson
*Cc:* Silvio Fiorito; Mohammed Guller; shahid ashraf; user
*Subject:* Re: Code review - Spark SQL command-line client for Cassandra



Hi Matthew,



You could add the dependencies yourself by using the %dep command in
Zeppelin (https://zeppelin.incubator.apache.org/docs/interpreter/spark.html).
I have not tried it with Zeppelin, but I have used spark-notebook
<https://github.com/andypetrella/spark-notebook> and got the Cassandra
connector working. I have provided samples below.



*In Zeppelin: (Not Tested)*



%dep z.load("com.datastax.spark:spark-cassandra-connector_2.11:1.4.0-M1")



Note: for Spark and Cassandra to work together, the Spark,
Spark-Cassandra-Connector, and spark-notebook Spark versions should match.
In the tested case below it was 1.2.0.



*If using spark-notebook: (Tested & works)*

Installed:

1. Apache Spark 1.2.0

2. Cassandra DSE - 1 node (just Cassandra, no analytics)

3. Notebook:

wget
https://s3.eu-central-1.amazonaws.com/spark-notebook/tgz/spark-notebook-0.4.3-scala-2.10.4-spark-1.2.0-hadoop-2.4.0.tgz



Once the notebook has been started:

http://ec2-xx-x-xx-xxx.us-west-x.compute.amazonaws.com:9000/#clusters


Select Standalone:

In SparkConf, update the Spark master IP to the EC2 internal DNS name.
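
For example (hostname illustrative):

    spark.master  spark://ip-10-0-0-1.us-west-2.compute.internal:7077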



*In Spark Notebook:*

:dp "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.2.0-rc3"



import com.datastax.spark.connector._

import com.datastax.spark.connector.rdd.CassandraRDD



val cassandraHost:String = "localhost"

reset(lastChanges = _.set("spark.cassandra.connection.host", cassandraHost))

val rdd = sparkContext.cassandraTable("excelsior","test")

rdd.toArray.foreach(println)



Note: for Spark and Cassandra to work together, the Spark,
Spark-Cassandra-Connector, and spark-notebook Spark versions should match.
In the above case it was 1.2.0.


On Mon, Jun 22, 2015 at 9:52 AM, Matthew Johnson 
wrote:

Hi Pawan,



Looking at the changes for that git pull request, it looks like it just
pulls in the dependency (and transitives) for “spark-cassandra-connector”.
Since I am having to build Zeppelin myself anyway, would it be ok to just
add this myself for the connector for 1.4.0 (as found here
http://search.maven.org/#artifactdetails%7Ccom.datastax.spark%7Cspark-cassandra-connector_2.11%7C1.4.0-M1%7Cjar)?
What exactly is it that does not currently exist for Spark 1.4?



Thanks,

Matthew



*From:* pawan kumar [mailto:pkv...@gmail.com]
*Sent:* 22 June 2015 17:19
*To:* Silvio Fiorito
*Cc:* Mohammed Guller; Matthew Johnson; shahid ashraf; user@spark.apache.org
*Subject:* Re: Code review - Spark SQL command-line client for Cassandra



Hi,



Zeppelin has a cassandra-spark-connector built into the build. I have not
tried it yet; maybe you could let us know.



https://github.com/apache/incubator-zeppelin/pull/79



To build a Zeppelin version with the *Datastax Spark/Cassandra connector
<https://github.com/datastax/spark-cassandra-connector>*

mvn clean package *-Pcassandra-spark-1.x* -Dhadoop.version=xxx -Phadoop-x.x
-DskipTests

Right now the Spark/Cassandra connector is available for *Spark 1.1* and *Spark
1.2*. Support for *Spark 1.3* is not released yet (*but you can build your
own Spark/Cassandra connector version **1.3.0-SNAPSHOT*). Support for *Spark
1.4* does not exist yet.

Please do not forget to add -Dspark.cassandra.connection.host=xxx to the
*ZEPPELIN_JAVA_OPTS* parameter in the *conf/zeppelin-env.sh* file. Alternatively,
you can add this parameter to the parameter list of the *Spark interpreter* in
the GUI.
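
For example, in conf/zeppelin-env.sh (host address illustrative):

    export ZEPPELIN_JAVA_OPTS="-Dspark.cassandra.connection.host=10.0.0.5"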



-Pawan


On Mon, Jun 22, 2015 at 9:04 AM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

Yes, just put the Cassandra connector on the Spark classpath and set the
connector config properties in the interpreter settings.
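
One way that can look (jar path and host are illustrative), e.g. in
conf/spark-defaults.conf for a standalone cluster:

    spark.executor.extraClassPath    /opt/jars/spark-cassandra-connector-assembly.jar
    spark.driver.extraClassPath      /opt/jars/spark-cassandra-connector-assembly.jar
    spark.cassandra.connection.host  10.0.0.5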



*From: *Mohammed Guller
*Date: *Monday, June 22, 2015 at 11:56 AM
*To: *Matthew Johnson, shahid ashraf


*Cc: *"user@spark.apache.org"
*Subject: *RE: Code review - Spark SQL command-line client for Cassandra



I haven’t tried using Zeppelin with Spark on Cassandra, so can’t say for
sure, but it should not be difficult.



Mohammed



*From:* Matthew Johnson [mailto:matt.john...@algomi.com]
*Sent:* Monday, June 22, 2015 2:15 AM
*To:* Mohammed Guller; shahid ashraf
*Cc:* user@spark.apache.org
*Subject:* RE: Code review - Spark SQL command-line client for Cassandra



Thanks Mohammed, it’s good to know I’m not alone!



How easy is it to integrate Zeppelin with Spark on Cassandra? It looks like
it would only support Hadoop out of the box. Is it just a case of dropping
the Cassandra Connector onto the Spark classpath?

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Matthew Johnson
Hi Pawan,



Looking at the changes for that git pull request, it looks like it just
pulls in the dependency (and transitives) for “spark-cassandra-connector”.
Since I am having to build Zeppelin myself anyway, would it be ok to just
add this myself for the connector for 1.4.0 (as found here
http://search.maven.org/#artifactdetails%7Ccom.datastax.spark%7Cspark-cassandra-connector_2.11%7C1.4.0-M1%7Cjar)?
What exactly is it that does not currently exist for Spark 1.4?



Thanks,

Matthew



*From:* pawan kumar [mailto:pkv...@gmail.com]
*Sent:* 22 June 2015 17:19
*To:* Silvio Fiorito
*Cc:* Mohammed Guller; Matthew Johnson; shahid ashraf; user@spark.apache.org
*Subject:* Re: Code review - Spark SQL command-line client for Cassandra



Hi,



Zeppelin has a cassandra-spark-connector built into the build. I have not
tried it yet; maybe you could let us know.



https://github.com/apache/incubator-zeppelin/pull/79



To build a Zeppelin version with the *Datastax Spark/Cassandra connector
<https://github.com/datastax/spark-cassandra-connector>*

mvn clean package *-Pcassandra-spark-1.x* -Dhadoop.version=xxx -Phadoop-x.x
-DskipTests

Right now the Spark/Cassandra connector is available for *Spark 1.1* and *Spark
1.2*. Support for *Spark 1.3* is not released yet (*but you can build your
own Spark/Cassandra connector version **1.3.0-SNAPSHOT*). Support for *Spark
1.4* does not exist yet.

Please do not forget to add -Dspark.cassandra.connection.host=xxx to the
*ZEPPELIN_JAVA_OPTS* parameter in the *conf/zeppelin-env.sh* file. Alternatively,
you can add this parameter to the parameter list of the *Spark interpreter* in
the GUI.



-Pawan


On Mon, Jun 22, 2015 at 9:04 AM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

Yes, just put the Cassandra connector on the Spark classpath and set the
connector config properties in the interpreter settings.



*From: *Mohammed Guller
*Date: *Monday, June 22, 2015 at 11:56 AM
*To: *Matthew Johnson, shahid ashraf


*Cc: *"user@spark.apache.org"
*Subject: *RE: Code review - Spark SQL command-line client for Cassandra



I haven’t tried using Zeppelin with Spark on Cassandra, so can’t say for
sure, but it should not be difficult.



Mohammed



*From:* Matthew Johnson [mailto:matt.john...@algomi.com]
*Sent:* Monday, June 22, 2015 2:15 AM
*To:* Mohammed Guller; shahid ashraf
*Cc:* user@spark.apache.org
*Subject:* RE: Code review - Spark SQL command-line client for Cassandra



Thanks Mohammed, it’s good to know I’m not alone!



How easy is it to integrate Zeppelin with Spark on Cassandra? It looks like
it would only support Hadoop out of the box. Is it just a case of dropping
the Cassandra Connector onto the Spark classpath?



Cheers,

Matthew



*From:* Mohammed Guller [mailto:moham...@glassbeam.com]
*Sent:* 20 June 2015 17:27
*To:* shahid ashraf
*Cc:* Matthew Johnson; user@spark.apache.org
*Subject:* RE: Code review - Spark SQL command-line client for Cassandra



It is a simple Play-based web application. It exposes a URI for submitting
a SQL query. It then executes that query using CassandraSQLContext provided
by Spark Cassandra Connector. Since it is web-based, I added an
authentication and authorization layer to make sure that only users with
the right authorization can use it.



I am happy to open-source that code if there is interest. Just need to
carve out some time to clean it up and remove all the other services that
this web application provides.



Mohammed



*From:* shahid ashraf [mailto:sha...@trialx.com ]
*Sent:* Saturday, June 20, 2015 6:52 AM
*To:* Mohammed Guller
*Cc:* Matthew Johnson; user@spark.apache.org
*Subject:* RE: Code review - Spark SQL command-line client for Cassandra



Hi Mohammed,
Can you provide more info about the service you developed?

On Jun 20, 2015 7:59 AM, "Mohammed Guller"  wrote:

Hi Matthew,

It looks fine to me. I have built a similar service that allows a user to
submit a query from a browser and returns the result in JSON format.



Another alternative is to leave a Spark shell or one of the notebooks
(Spark Notebook, Zeppelin, etc.) session open and run queries from there.
This model works only if people give you the queries to execute.



Mohammed



*From:* Matthew Johnson [mailto:matt.john...@algomi.com]
*Sent:* Friday, June 19, 2015 2:20 AM
*To:* user@spark.apache.org
*Subject:* Code review - Spark SQL command-line client for Cassandra



Hi all,



I have been struggling with Cassandra’s lack of adhoc query support (I know
this is an anti-pattern of Cassandra, but sometimes management come over
and ask me to run stuff and it’s impossible to explain that it will take me
a while when it would take about 10 seconds in MySQL) so I have put
together the following code snippet that bundles DataStax’s Cassandra Spark
connector and allows you to submit Spark SQL to it, outputting the results
in a text file.



Does anyone spot any obvious flaws in this plan?? (I have a lot more error
handling etc in my code, but removed it here for brevity)

RE: Help optimising Spark SQL query

2015-06-22 Thread Matthew Johnson
Hi James,



What version of Spark are you using? In Spark 1.2.2 I had an issue where
Spark would report a job as complete but I couldn’t find my results
anywhere – I just assumed it was me doing something wrong as I am still
quite new to Spark. However, since upgrading to 1.4.0 I have not seen this
issue, so it might be worth upgrading if you are not already on 1.4.



Cheers,

Matthew





*From:* Lior Chaga [mailto:lio...@taboola.com]
*Sent:* 22 June 2015 17:24
*To:* James Aley
*Cc:* user
*Subject:* Re: Help optimising Spark SQL query



Hi James,



There are a few configurations that you can try:

https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options



From my experience, the codegen really boosts things up. Just run
sqlContext.sql("SET spark.sql.codegen=true") before you execute your query. But
keep in mind that sometimes this is buggy (depending on your query), so
compare to results without codegen to be sure.

Also you can try changing the default partitions.
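
For example (the value is illustrative; the default is 200):

    sqlContext.setConf("spark.sql.shuffle.partitions", "64")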



You can also use DataFrames (since 1.3). Not sure they are better than
specifying the query in 1.3, but with Spark 1.4 there should be an enormous
performance improvement in DataFrames.
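
A rough Java sketch of the DataFrame route, applied to the query quoted
below (the S3 path is illustrative; the raw-millis filter stands in for the
per-row from_unixtime/BETWEEN, using the UTC epoch millis for 2015-06-09
and 2015-06-16):

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.DataFrame;

    DataFrame events = sqlContext.read().parquet("s3n://bucket/usage_events");
    DataFrame result = events
        // 1433808000000 = 2015-06-09 00:00 UTC, 1434412800000 = 2015-06-16 00:00 UTC
        .filter(col("timestamp_millis").between(1433808000000L, 1434412800000L))
        .agg(count(lit(1)).alias("uses"),
             countDistinct(col("id").cast("string")).alias("users"));
    result.show();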



Lior



On Mon, Jun 22, 2015 at 6:28 PM, James Aley  wrote:

Hello,



A colleague of mine ran the following Spark SQL query:



select
  count(*) as uses,
  count(distinct cast(id as string)) as users
from usage_events
where
  from_unixtime(cast(timestamp_millis/1000 as bigint))
    between '2015-06-09' and '2015-06-16'



The table contains billions of rows, but totals only 64GB of data across
~30 separate files, which are stored as Parquet with LZO compression in S3.



From the referenced columns:



* id is Binary, which we cast to a String so that we can DISTINCT by it. (I
was already told this will improve in a later release, in a separate
thread.)

* timestamp_millis is a long, containing a unix timestamp with millisecond
resolution



This took nearly 2 hours to run on a 5 node cluster of r3.xlarge EC2
instances, using 20 executors, each with 4GB memory. I can see from
monitoring tools that the CPU usage is at 100% on all nodes, but incoming
network seems a bit low at 2.5MB/s, suggesting to me that this is CPU-bound.



Does that seem slow? Can anyone offer any ideas by glancing at the query as
to why this might be slow? We'll profile it meanwhile and post back if we
find anything ourselves.



A side issue - I've found that this query, and others, sometimes completes
but doesn't return any results. There appears to be no error that I can see
in the logs, and Spark reports the job as successful, but the connected
JDBC client (SQLWorkbenchJ in this case), just sits there forever waiting.
I did a quick Google and couldn't find anyone else having similar issues.





Many thanks,



James.


RE: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Matthew Johnson
Thanks Mohammed, it’s good to know I’m not alone!



How easy is it to integrate Zeppelin with Spark on Cassandra? It looks like
it would only support Hadoop out of the box. Is it just a case of dropping
the Cassandra Connector onto the Spark classpath?



Cheers,

Matthew



*From:* Mohammed Guller [mailto:moham...@glassbeam.com]
*Sent:* 20 June 2015 17:27
*To:* shahid ashraf
*Cc:* Matthew Johnson; user@spark.apache.org
*Subject:* RE: Code review - Spark SQL command-line client for Cassandra



It is a simple Play-based web application. It exposes a URI for submitting
a SQL query. It then executes that query using CassandraSQLContext provided
by Spark Cassandra Connector. Since it is web-based, I added an
authentication and authorization layer to make sure that only users with
the right authorization can use it.



I am happy to open-source that code if there is interest. Just need to
carve out some time to clean it up and remove all the other services that
this web application provides.



Mohammed



*From:* shahid ashraf [mailto:sha...@trialx.com ]
*Sent:* Saturday, June 20, 2015 6:52 AM
*To:* Mohammed Guller
*Cc:* Matthew Johnson; user@spark.apache.org
*Subject:* RE: Code review - Spark SQL command-line client for Cassandra



Hi Mohammed,
Can you provide more info about the service you developed?

On Jun 20, 2015 7:59 AM, "Mohammed Guller"  wrote:

Hi Matthew,

It looks fine to me. I have built a similar service that allows a user to
submit a query from a browser and returns the result in JSON format.



Another alternative is to leave a Spark shell or one of the notebooks
(Spark Notebook, Zeppelin, etc.) session open and run queries from there.
This model works only if people give you the queries to execute.



Mohammed



*From:* Matthew Johnson [mailto:matt.john...@algomi.com]
*Sent:* Friday, June 19, 2015 2:20 AM
*To:* user@spark.apache.org
*Subject:* Code review - Spark SQL command-line client for Cassandra



Hi all,



I have been struggling with Cassandra’s lack of adhoc query support (I know
this is an anti-pattern of Cassandra, but sometimes management come over
and ask me to run stuff and it’s impossible to explain that it will take me
a while when it would take about 10 seconds in MySQL) so I have put
together the following code snippet that bundles DataStax’s Cassandra Spark
connector and allows you to submit Spark SQL to it, outputting the results
in a text file.



Does anyone spot any obvious flaws in this plan?? (I have a lot more error
handling etc in my code, but removed it here for brevity)



private void run(String sqlQuery) {
    SparkContext scc = new SparkContext(conf);
    CassandraSQLContext csql = new CassandraSQLContext(scc);
    DataFrame sql = csql.sql(sqlQuery);
    String folderName = "/tmp/output_" + System.currentTimeMillis();
    LOG.info("Attempting to save SQL results in folder: " + folderName);
    sql.rdd().saveAsTextFile(folderName);
    LOG.info("SQL results saved");
}

public static void main(String[] args) {

    String sparkMasterUrl = args[0];
    String sparkHost = args[1];
    String sqlQuery = args[2];

    SparkConf conf = new SparkConf();
    conf.setAppName("Java Spark SQL");
    conf.setMaster(sparkMasterUrl);
    conf.set("spark.cassandra.connection.host", sparkHost);

    JavaSparkSQL app = new JavaSparkSQL(conf);

    app.run(sqlQuery);
}



I can then submit this to Spark with ‘spark-submit’:



./spark-submit --class com.algomi.spark.JavaSparkSQL \
  --master spark://sales3:7077 \
  spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
  spark://sales3:7077 sales3 "select * from mykeyspace.operationlog"



It seems to work pretty well, so I’m pretty happy, but wondering why this
isn’t common practice (at least I haven’t been able to find much about it
on Google) – is there something terrible that I’m missing?



Thanks!

Matthew


Code review - Spark SQL command-line client for Cassandra

2015-06-19 Thread Matthew Johnson
Hi all,



I have been struggling with Cassandra’s lack of adhoc query support (I know
this is an anti-pattern of Cassandra, but sometimes management come over
and ask me to run stuff and it’s impossible to explain that it will take me
a while when it would take about 10 seconds in MySQL) so I have put
together the following code snippet that bundles DataStax’s Cassandra Spark
connector and allows you to submit Spark SQL to it, outputting the results
in a text file.



Does anyone spot any obvious flaws in this plan?? (I have a lot more error
handling etc in my code, but removed it here for brevity)



private void run(String sqlQuery) {
    SparkContext scc = new SparkContext(conf);
    CassandraSQLContext csql = new CassandraSQLContext(scc);
    DataFrame sql = csql.sql(sqlQuery);
    String folderName = "/tmp/output_" + System.currentTimeMillis();
    LOG.info("Attempting to save SQL results in folder: " + folderName);
    sql.rdd().saveAsTextFile(folderName);
    LOG.info("SQL results saved");
}

public static void main(String[] args) {

    String sparkMasterUrl = args[0];
    String sparkHost = args[1];
    String sqlQuery = args[2];

    SparkConf conf = new SparkConf();
    conf.setAppName("Java Spark SQL");
    conf.setMaster(sparkMasterUrl);
    conf.set("spark.cassandra.connection.host", sparkHost);

    JavaSparkSQL app = new JavaSparkSQL(conf);

    app.run(sqlQuery);
}



I can then submit this to Spark with ‘spark-submit’:



./spark-submit --class com.algomi.spark.JavaSparkSQL \
  --master spark://sales3:7077 \
  spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
  spark://sales3:7077 sales3 "select * from mykeyspace.operationlog"



It seems to work pretty well, so I’m pretty happy, but wondering why this
isn’t common practice (at least I haven’t been able to find much about it
on Google) – is there something terrible that I’m missing?



Thanks!

Matthew


Spark on Cassandra

2015-04-29 Thread Matthew Johnson
Hi all,



I am new to Spark, but excited to use it with our Cassandra cluster. I have
read in a few places that Spark can interact directly with Cassandra now,
so I decided to download it and have a play – I am happy to run it in
standalone cluster mode initially. When I go to download it (
http://spark.apache.org/downloads.html) I see a bunch of pre-built versions
for Hadoop and MapR, but no mention of Cassandra – if I am running it in
standalone cluster mode, does it matter which pre-built package I download?
Would all of them work? Or do I have to build it myself from source with
some special config for Cassandra?



Thanks!

Matt