Multiple Thrift servers on one Spark cluster

2015-08-06 Thread Bojan Kostic
Hi,

Is there a way to instantiate multiple Thrift servers on one Spark cluster?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Thrift-servers-on-one-Spark-cluster-tp24148.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Add row IDs column to data frame

2015-04-09 Thread Bojan Kostic
Hi,

I just checked and I can see that there is a method called withColumn:

def withColumn(colName: String, col: Column): DataFrame

It returns a new DataFrame by adding a column (see
http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrame.html).

I can't test it right now, but I think it should work.
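Since I can't run it here, the behavior can at least be modeled on plain Scala collections (illustration only; the real withColumn takes a Column expression, and the names below are invented):

```scala
// Model of "withColumn returns a new frame with one added column",
// using a Seq of Maps in place of a DataFrame (illustration only).
def withColumn(rows: Seq[Map[String, Int]], name: String)(f: Map[String, Int] => Int): Seq[Map[String, Int]] =
  rows.map(row => row + (name -> f(row)))  // original rows untouched; a new collection is returned

val df  = Seq(Map("id" -> 1), Map("id" -> 2))
val df2 = withColumn(df, "idPlusOne")(_("id") + 1)
// df2 == Seq(Map("id" -> 1, "idPlusOne" -> 2), Map("id" -> 2, "idPlusOne" -> 3))
```

Note that, like the real method, this returns a new "frame" rather than mutating the old one.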

As I see it, the whole idea of DataFrames is to make them work like data frames
in R, and in R you can do this easily.

It was late last night and I was tired, but my idea was: iterate over the
first set and attach an index to every row using accumulators, then iterate
over the other set and attach an index from another accumulator, then build
tuples keyed by those indexes and join. It is ugly and inefficient, and
you should avoid it. :]
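The shape of that index-and-join idea can be modeled on local collections (a sketch only; on RDDs, zipWithIndex would be a simpler substitute for the accumulator bookkeeping, though I have not benchmarked it):

```scala
// Give each element of both sets an index, then join on the index.
// (On an RDD the analogous call would be rdd.zipWithIndex -- an assumption
//  here; this local model only shows the shape of the join.)
val left  = Seq("a", "b", "c")
val right = Seq(10, 20, 30)
val leftIdx  = left.zipWithIndex.map  { case (v, i) => i -> v }.toMap
val rightIdx = right.zipWithIndex.map { case (v, i) => i -> v }.toMap
val joined = leftIdx.keys.toSeq.sorted.map(i => (leftIdx(i), rightIdx(i)))
// joined == Seq(("a", 10), ("b", 20), ("c", 30))
```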

Best

Bojan

On Thu, Apr 9, 2015 at 1:35 AM, barmaley [via Apache Spark User List] 
ml-node+s1001560n22430...@n3.nabble.com wrote:

 Hi Bojan,

 Could you please expand your idea on how to append to RDD? I can think of
 how to append a constant value to each row on RDD:

 //oldRDD - RDD[Array[String]]
 val c = const
 val newRDD = oldRDD.map(r => c +: r)

 But how to append a custom column to RDD? Something like:

 val colToAppend = sc.makeRDD(1 to oldRDD.count().toInt)
 //or sc.parallelize(1 to oldRDD.count().toInt)
 //or (1 to oldRDD.count().toInt).toArray







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Append-column-to-Data-Frame-or-RDD-tp22385p22432.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: SQL can't not create Hive database

2015-04-09 Thread Bojan Kostic
I think it uses a local dir; an HDFS dir path starts with hdfs://.

Check permissions on the folders, and also check the logs. There should be more
info about the exception.

Best
Bojan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-can-t-not-create-Hive-database-tp22435p22439.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Caching and Actions

2015-04-09 Thread Bojan Kostic
You can call toDebugString on an RDD (e.g. println(rdd.toDebugString)) to see all the steps in the job's lineage.

Best
Bojan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418p22433.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Add row IDs column to data frame

2015-04-08 Thread Bojan Kostic
You could convert the DF to an RDD, add the new column in a map phase or in a
join, and then convert back to a DF. I know this is not an elegant solution,
and maybe it is not a solution at all. :) But it is the first thing that
popped into my mind.
I am also new to the DF API.
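A local model of that round trip, with zipWithIndex supplying the new column (sketch only; on a real DataFrame this would be df.rdd plus rebuilding the frame with an extended schema, which I have not tested):

```scala
// Pretend each tuple is a Row: go "DF -> RDD", add a column in map, "-> DF".
// (rdd.zipWithIndex is the Spark analogue of the local call below --
//  an assumption on my part, not verified here.)
val rows   = Seq(("a", 10), ("b", 20))
val withId = rows.zipWithIndex.map { case ((c1, c2), id) => (id.toLong, c1, c2) }
// withId == Seq((0L, "a", 10), (1L, "b", 20))
```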
Best
Bojan
On Apr 9, 2015 00:37, olegshirokikh [via Apache Spark User List] 
ml-node+s1001560n22427...@n3.nabble.com wrote:

 More generic version of a question below:

 Is it possible to append a column to an existing DataFrame at all? I
 understand that this is not an easy task in the Spark environment, but is there
 any workaround?






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Add-row-IDs-column-to-data-frame-tp22385p22428.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark 1.3 build with hive support fails

2015-03-30 Thread Bojan Kostic
Try building with Scala 2.10.

Best
Bojan
On Mar 31, 2015 01:51, nightwolf [via Apache Spark User List] 
ml-node+s1001560n22309...@n3.nabble.com wrote:

 I am having the same problems. Did you find a fix?






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-build-with-hive-support-fails-tp22215p22312.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Nested Case Classes (Found and Required Same)

2015-03-04 Thread Bojan Kostic
Did you find any other way around this issue?
I just found out that I have a 22-column data set, and now I am searching for
the best solution.

Has anyone else experienced this problem?
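One workaround I have seen for Scala 2.10's 22-field case class limit is nesting case classes, which Spark SQL (as far as I know) maps to struct columns; a minimal sketch with invented field names:

```scala
// Split a wide record into nested case classes so no single class
// exceeds 22 fields (field names here are made up for the sketch).
case class Keys(k1: String, k2: String)       // ...up to 22 fields each
case class Metrics(m1: Long, m2: Long)        // ...
case class Record(keys: Keys, metrics: Metrics)

val r = Record(Keys("a", "b"), Metrics(1L, 2L))
// Access stays straightforward: r.keys.k1, r.metrics.m2
```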

Best
Bojan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Nested-Case-Classes-Found-and-Required-Same-tp14096p21908.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark + Tableau

2014-11-11 Thread Bojan Kostic
I finally solved the issue with the Spark/Tableau connection.
Thanks Denny Lee for the blog post:
https://www.concur.com/blog/en-us/connect-tableau-to-sparksql
The solution was to use the authentication type Username, and then use the
username for the metastore.

Best regards
Bojan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720p18591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: SQL COUNT DISTINCT

2014-11-05 Thread Bojan Kostic
Here is the link on JIRA: https://issues.apache.org/jira/browse/SPARK-4243




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-COUNT-DISTINCT-tp17818p18166.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: SQL COUNT DISTINCT

2014-11-03 Thread Bojan Kostic
Hi Michael,
Thanks for the response. I tested the query you sent me, and it really is
faster:
Old query stats by phase:
3.2 min
17 s
Your query stats by phase:
0.3 s
16 s
20 s

But will this improvement also apply when you want to count distinct values on
2 or more fields:
SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4)
FROM parquetFile
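For several fields, each COUNT(DISTINCT fi) can in principle be computed as its own distinct-then-count aggregation; modeled on local collections (the SQL shape in the comment is my assumption, not something I have benchmarked):

```scala
// Per-field distinct counts, each computable independently (and so in
// parallel). Rough SQL shape (assumption): one DISTINCT subquery per
// field, with the scalar results joined at the end.
val rows = Seq(("a", 1, true), ("a", 2, true), ("b", 2, false))
val distinctF2 = rows.map(_._2).distinct.size  // values 1, 2       -> 2
val distinctF3 = rows.map(_._3).distinct.size  // values true,false -> 2
```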

Should I still create a JIRA issue/improvement for this?

@Nick
That also makes sense. But should I just fetch the count of my data to the
driver node?

I just started to learn Spark (and it is great), so sorry if I ask
stupid questions or anything like that.

Best regards
Bojan




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-COUNT-DISTINCT-tp17818p17939.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




SQL COUNT DISTINCT

2014-10-31 Thread Bojan Kostic
While testing Spark SQL I noticed that COUNT DISTINCT runs really slowly.
The map partitions phase finishes fast, but the collect phase is slow,
and it runs on only a single executor.
Should it run this way?

And here is the simple code which I use for testing:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
parquetFile.registerTempTable("parquetFile")
val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
count.map(t => t(0)).collect().foreach(println)

I guess that is because the distinct step must run on a single node, but I
wonder whether I can add some parallelism to the collect process.
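One rewrite that is often suggested for this (an assumption on my part, not verified against this data set) is SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) t, so the DISTINCT becomes a parallel step and only the small distinct set reaches the final count. The semantics, modeled locally:

```scala
// distinct runs first (in Spark, parallelizable across partitions);
// the final count then only sees the already-deduplicated values.
val f2Values = Seq("x", "y", "x", "z", "y")
val distinctCount = f2Values.distinct.size
// distinctCount == 3
```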



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-COUNT-DISTINCT-tp17818.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Spark + Tableau

2014-10-30 Thread Bojan Kostic
I'm testing the beta driver from Databricks for Tableau,
and unfortunately I have encountered some issues.
While a beeline connection works without problems, Tableau can't connect to
the Spark Thrift server.

Error from the driver (Tableau):
Unable to connect to the ODBC Data Source. Check that the necessary drivers
are installed and that the connection properties are valid.
[Simba][SparkODBC] (34) Error from Spark: ETIMEDOUT.

Unable to connect to the server test.server.com. Check that the server is
running and that you have access privileges to the requested database.
Unable to connect to the server. Check that the server is running and that
you have access privileges to the requested database.

Exception on Thrift server:
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:189)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
        at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
        at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
        at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
        at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
        ... 4 more

Is there anyone else testing this driver, or has anyone seen this
message?

Best regards
Bojan Kostić



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark + Tableau

2014-10-30 Thread Bojan Kostic
I use the beta Spark SQL ODBC driver from Databricks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720p17727.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark + Tableau

2014-10-30 Thread Bojan Kostic
I'm connecting to it remotely with Tableau/beeline.

On Thu Oct 30 16:51:13 2014 GMT+0100, Denny Lee [via Apache Spark User List] 
wrote:
 
 
 When you are starting the thrift server service - are you connecting to it
 locally or is this on a remote server when you use beeline and/or Tableau?
 
 On Thu, Oct 30, 2014 at 8:00 AM, Bojan Kostic blood9ra...@gmail.com wrote:
 
  I use beta driver SQL ODBC from Databricks.
 
 
 
  --
  View this message in context:
  http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720p17727.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 
 
 
 
 
 
 ___
 If you reply to this email, your message will be added to the discussion 
 below:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720p17734.html
 To start a new topic under Apache Spark User List, email 
 ml-node+s1001560n1...@n3.nabble.com
 To unsubscribe from Spark + Tableau, visit 
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=17720code=Ymxvb2Q5cmF2ZW5AZ21haWwuY29tfDE3NzIwfDU5NzgxNDc0Ng=

-- 
Sent from my Jolla



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720p17737.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.