Re: Create table before inserting in SQL

2023-02-02 Thread Harut Martirosyan
Thank you very much.

I understand the performance implications and that Spark will download the table 
before modifying it. 
The JDBC database is just extremely small; it’s the BI/aggregated layer.

What’s interesting is that the page here says I can use JDBC:

https://spark.apache.org/docs/3.3.1/sql-ref-syntax-dml-insert-overwrite-directory.html

But when I try it, I get an error that the underlying datastore should be file-based, 
so I guess this is a documentation mistake.

Thank you one more time.

> On 2 Feb 2023, at 23:11, Mich Talebzadeh  wrote:
> 
> Please bear in mind that INSERT/UPDATE/DELETE operations are DML, whereas 
> CREATE/DROP TABLE are DDL operations that are best performed in the native 
> database, which I presume is transactional.
> 
> Can you CREATE TABLE beforehand (before any insert of data) using the native JDBC 
> database syntax?
> 
> Alternatively, you may be able to do so in Python or Scala, but I don't know 
> of a way in pure SQL.
> 
> If your JDBC database is Hive, you can do so easily.
> 
> HTH
> 
> 
> 
>view my Linkedin profile 
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Thu, 2 Feb 2023 at 17:26, Harut Martirosyan <harut.martiros...@gmail.com> wrote:
>> Generally, the problem is that I can’t find a way to automatically create a 
>> JDBC table in the JDBC database when I want to insert data into it using 
>> Spark SQL only, not the DataFrames API.
>> 
>>> On 2 Feb 2023, at 21:22, Harut Martirosyan <harut.martiros...@gmail.com> wrote:
>>> 
>>> Hi, thanks for the reply.
>>> 
>>> Let’s imagine we have a Parquet-based table called parquet_table, and now I 
>>> want to insert it into a new JDBC table, all using pure SQL.
>>> 
>>> If the JDBC table already exists, it’s easy: we do CREATE TABLE USING JDBC 
>>> and then we do INSERT INTO that table.
>>> 
>>> If the table doesn’t exist, is there a way to create it using Spark SQL 
>>> only? I don’t want to use the DataFrames API; I know that I can use .write() 
>>> for that, but I want to keep it in pure SQL, since that is more 
>>> comprehensible for data analysts.
>>> 
>>>> On 2 Feb 2023, at 02:08, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Your statement below is not very clear:
>>>> 
>>>> ".. If the table existed, I would create a table using JDBC in spark SQL 
>>>> and then insert into it, but I can't create a table if it doesn't exist in 
>>>> JDBC database..."
>>>> 
>>>> If the table exists in your JDBC database, why do you need to create it?
>>>> 
>>>> How do you verify if it exists? Can you share the code and the doc link?
>>>> 
>>>> HTH
>>>> 
>>>> 
>>>> 
>>>>view my Linkedin profile 
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>> 
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>> 
>>>>  
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>>> loss, damage or destruction of data or any other property which may arise 
>>>> from relying on this email's technical content is explicitly disclaimed. 
>>>> The author will in no case be liable for any monetary damages arising from 
>>>> such loss, damage or destruction.
>>>>  
>>>> 
>>>> 
>>>> On Wed, 1 Feb 2023 at 19:33, Harut Martirosyan <harut.martiros...@gmail.com> wrote:
>>>>> I have a resultset (defined in SQL), and I want to insert it into my JDBC 
>>>>> database using only SQL, not the DataFrames API.
>>>>> 
>>>>> If the table existed, I would create a table using JDBC in Spark SQL and 
>>>>> then insert into it, but I can't create a table if it doesn't already exist 
>>>>> in the JDBC database.
>>>>> 
>>>>> How to do that using pure SQL (no python/scala/java)?
>>>>> 
>>>>> I am trying to use INSERT OVERWRITE DIRECTORY with the JDBC format 
>>>>> (as suggested by the documentation), but as expected this functionality is 
>>>>> available only for file-based storage systems.
>>>>> 
>>>>> -- 
>>>>> RGRDZ Harut
>>> 
>> 



Re: Create table before inserting in SQL

2023-02-02 Thread Harut Martirosyan
Generally, the problem is that I can’t find a way to automatically create a 
JDBC table in the JDBC database when I want to insert data into it using Spark 
SQL only, not the DataFrames API.

> On 2 Feb 2023, at 21:22, Harut Martirosyan  
> wrote:
> 
> Hi, thanks for the reply.
> 
> Let’s imagine we have a Parquet-based table called parquet_table, and now I want 
> to insert it into a new JDBC table, all using pure SQL.
> 
> If the JDBC table already exists, it’s easy: we do CREATE TABLE USING JDBC 
> and then we do INSERT INTO that table.
> 
> If the table doesn’t exist, is there a way to create it using Spark SQL only? 
> I don’t want to use the DataFrames API; I know that I can use .write() for that, 
> but I want to keep it in pure SQL, since that is more comprehensible for data 
> analysts.
> 
>> On 2 Feb 2023, at 02:08, Mich Talebzadeh  wrote:
>> 
>> Hi,
>> 
>> Your statement below is not very clear:
>> 
>> ".. If the table existed, I would create a table using JDBC in spark SQL and 
>> then insert into it, but I can't create a table if it doesn't exist in JDBC 
>> database..."
>> 
>> If the table exists in your JDBC database, why do you need to create it?
>> 
>> How do you verify if it exists? Can you share the code and the doc link?
>> 
>> HTH
>> 
>> 
>> 
>>view my Linkedin profile 
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>> 
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>  
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> 
>> On Wed, 1 Feb 2023 at 19:33, Harut Martirosyan <harut.martiros...@gmail.com> wrote:
>>> I have a resultset (defined in SQL), and I want to insert it into my JDBC 
>>> database using only SQL, not the DataFrames API.
>>> 
>>> If the table existed, I would create a table using JDBC in Spark SQL and 
>>> then insert into it, but I can't create a table if it doesn't already exist 
>>> in the JDBC database.
>>> 
>>> How to do that using pure SQL (no python/scala/java)?
>>> 
>>> I am trying to use INSERT OVERWRITE DIRECTORY with the JDBC format 
>>> (as suggested by the documentation), but as expected this functionality is 
>>> available only for file-based storage systems.
>>> 
>>> -- 
>>> RGRDZ Harut
> 



Re: Create table before inserting in SQL

2023-02-02 Thread Harut Martirosyan
Hi, thanks for the reply.

Let’s imagine we have a Parquet-based table called parquet_table, and now I want to 
insert it into a new JDBC table, all using pure SQL.

If the JDBC table already exists, it’s easy: we do CREATE TABLE USING JDBC and 
then we do INSERT INTO that table.
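
For concreteness, a minimal sketch of that working case (shown via spark.sql; the JDBC
URL, table name, and credentials are placeholders):

// This works because public.bi_table already exists in the JDBC database,
// so Spark can pick up its schema:
spark.sql("""
  CREATE TABLE bi_table
  USING JDBC
  OPTIONS (
    url 'jdbc:postgresql://bi-host:5432/bi',
    dbtable 'public.bi_table',
    user 'spark', password '***'
  )
""")

// Then the insert itself:
spark.sql("INSERT INTO bi_table SELECT * FROM parquet_table")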

If the table doesn’t exist, is there a way to create it using Spark SQL only? I 
don’t want to use the DataFrames API; I know that I can use .write() for that, but 
I want to keep it in pure SQL, since that is more comprehensible for data 
analysts.

> On 2 Feb 2023, at 02:08, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> Your statement below is not very clear:
> 
> ".. If the table existed, I would create a table using JDBC in spark SQL and 
> then insert into it, but I can't create a table if it doesn't exist in JDBC 
> database..."
> 
> If the table exists in your JDBC database, why do you need to create it?
> 
> How do you verify if it exists? Can you share the code and the doc link?
> 
> HTH
> 
> 
> 
>view my Linkedin profile 
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Wed, 1 Feb 2023 at 19:33, Harut Martirosyan <harut.martiros...@gmail.com> wrote:
>> I have a resultset (defined in SQL), and I want to insert it into my JDBC 
>> database using only SQL, not the DataFrames API.
>> 
>> If the table existed, I would create a table using JDBC in Spark SQL and 
>> then insert into it, but I can't create a table if it doesn't already exist 
>> in the JDBC database.
>> 
>> How to do that using pure SQL (no python/scala/java)?
>> 
>> I am trying to use INSERT OVERWRITE DIRECTORY with the JDBC format 
>> (as suggested by the documentation), but as expected this functionality is 
>> available only for file-based storage systems.
>> 
>> -- 
>> RGRDZ Harut



Create table before inserting in SQL

2023-02-01 Thread Harut Martirosyan
I have a resultset (defined in SQL), and I want to insert it into my JDBC
database using only SQL, not the DataFrames API.

If the table existed, I would create a table using JDBC in Spark SQL and
then insert into it, but I can't create a table if it doesn't already exist in
the JDBC database.

How to do that using pure SQL (no python/scala/java)?

I am trying to use INSERT OVERWRITE DIRECTORY with the JDBC format
(as suggested by the documentation), but as expected this functionality is
available only for file-based storage systems.

-- 
RGRDZ Harut


Custom metrics in py-spark 3

2022-04-14 Thread Harut Martirosyan
Hello.

We’re successfully exporting technical metrics to Prometheus using the built-in 
capabilities of Spark 3, but we also need to add custom business metrics from 
Python. There seems to be no documentation for that.

Thanks.



Simple but faster data streaming

2015-04-02 Thread Harut Martirosyan
Hi guys.

Is there a more lightweight way of stream processing with Spark? What we
want is a simpler way, preferably with no scheduling, which just streams
the data to multiple destinations.

We extensively use Spark Core, SQL, Streaming, and GraphX, so it's our main
tool and we don't want to add new things like Storm or Flume to the stack. On
the other hand, it really takes much more resources for the same streaming
workload than our previous setup with Flume, especially if we have multiple
destinations (each destination triggers its own actions/scheduling).
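
To make the "multiple destinations" point concrete: each output sink is its own output
action, so every batch gets scheduled as several jobs, roughly like this (a sketch only;
stream and writeToExternalStore are placeholders):

stream.foreachRDD { (rdd, time) =>
  rdd.cache()                                                 // compute the batch once
  rdd.saveAsTextFile(s"/events/batch-${time.milliseconds}")   // output action 1
  rdd.foreachPartition(writeToExternalStore)                  // output action 2, a second job
  rdd.unpersist()
}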


-- 
RGRDZ Harut


Re: RDD Persistence synchronization

2015-03-29 Thread Harut Martirosyan
Thanks to you again, Sean.

The thing is that we persist and count that RDD in the hope that later
actions on it won't trigger recalculation. It's not really about
performance here; it's because the recalculation includes UUID generation,
and the UUIDs should stay the same for all further actions.

I understand that the RDD concept is based on lineage, and that kind of
contradicts our goal, but is there any way to guarantee that it's persisted,
or to make it fail when persisting fails?
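
To make the pattern concrete, here is roughly what we do (a sketch only, assuming an
existing SparkContext named sc):

import java.util.UUID

val rdd = sc.parallelize(1 to 1000).map(x => (x, UUID.randomUUID().toString))
rdd.persist()  // only marks the RDD for caching; nothing is computed yet
rdd.count()    // action: materializes the RDD, caching blocks as they are computed
// Later actions normally read the cached blocks, so the UUIDs stay fixed --
// unless a cached block is lost and has to be recomputed from lineage:
val lengths = rdd.mapValues(_.length)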

On 29 March 2015 at 12:51, Sean Owen so...@cloudera.com wrote:

 persist() completes immediately since it only marks the RDD for
 persistence. count() triggers computation of the RDD, and as the RDD is
 computed it will be persisted. The following transform should
 therefore only start after count(), and therefore after the persistence
 completes. I think there might be corner cases where you still see
 some of the RDD recomputed, e.g. if a persisted block is lost or otherwise
 unavailable later.

 On Sun, Mar 29, 2015 at 9:07 AM, Harut Martirosyan
 harut.martiros...@gmail.com wrote:
  Hi.
 
  rdd.persist()
  rdd.count()
 
  rdd.transform()...
 
  is there a chance transform() runs before persist() is complete?
 
  --
  RGRDZ Harut




-- 
RGRDZ Harut


RDD Persistence synchronization

2015-03-29 Thread Harut Martirosyan
Hi.

rdd.persist()
rdd.count()

rdd.transform()...

is there a chance transform() runs before persist() is complete?

-- 
RGRDZ Harut


Re: Parallel actions from driver

2015-03-27 Thread Harut Martirosyan
This is exactly my case as well; it worked, thanks Sean.
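
Roughly, the shape of what ended up working, following Sean's .par suggestion quoted
below (a sketch only; events, saveAsParquetFile, and sqlContext are the names from the
quoted code):

List(1, 2, 3).par.foreach { t =>
  val commons = events.filter(_._1 == t).map(_._2.common)
  saveAsParquetFile(sqlContext, commons, s"$t/common")
}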

On 26 March 2015 at 23:35, Sean Owen so...@cloudera.com wrote:

 You can do this much more simply, I think, with Scala's parallel
 collections (try .par). There's nothing wrong with doing this, no.

 Here, something is getting caught in your closure, maybe
 unintentionally, that's not serializable. It's not directly related to
 the parallelism.

 On Thu, Mar 26, 2015 at 3:54 PM, Aram Mkrtchyan
 aram.mkrtchyan...@gmail.com wrote:
  Hi.
 
  I'm trying to trigger DataFrame's save method in parallel from my driver.
  For that purpose I use an ExecutorService and Futures; here's my code:
 
 
  val futures = List(1, 2, 3).map(t => pool.submit(new Runnable {
    override def run(): Unit = {
      val commons = events.filter(_._1 == t).map(_._2.common)
      saveAsParquetFile(sqlContext, commons, s"$t/common")
      EventTypes.all.foreach { et =>
        val eventData = events
          .filter(ev => ev._1 == t && ev._2.eventType == et)
          .map(_._2.data)
        saveAsParquetFile(sqlContext, eventData, s"$t/$et")
      }
    }
  }))
  futures.foreach(_.get)
 
  It throws a "Task not serializable" exception. Is it legal to use threads
  in the driver to trigger actions?





-- 
RGRDZ Harut


Standalone Scheduler VS YARN Performance

2015-03-24 Thread Harut Martirosyan
What is the performance overhead caused by YARN, and what configurations
change when the app is run through YARN?

The following example:

sqlContext.sql("""
  SELECT dayStamp(date), count(distinct deviceId) AS c
  FROM full
  GROUP BY dayStamp(date)
  ORDER BY c DESC
  LIMIT 10
""").collect()

runs in the shell when we use the standalone scheduler:
./spark-shell --master spark://sparkmaster:7077 --executor-memory 20g \
  --executor-cores 10 --driver-memory 10g --num-executors 8

and fails, due to losing executors, when we run it through YARN:
./spark-shell --master yarn-client --executor-memory 20g --executor-cores 10 \
  --driver-memory 10g --num-executors 8

There are no evident logs, just messages that executors are being lost, and
connection-refused errors (apparently due to the executor failures).
The cluster is the same: 8 nodes, 64 GB RAM each.
The data format is Parquet.

-- 
RGRDZ Harut


Spark SQL: Day of month from Timestamp

2015-03-24 Thread Harut Martirosyan
Hi guys.

Basically, we had to define a UDF that does that; is there a built-in
function that we can use for it?
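
For context, the UDF is along these lines (a sketch only, using Spark 1.3's
sqlContext.udf.register; the table and column names are made up):

sqlContext.udf.register("dayOfMonth", (ts: java.sql.Timestamp) => {
  val cal = java.util.Calendar.getInstance()
  cal.setTime(ts)
  cal.get(java.util.Calendar.DAY_OF_MONTH)
})
// e.g. sqlContext.sql("SELECT dayOfMonth(eventTime) FROM events")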

-- 
RGRDZ Harut


Re: Visualizing Spark Streaming data

2015-03-20 Thread Harut Martirosyan
Hey Jeffrey.
Thanks for the reply.

I already have something similar: I use Grafana and Graphite, and for
simple metric streaming we've got everything set up right.

My question is about interactive patterns: for instance, dynamically choosing
an event to monitor, dynamically choosing a group-by field or any sort of
filter, and then viewing the results. This is easy when you have one user, but
if you have a team of analysts all specifying their own criteria, it becomes
hard to manage them all.

On 20 March 2015 at 12:02, Jeffrey Jedele jeffrey.jed...@gmail.com wrote:

 Hey Harut,
 I don't think there'll be any general practices, as this part heavily
 depends on your environment, skills, and what you want to achieve.

 If you don't have a general direction yet, I'd suggest you have a look
 at Elasticsearch+Kibana. It's very easy to set up and powerful, and therefore
 gets a lot of traction currently.

 Regards,
 Jeff

 2015-03-20 8:43 GMT+01:00 Harut harut.martiros...@gmail.com:

 I'm trying to build a dashboard to visualize a stream of events coming from
 mobile devices.
 For example, I have an event called add_photo, from which I want to calculate
 trending tags for added photos over the last x minutes. Then I'd like to
 aggregate that by country, etc. I've built the streaming part, which reads
 from Kafka, calculates the needed results, and produces the appropriate RDDs;
 the question is now how to connect it to a UI.

 Are there any general practices on how to pass parameters to Spark from some
 custom-built UI, how to organize data retrieval, what intermediate storage
 to use, etc.?

 Thanks in advance.









-- 
RGRDZ Harut


Re: Visualizing Spark Streaming data

2015-03-20 Thread Harut Martirosyan
But it requires all possible combinations of your filters as separate
metrics; moreover, it can only show time-based information, so you cannot
group by, say, country.

On 20 March 2015 at 19:09, Irfan Ahmad ir...@cloudphysics.com wrote:

 Grafana allows pretty slick interactive use patterns, especially with
  Graphite as the back-end. In a multi-user environment, why not have each
 user just build their own independent dashboards and name them under some
 simple naming convention?


 *Irfan Ahmad*
 CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com
 Best of VMworld Finalist
 Best Cloud Management Award
 NetworkWorld 10 Startups to Watch
 EMA Most Notable Vendor

 On Fri, Mar 20, 2015 at 1:06 AM, Harut Martirosyan 
 harut.martiros...@gmail.com wrote:

 Hey Jeffrey.
  Thanks for the reply.

  I already have something similar: I use Grafana and Graphite, and for
  simple metric streaming we've got everything set up right.

  My question is about interactive patterns: for instance, dynamically
  choosing an event to monitor, dynamically choosing a group-by field or any
  sort of filter, and then viewing the results. This is easy when you have one
  user, but if you have a team of analysts all specifying their own criteria,
  it becomes hard to manage them all.

 On 20 March 2015 at 12:02, Jeffrey Jedele jeffrey.jed...@gmail.com
 wrote:

 Hey Harut,
  I don't think there'll be any general practices, as this part heavily
  depends on your environment, skills, and what you want to achieve.

  If you don't have a general direction yet, I'd suggest you have a
  look at Elasticsearch+Kibana. It's very easy to set up and powerful, and
  therefore gets a lot of traction currently.

 Regards,
 Jeff

 2015-03-20 8:43 GMT+01:00 Harut harut.martiros...@gmail.com:

  I'm trying to build a dashboard to visualize a stream of events coming from
  mobile devices.
  For example, I have an event called add_photo, from which I want to calculate
  trending tags for added photos over the last x minutes. Then I'd like to
  aggregate that by country, etc. I've built the streaming part, which reads
  from Kafka, calculates the needed results, and produces the appropriate RDDs;
  the question is now how to connect it to a UI.

  Are there any general practices on how to pass parameters to Spark from some
  custom-built UI, how to organize data retrieval, what intermediate storage
  to use, etc.?

 Thanks in advance.









 --
 RGRDZ Harut





-- 
RGRDZ Harut