Re: spark architecture question -- Please Read

2017-01-28 Thread Sachin Naik
I strongly agree with Jorn and Russell. There are different solutions for data 
movement depending on your needs: frequency, bi-directional drivers, workflow, 
and handling of duplicate records. This space is known as "Change Data Capture" 
(CDC for short). I built some products in this space that extensively used 
connection pooling over ODBC/JDBC. 

Happy to chat if you need more information. 

-Sachin Naik

>> Hard to tell. Can you give more insights on what you are trying to achieve 
>> and what the data is about?
>> For example, depending on your use case, Sqoop may or may not make sense.
Sent from my iPhone

> On Jan 27, 2017, at 11:22 PM, Russell Spitzer wrote:
> 
> You can treat Oracle as a JDBC source 
> (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases),
> skip Sqoop and the Hive tables, and go straight to queries. Then you can skip 
> Hive on the way back out (see the same link) and write directly to Oracle. 
> I'll leave the performance questions for someone else. 
> 
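> A minimal sketch of that round trip (Spark 2.x style; the connection URL, 
> table names, and credentials are placeholders):
> 
>   import java.util.Properties
>   import org.apache.spark.sql.SparkSession
> 
>   val spark = SparkSession.builder().appName("OracleRoundTrip").getOrCreate()
> 
>   val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"        // placeholder
>   val props = new Properties()
>   props.setProperty("user", "app_user")                   // placeholder
>   props.setProperty("password", "app_password")           // placeholder
>   props.setProperty("driver", "oracle.jdbc.OracleDriver")
> 
>   // Read straight from Oracle -- no Sqoop import, no Hive staging table
>   val src = spark.read.jdbc(url, "SRC_TABLE", props)
> 
>   // ...Spark SQL / UDF transformations go here...
>   val result = src.where("AMOUNT > 0")
> 
>   // Write straight back to Oracle -- no Sqoop export
>   result.write.mode("append").jdbc(url, "DEST_TABLE", props)
> 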
>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu  wrote:
>> 
>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu  wrote:
>> Hi Team,
>> 
>> Right now our existing flow is:
>> 
>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) --> 
>> destination Hive table --> Sqoop export to Oracle
>> 
>> Half of the required Hive UDFs are developed as Java UDFs.
>> 
>> So now I want to know: will there be any performance difference if I run 
>> native Scala UDFs in Spark SQL instead of the Hive Java UDFs?
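>> 
>> (To illustrate, by a native Scala UDF I mean something along these lines -- 
>> the function, column, table, and class names are just placeholders:)
>> 
>>   // Register a Scala function for use from Spark SQL (HiveContext-style API)
>>   hiveContext.udf.register("trimUpper",
>>     (s: String) => if (s == null) null else s.trim.toUpperCase)
>>   val scalaOut = hiveContext.sql("SELECT trimUpper(name) FROM src_table")
>> 
>>   // versus the existing Hive Java UDF, registered like:
>>   hiveContext.sql(
>>     "CREATE TEMPORARY FUNCTION trimUpperHive AS 'com.example.hive.TrimUpperUDF'")
>>   val hiveOut = hiveContext.sql("SELECT trimUpperHive(name) FROM src_table")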
>> 
>> 
>> Can we skip the Sqoop import and export steps and instead load data directly 
>> from Oracle into Spark, write Scala UDFs for the transformations, and export 
>> the output back to Oracle?
>> 
>> Right now the architecture we are using is:
>> 
>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL --> 
>> Hive --> Oracle
>> 
>> What would be the optimal architecture to process data from Oracle using 
>> Spark? Can I improve this process in any way?
>> 
>> Regards,
>> Sirisha 
>> 


Re: Design patterns for Spark implementation

2016-12-08 Thread Sachin Naik
Not sure if you are aware of these:

1) EdX/Berkeley/Databricks have three Spark-related certifications. Might be a 
good start. 

2) A fair understanding of Scala and distributed collection patterns will help 
you better appreciate the internals of Spark. Coursera has three Scala courses. 
I know there are other language bindings; the EdX course goes into great detail 
on those. 

3) The book Advanced Analytics with Spark. 

--sachin

Sent from my iPhone

> On Dec 8, 2016, at 11:38 AM, Peter Figliozzi  wrote:
> 
> Keep in mind that Spark is a parallel computing engine; it does not change 
> your data infrastructure/data architecture.  These days it's relatively 
> convenient to read data from a variety of sources (S3, HDFS, Cassandra, ...) 
> and ditto on the output side.  
> 
> For example, in one of my use cases I store tens of gigabytes of time-series 
> data in Cassandra.  It just so happens I like to analyze all of it at once 
> using Spark, which writes a very nice, small text-file table of results that I 
> look at using Python/Pandas, in a Jupyter notebook, on a laptop. 
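> 
> Concretely, that pattern looks roughly like this (a sketch, assuming the 
> spark-cassandra-connector is on the classpath and a SparkSession named spark; 
> the keyspace, table, and column names are made up):
> 
>   import org.apache.spark.sql.functions.{avg, max}
> 
>   // Read the time-series table through the Cassandra connector
>   val ts = spark.read
>     .format("org.apache.spark.sql.cassandra")
>     .options(Map("keyspace" -> "metrics", "table" -> "readings"))
>     .load()
> 
>   // Analyze all of it at once, ending with a small result table
>   val summary = ts.groupBy("sensor_id").agg(avg("value"), max("value"))
> 
>   // One small CSV file, easy to open with Pandas on a laptop
>   summary.coalesce(1).write.option("header", "true").csv("/tmp/summary")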
> 
> If we didn't have Spark, I'd still be doing the input side (Cassandra) and 
> output side (small text file, ingestible by a laptop) the same way.  The only 
> difference would be, instead of importing and processing in Spark, my 
> fictional group of 5,000 assistants would each download a portion of the data 
> into their Excel spreadsheet, then have a big meeting to produce my small 
> text file.
> 
> So my view is the nature of your data and specific objectives determine your 
> infrastructure and architecture, not the presence or absence of Spark.
> 
>> On Sat, Dec 3, 2016 at 10:59 AM, Vasu Gourabathina wrote:
>> Hi,
>> 
>> I know this is a broad question. If this is not the right forum, I would 
>> appreciate it if you could point me to other sites/areas that may be helpful.
>> 
>> Before posing this question, I did use our friend Google, but sifting the 
>> query results down to my particular needs hasn't been easy.
>> 
>> Who I am: 
>>   - Have done data processing and analytics, but am relatively new to the 
>> Spark world
>> 
>> What I am looking for:
>>   - Architecture/Design of a ML system using Spark
>>   - In particular, looking for best practices that can support/bridge both 
>> Engineering and Data Science teams
>> 
>> Engineering:
>>   - Build a system that meets typical engineering needs: data processing, 
>> scalability, reliability, availability, fault tolerance, etc.
>>- System monitoring etc.
>> Data Science:
>>   - Build a system for the Data Science team to do data exploration activities
>>   - Develop models using supervised learning and tweak them
>> 
>> Data:
>>   - Batch and incremental updates - mostly structured or semi-structured 
>> (some data from transaction systems, weblogs, click stream etc.)
>>   - Streaming in the near term, but not to begin with
>> 
>> Data Storage:
>>   - Data is expected to grow on a daily basis, so the system should be able 
>> to support and handle big data
>>   - After further analysis, there may be a need to archive some of the 
>> data; it all depends on how the ML models are built and how results are 
>> stored/used in the future
>> 
>> Data Analysis:
>>   - Obvious data related aspects, such as data cleansing, data 
>> transformation, data partitioning etc
>>   - We may run models on windows of data, for example the last 1 year, 
>> 2 years, etc.
>> 
>> ML models:
>>   - Ability to store model versions and previous results
>>   - Compare results of different variants of models
>>  
>> Consumers:
>>   - RESTful webservice clients to look at the results
>> 
>> So, the questions I have are:
>> 1) Are there architectural and design patterns I can use based on industry 
>> best practices? In particular:
>>   - data ingestion
>>   - data storage (e.g., whether to go with HDFS or not)
>>   - data partitioning, especially in Spark world
>>   - running parallel ML models and combining results etc.
>>   - consumption of final results by clients (e.g., by pushing results 
>> to Cassandra, NoSQL DBs, etc.)
>> 
>> Again, I know this is a broad question... Pointers to best practices in some 
>> of the areas, if not all, would be highly appreciated. I am open to purchasing 
>> any books that may have relevant information.
>> 
>> Thanks much folks,
>> Vasu.
>> 
> 


Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Sachin Naik
I agree with Sean. Using VirtualBox on Windows with a Linux VM is a lot easier 
than trying to circumvent the Cygwin oddities; a lot of functionality might not 
work in Cygwin, and you will end up writing back-patches. Unless there is a 
compelling reason, Cygwin support seems unnecessary. 


@sachinnaik from iphone


On Jul 28, 2015, at 1:25 PM, Sean Owen  wrote:

> That's for the Windows interpreter rather than bash-running Cygwin. I
> don't know that it's worth doing a lot of legwork for Cygwin, but, if it's
> really just a few lines of classpath translation in one script, seems
> reasonable.
> 
> On Tue, Jul 28, 2015 at 9:13 PM, Steve Loughran wrote:
>> 
>> there's a spark-submit.cmd file for Windows. Does that work?
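>> 
>> For example, from a Windows command prompt (the jar path and class name are 
>> placeholders):
>> 
>>   bin\spark-submit.cmd --class com.example.MyApp --master local[*] C:\jars\app.jar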
>> 
>> On 27 Jul 2015, at 21:19, Proust GZ Feng  wrote:
>> 
>> Hi, Spark Users
>> 
>> Looks like Spark 1.4.0 cannot work with Cygwin due to the removal of Cygwin
>> support in bin/spark-class.
>> 
>> The changeset is
>> https://github.com/apache/spark/commit/517975d89d40a77c7186f488547eed11f79c1e97#diff-fdf4d3e600042c63ffa17b692c4372a3
>> 
>> The changeset said "Add a library for launching Spark jobs
>> programmatically", but how do I use it in Cygwin?
>> I'm wondering whether any solutions are available to make it work on Windows.
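>> 
>> For reference, that launcher library is driven from plain Java/Scala code, 
>> roughly like this (a sketch; the paths and class name are placeholders):
>> 
>>   import org.apache.spark.launcher.SparkLauncher
>> 
>>   // Launches spark-submit as a child process
>>   val proc = new SparkLauncher()
>>     .setSparkHome("/path/to/spark")      // or rely on the SPARK_HOME env var
>>     .setAppResource("/path/to/app.jar")
>>     .setMainClass("com.example.MyApp")
>>     .setMaster("local[*]")
>>     .launch()
>>   proc.waitFor()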
>> 
>> 
>> Thanks
>> Proust
> 
