Re: quickly counting the number of rows in a partition?

2015-01-14 Thread Michael Segel
Sorry, but the accumulator is still going to require you to walk through the 
RDD to get an accurate count, right? 
It's not being persisted? 

On Jan 14, 2015, at 5:17 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote:

 An alternative to doing a naive toArray is to declare an accumulator per 
 partition and use that. That's specifically what accumulators were designed to do. See 
 the programming guide.
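 
 For illustration, a minimal sketch of the accumulator approach, assuming the Spark 1.x 
 accumulator API and an existing SparkContext sc; it uses a single accumulator incremented 
 on the executors (a per-partition variant would register one accumulator per partition 
 index), and the names are made up:
 
 // Count rows with an accumulator, incremented on the executors
 // as each partition is walked.
 val acc = sc.accumulator(0L, "rowCount")
 rdd.foreachPartition { iter =>
   iter.foreach(_ => acc += 1L)
 }
 // The value is only reliable on the driver after the action completes.
 println(acc.value)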
 
 
 
 
 
 -Original Message-
 From: Tobias Pfeiffer [t...@preferred.jp]
 Sent: Tuesday, January 13, 2015 08:06 PM Eastern Standard Time
 To: Kevin Burton
 Cc: Ganelin, Ilya; user@spark.apache.org
 Subject: Re: quickly counting the number of rows in a partition?
 
 Hi,
 
 On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya ilya.gane...@capitalone.com 
 wrote:
 Use the mapPartitions function. It returns an iterator to each partition. 
 Then just get that length by converting to an array.
  
 On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton bur...@spinn3r.com wrote:
 Doesn’t that just read in all the values?  The count isn’t pre-computed? It’s 
 not the end of the world if it’s not, but it would be faster if it were.
 
 Well, converting to an array may not work due to memory constraints, 
 counting the items in the iterator may be better. However, there is no 
 pre-computed value. For counting, you need to compute all values in the 
 RDD, in general. If you think of
 
 items.map(x => /* throw exception */).count()
 
 then even though the count you want to get does not necessarily require the 
 evaluation of the function in map() (i.e., the number is the same), you may 
 not want to get the count if that code actually fails.
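 
 For illustration, a minimal sketch of counting rows per partition without materializing 
 each partition as an array (the iterator is consumed, but only a counter is kept; "rdd" 
 is whatever RDD you want to inspect):
 
 val rowsPerPartition = rdd
   .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
   .collect()
 rowsPerPartition.foreach { case (idx, n) => println(s"partition $idx: $n rows") }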
 
 Tobias
 



Re: HW imbalance

2015-01-29 Thread Michael Segel
@Sandy, 

There are two issues. 
The spark context (executor) and then the cluster under YARN. 

If you have a box where each YARN job needs 3GB and your machine has 36GB 
dedicated as a YARN resource, you can run 12 executors on that single node. 
If you have a box that has 72GB dedicated to YARN, you can run up to 24 
contexts (executors) in parallel. 

Assuming that you’re not running any other jobs. 

The larger issue is whether your version of Hadoop will easily let you run with 
multiple profiles or not. Ambari (1.6 and earlier) does not. It’s supposed to be 
fixed in 1.7 but I haven’t evaluated it yet. 
Cloudera? YMMV

If I understood the question raised by the OP, it’s more about a heterogeneous 
cluster than about Spark.

-Mike

On Jan 26, 2015, at 5:02 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Hi Antony,
 
 Unfortunately, all executors for any single Spark application must have the 
 same amount of memory.  It's possible to configure YARN with different 
 amounts of memory for each host (using yarn.nodemanager.resource.memory-mb), 
 so other apps might be able to take advantage of the extra memory.
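 
 For illustration, a minimal sketch of what this looks like from the application side: 
 executor memory is a single application-wide setting on the SparkConf, so every executor 
 the app gets is the same size (the values here are made up):
 
 val conf = new org.apache.spark.SparkConf()
   .setAppName("uniform-executors-sketch")
   .set("spark.executor.memory", "3g")   // applies to every executor in this application
   .set("spark.executor.cores", "2")
 val sc = new org.apache.spark.SparkContext(conf)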
 
 -Sandy
 
 On Mon, Jan 26, 2015 at 8:34 AM, Michael Segel msegel_had...@hotmail.com 
 wrote:
 If you’re running YARN, then you should be able to mix and match where YARN is 
 managing the resources available on the node. 
 
 Having said that… it depends on which version of Hadoop/YARN. 
 
 If you’re running Hortonworks and Ambari, then setting up multiple profiles 
 may not be straightforward. (I haven’t seen the latest version of Ambari.) 
 
 So in theory, one profile would be for your smaller 36GB RAM machines and one 
 profile for your 128GB machines. 
 Then as you request resources for your Spark job, it should schedule the 
 jobs based on the cluster’s available resources. 
 (At least in theory.  I haven’t tried this so YMMV) 
 
 HTH
 
 -Mike
 
 On Jan 26, 2015, at 4:25 PM, Antony Mayi antonym...@yahoo.com.INVALID wrote:
 
 Should have said I am running as yarn-client. All I can see is specifying 
 the generic executor memory that is then used in all containers.
 
 
 On Monday, 26 January 2015, 16:48, Charles Feduke charles.fed...@gmail.com 
 wrote:
 
 
 You should look at using Mesos. This should abstract away the individual 
 hosts into a pool of resources and make the different physical 
 specifications manageable.
 
 I haven't tried configuring Spark Standalone mode to have different specs on 
 different machines but based on spark-env.sh.template:
 
 # - SPARK_WORKER_CORES, to set the number of cores to use on this machine
 # - SPARK_WORKER_MEMORY, to set how much total memory workers have to give 
 executors (e.g. 1000m, 2g)
 # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. 
 -Dx=y)
 it looks like you should be able to mix. (It’s not clear to me whether 
 SPARK_WORKER_MEMORY is uniform across the cluster or for the machine where 
 the config file resides.)
 
 On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi antonym...@yahoo.com.invalid 
 wrote:
 Hi,
 
 Is it possible to mix hosts with (significantly) different specs within a 
 cluster (without wasting the extra resources)? For example, having 10 nodes 
 with 36GB RAM/10 CPUs, now trying to add 3 hosts with 128GB/10 CPUs - is there 
 a way to utilize the extra memory with Spark executors (as my understanding is 
 that all Spark executors must have the same memory)?
 
 thanks,
 Antony.
 
 
 
 



Re: HW imbalance

2015-01-30 Thread Michael Segel
Sorry, but I think there’s a disconnect. 

When you launch a job under YARN on any of the hadoop clusters, the number of 
mappers/reducers is not set and is dependent on the amount of available 
resources. 
So under Ambari, CM, or MapR’s Admin, you should be able to specify the amount 
of resources available on any node which is to be allocated to YARN’s RM. 
So if your node has 32GB allocated, you can run N jobs concurrently based on 
the amount of resources you request when you submit your application. 

If you have 64GB allocated, you can run up to 2N jobs concurrently based on the 
same memory constraints. 

In terms of job scheduling, where and when a job can run is going to be based 
on available resources.  So if you want to run a job that needs 16GB of 
resources, and all of your nodes are busy and only have 4GB per node available 
to YARN, your 16GB job will wait until at least that much is available.   

To your point, if you say you need 4GB per task, then it must be the same per 
task for that job. The larger the cluster node, in this case memory, the more 
jobs you can run. 

This is of course assuming you could oversubscribe a node in terms of CPU 
cores if you have memory available. 
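
For illustration, a back-of-the-envelope sketch of that scheduling arithmetic (all 
numbers are made up and the overhead figure is approximate):

val nodeManagerMemoryMb = 36 * 1024   // yarn.nodemanager.resource.memory-mb
val executorMemoryMb    = 4 * 1024    // spark.executor.memory, same for every executor
val overheadMb          = 512         // spark.yarn.executor.memoryOverhead (approx.)
val executorsPerNode    = nodeManagerMemoryMb / (executorMemoryMb + overheadMb)
println(s"Executors that fit on one 36GB node: $executorsPerNode")  // 8 here; a 72GB node fits ~16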

YMMV

HTH
-Mike

On Jan 30, 2015, at 7:10 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 My answer was based off the specs that Antony mentioned: different amounts of 
 memory, but 10 cores on all the boxes.  In that case, a single Spark 
 application's homogeneously sized executors won't be able to take advantage 
 of the extra memory on the bigger boxes.
 
 Cloudera Manager can certainly configure YARN with different resource 
 profiles for different nodes if that's what you're wondering.
 
 -Sandy
 
 On Thu, Jan 29, 2015 at 11:03 PM, Michael Segel msegel_had...@hotmail.com 
 wrote:
 @Sandy, 
 
 There are two issues. 
 The spark context (executor) and then the cluster under YARN. 
 
 If you have a box where each yarn job needs 3GB,  and your machine has 36GB 
 dedicated as a YARN resource, you can run 12 executors on the single node. 
 If you have a box that has 72GB dedicated to YARN, you can run up to 24 
 contexts (executors) in parallel. 
 
 Assuming that you’re not running any other jobs. 
 
 The larger issue is whether your version of Hadoop will easily let you run with 
 multiple profiles or not. Ambari (1.6 and earlier) does not. It’s supposed to be 
 fixed in 1.7 but I haven’t evaluated it yet. 
 Cloudera? YMMV
 
 If I understood the question raised by the OP, it’s more about a heterogeneous 
 cluster than spark.
 
 -Mike
 
 On Jan 26, 2015, at 5:02 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 
 Hi Antony,
 
 Unfortunately, all executors for any single Spark application must have the 
 same amount of memory.  It's possible to configure YARN with different 
 amounts of memory for each host (using yarn.nodemanager.resource.memory-mb), 
 so other apps might be able to take advantage of the extra memory.
 
 -Sandy
 
 On Mon, Jan 26, 2015 at 8:34 AM, Michael Segel msegel_had...@hotmail.com 
 wrote:
 If you’re running YARN, then you should be able to mix and match where YARN is 
 managing the resources available on the node. 
 
 Having said that… it depends on which version of Hadoop/YARN. 
 
 If you’re running Hortonworks and Ambari, then setting up multiple profiles 
 may not be straightforward. (I haven’t seen the latest version of Ambari.) 
 
 So in theory, one profile would be for your smaller 36GB RAM machines and one 
 profile for your 128GB machines. 
 Then as you request resources for your Spark job, it should schedule the 
 jobs based on the cluster’s available resources. 
 (At least in theory.  I haven’t tried this so YMMV) 
 
 HTH
 
 -Mike
 
 On Jan 26, 2015, at 4:25 PM, Antony Mayi antonym...@yahoo.com.INVALID 
 wrote:
 
 Should have said I am running as yarn-client. All I can see is specifying 
 the generic executor memory that is then used in all containers.
 
 
 On Monday, 26 January 2015, 16:48, Charles Feduke 
 charles.fed...@gmail.com wrote:
 
 
 You should look at using Mesos. This should abstract away the individual 
 hosts into a pool of resources and make the different physical 
 specifications manageable.
 
 I haven't tried configuring Spark Standalone mode to have different specs 
 on different machines but based on spark-env.sh.template:
 
 # - SPARK_WORKER_CORES, to set the number of cores to use on this machine
 # - SPARK_WORKER_MEMORY, to set how much total memory workers have to give 
 executors (e.g. 1000m, 2g)
 # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. 
 -Dx=y)
 it looks like you should be able to mix. (It’s not clear to me whether 
 SPARK_WORKER_MEMORY is uniform across the cluster or for the machine where 
 the config file resides.)
 
 On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi antonym...@yahoo.com.invalid 
 wrote:
 Hi,
 
 is it possible to mix hosts with (significantly) different specs within a 
 cluster (without

Re: Spark or Storm

2015-06-17 Thread Michael Segel
Actually the reverse.

Spark Streaming is really a micro-batch system where the smallest practical batch 
window is about half a second (500ms). 
So for CEP, it’s not really a good idea. 

So in terms of options… Spark Streaming, Storm, Samza, Akka and others… 

Storm is probably the easiest to pick up; Spark Streaming / Akka may give you 
more flexibility, and Akka would work for CEP. 

Just my $0.02
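
For illustration, a minimal sketch of the micro-batch model, assuming a 1-second batch 
interval and a 1-minute sliding aggregation window; the socket source and field layout 
are made up:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("micro-batch-sketch")
val ssc = new StreamingContext(conf, Seconds(1))        // batch interval discretizes the stream
val events = ssc.socketTextStream("localhost", 9999)    // placeholder source
val counts = events
  .map(line => (line.split(",")(0), 1L))                // key by the first field
  .reduceByKeyAndWindow(_ + _, Minutes(1), Seconds(10)) // 1-minute window, sliding every 10s
counts.print()
ssc.start()
ssc.awaitTermination()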

 On Jun 16, 2015, at 9:40 PM, Spark Enthusiast sparkenthusi...@yahoo.in 
 wrote:
 
 I have a use-case where a stream of incoming events has to be aggregated and 
 joined to create complex events. The aggregation will have to happen at an 
 interval of 1 minute (or less).
 
 The pipeline is:
 Upstream services --(send events)--> KAFKA --> Event Stream Processor --(enrich event)--> Complex Event Processor --> Elastic Search.
 
 From what I understand, Storm will make a very good ESP and Spark Streaming 
 will make a good CEP.
 
 But, we are also evaluating Storm with Trident.
 
 How does Spark Streaming compare with Storm with Trident?
 
 Sridhar Chellappa
 
 
 
  
 
 
 
 On Wednesday, 17 June 2015 10:02 AM, ayan guha guha.a...@gmail.com wrote:
 
 
 I have a similar scenario where we need to bring data from Kinesis to HBase. 
 Data velocity is 20k per 10 mins. A little manipulation of the data will be 
 required, but that's regardless of the tool, so we will be writing that piece 
 as a Java POJO.
 All environments are on AWS. HBase is on a long-running EMR cluster and Kinesis 
 on a separate cluster.
 TIA.
 Best
 Ayan
 On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com 
 mailto:wrbri...@gmail.com wrote:
 The programming models for the two frameworks are conceptually rather 
 different; I haven't worked with Storm for quite some time, but based on my 
 old experience with it, I would equate Spark Streaming more with Storm's 
 Trident API, rather than with the raw Bolt API. Even then, there are 
 significant differences, but it's a bit closer.
 
 If you can share your use case, we might be able to provide better guidance.
 
 Regards,
 Will
 
 On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com 
 mailto:asoni.le...@gmail.com wrote:
 
 Hi All,
 
 I am evaluating Spark vs. Storm (Spark Streaming) and I am not able to see 
 what the equivalent of Storm's Bolt is inside Spark.
 
 Any help will be appreciated on this ?
 
 Thanks ,
 Ashish
 
 
 
 
 



Re: TCP/IP speedup

2015-08-02 Thread Michael Segel
This may seem like a silly question… but in following Mark’s link, the 
presentation talks about the TPC-DS benchmark. 

Here’s my question… what benchmark results? 

If you go over to the TPC.org website (http://tpc.org/), they have no TPC-DS 
benchmarks listed 
(either audited or unaudited). 

So what gives? 

Note: There are TPCx-HS benchmarks listed… 

Thx

-Mike

 On Aug 1, 2015, at 5:45 PM, Mark Hamstra m...@clearstorydata.com wrote:
 
 https://spark-summit.org/2015/events/making-sense-of-spark-performance/
 
 On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus edel...@gmail.com 
 mailto:edel...@gmail.com wrote:
 Hi All!
 
 How important would a significant performance improvement to TCP/IP itself be, 
 in terms of 
 overall job performance? Which part would be most significantly 
 accelerated? 
 Would it be HDFS?
 
 -- ttfn
 Simon Edelhaus
 California 2015
 




Re: Research ideas using spark

2015-07-16 Thread Michael Segel
Ok… 

After having some off-line exchanges with Shashidhar Rao, I came up with an idea…

Apply machine learning to either implement or improve autoscaling up or down 
within a Storm/Akka cluster. 

While I don’t know what constitutes an acceptable PhD thesis or senior project 
for undergrads… this is a real-life problem that actually has some real value. 

First, Storm doesn’t scale down.  Unless there have been some improvements in the 
last year, you really can’t easily scale down the number of workers and 
transfer state to another worker. 
Looking at Akka, that would be an easier task because of the actor model. 
However, I don’t know Akka that well, so I can’t say if this is already 
implemented. 

So besides the mechanism to scale (up and down), you then have the issue of 
machine learning in terms of load and how to properly scale. 
This could be as simple as a PID function that watches the queues between 
spout/bolts and bolt/bolt, or something more advanced. This is where the 
research part of the project comes in. (What do you monitor, and how do you 
calculate and determine when to scale up or down, weighing in the cost(s) of 
the action of scaling.) 
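
For illustration, a minimal sketch of the PID idea; the monitored signal, setpoint and 
gains are all made-up placeholders, not something from an existing framework:

class PidController(kp: Double, ki: Double, kd: Double, setpoint: Double) {
  private var integral = 0.0
  private var lastError = 0.0

  // Returns a signed control signal: a sustained positive value suggests scaling up,
  // a sustained negative one suggests scaling down.
  def update(measuredQueueDepth: Double, dtSeconds: Double): Double = {
    val error = measuredQueueDepth - setpoint
    integral += error * dtSeconds
    val derivative = (error - lastError) / dtSeconds
    lastError = error
    kp * error + ki * integral + kd * derivative
  }
}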

Again, it’s a worthwhile project, something that actually has business value, 
especially in terms of Lambda and other groovy Greek-lettered names for cluster 
designs (Zeta? ;-) ), 
where you have both M/R (computational) and subjective real-time (including 
micro-batch) workloads occurring either on the same cluster or within the same DC 
infrastructure. 


Again, I don’t know if this is worthy of a PhD thesis, Masters thesis, or senior 
project, but it is something that one could sink one’s teeth into and that could 
potentially lead to a commercial-grade project if done properly. 

Good luck with it.

HTH 

-Mike




 On Jul 15, 2015, at 12:40 PM, vaquar khan vaquar.k...@gmail.com wrote:
 
 I would suggest studying Spark, Flink and Storm, and based on your understanding and 
 findings preparing your research paper.
 
 Maybe you will invent a new Spark ☺
 
 Regards, 
 Vaquar khan
 
 On 16 Jul 2015 00:47, Michael Segel msegel_had...@hotmail.com 
 mailto:msegel_had...@hotmail.com wrote:
 Silly question… 
 
 When thinking about a PhD thesis… do you want to tie it to a specific 
 technology or do you want to investigate an idea but then use a specific 
 technology. 
 Or is this an outdated way of thinking? 
 
 “I am doing my PhD thesis on large scale machine learning, e.g. online 
 learning, batch and mini-batch learning.”
 
 So before we look at technologies like Spark… could the OP break down a more 
 specific concept or idea that he wants to pursue? 
 
 Looking at what Jorn said… 
 
 Using machine learning to better predict workloads in terms of managing 
 clusters… This could be interesting… but is it enough for a PhD thesis, or of 
 interest to the OP? 
 
 
 On Jul 15, 2015, at 9:43 AM, Jörn Franke jornfra...@gmail.com 
 mailto:jornfra...@gmail.com wrote:
 
 Well one of the strength of spark is standardized general distributed 
 processing allowing many different types of processing, such as graph 
 processing, stream processing etc. The limitation is that it is less 
 performant than one system focusing only on one type of processing (eg graph 
 processing). I miss - and this may not be spark specific - some artificial 
 intelligence to manage a cluster, e.g. Predicting workloads, how long a job 
 may run based on previously executed similar jobs etc. Furthermore, many 
 optimizations you have to do manually, e.g. Bloom filters, partitioning etc 
 - if you find here as well some intelligence that does this automatically 
 based on previously executed jobs taking into account that optimizations 
 themselves change over time would be great... You may also explore feature 
 interaction
 
 Le mar. 14 juil. 2015 à 7:19, Shashidhar Rao raoshashidhar...@gmail.com 
 mailto:raoshashidhar...@gmail.com a écrit :
 Hi,
 
 I am doing my PHD thesis on large scale machine learning e.g  Online 
 learning, batch and mini batch learning.
 
 Could somebody help me with ideas especially in the context of Spark and to 
 the above learning methods. 
 
 Some ideas like improvement to existing algorithms, implementing new 
 features especially the above learning methods and algorithms that have not 
 been implemented etc.
 
 If somebody could help me with some ideas it would really accelerate my work.
 
 Plus few ideas on research papers regarding Spark or Mahout.
 
 Thanks in advance.
 
 Regards 
 
 



Re: spark streaming job to hbase write

2015-07-16 Thread Michael Segel
You ask an interesting question… 

Lets set aside spark, and look at the overall ingestion pattern. 

It’s really an ingestion pattern where your input into the system is from a 
queue. 

Are the events discrete or continuous? (This is kinda important.) 

If the events are continuous then more than likely you’re going to be ingesting 
data where the key is somewhat sequential. If you use put(), you end up with 
hot spotting. And you’ll end up with regions half full. 
So you would be better off batching up the data and doing bulk imports. 

If the events are discrete, then you’ll want to use put() because the odds are 
you will not be using a sequential key. (You could, but I’d suggest that you 
rethink your primary key) 

Depending on the rate of ingestion, you may want to do a manual flush. (It 
depends on the velocity of the data to be ingested and your use case.) 
(Remember what caching occurs, and where, when dealing with HBase.) 

A third option… Depending on how you use the data, you may want to avoid 
storing the data in HBase, and only use HBase as an index to where you store 
the data files for quick access.  Again it depends on your data ingestion flow 
and how you intend to use the data. 
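
For illustration, a minimal sketch of the put() path for discrete events from a streaming 
app, assuming an HBase 1.x client and a made-up DStream of (rowKey, value) string pairs 
named "stream"; the table and column names are invented:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

stream.foreachRDD { rdd =>
  rdd.foreachPartition { events =>
    // Open the connection on the executor, once per partition, not per record.
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("events"))
    events.foreach { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(value))
      table.put(put)
    }
    table.close()
    conn.close()
  }
}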

So really this is less a spark issue than an HBase issue when it comes to 
design. 

HTH

-Mike
 On Jul 15, 2015, at 11:46 AM, Shushant Arora shushantaror...@gmail.com 
 wrote:
 
 Hi
 
 I have a requirement of writing to an HBase table from a Spark streaming app after 
 some processing.
 Is the HBase put operation the only way of writing to HBase, or is there a 
 specialised connector or Spark RDD for HBase writes?
 
 Should bulk load to HBase from a streaming app be avoided if the output of each 
 batch interval is just a few MBs?
 
 Thanks
 






Re: Research ideas using spark

2015-07-15 Thread Michael Segel
Silly question… 

When thinking about a PhD thesis… do you want to tie it to a specific 
technology or do you want to investigate an idea but then use a specific 
technology. 
Or is this an outdated way of thinking? 

“I am doing my PhD thesis on large scale machine learning, e.g. online learning, 
batch and mini-batch learning.”

So before we look at technologies like Spark… could the OP break down a more 
specific concept or idea that he wants to pursue? 

Looking at what Jorn said… 

Using machine learning to better predict workloads in terms of managing 
clusters… This could be interesting… but is it enough for a PhD thesis, or of 
interest to the OP? 


 On Jul 15, 2015, at 9:43 AM, Jörn Franke jornfra...@gmail.com wrote:
 
 Well one of the strength of spark is standardized general distributed 
 processing allowing many different types of processing, such as graph 
 processing, stream processing etc. The limitation is that it is less 
 performant than one system focusing only on one type of processing (eg graph 
 processing). I miss - and this may not be spark specific - some artificial 
 intelligence to manage a cluster, e.g. Predicting workloads, how long a job 
 may run based on previously executed similar jobs etc. Furthermore, many 
 optimizations you have to do manually, e.g. Bloom filters, partitioning etc - 
 if you find here as well some intelligence that does this automatically based 
 on previously executed jobs taking into account that optimizations themselves 
 change over time would be great... You may also explore feature interaction
 
 Le mar. 14 juil. 2015 à 7:19, Shashidhar Rao raoshashidhar...@gmail.com 
 mailto:raoshashidhar...@gmail.com a écrit :
 Hi,
 
 I am doing my PHD thesis on large scale machine learning e.g  Online 
 learning, batch and mini batch learning.
 
 Could somebody help me with ideas especially in the context of Spark and to 
 the above learning methods. 
 
 Some ideas like improvement to existing algorithms, implementing new features 
 especially the above learning methods and algorithms that have not been 
 implemented etc.
 
 If somebody could help me with some ideas it would really accelerate my work.
 
 Plus few ideas on research papers regarding Spark or Mahout.
 
 Thanks in advance.
 
 Regards 




Re: Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
Thanks Dean… 

I was building based on the information found on the Spark 1.4.1 documentation. 

So I have to ask the following:

Shouldn’t the examples be updated to reflect Hadoop 2.6, or are the vendors’ 
distros not up to 2.6 and that’s why it’s still showing 2.4? 

Also I’m trying to build with support for Scala 2.11.  
Are there any known issues between Scala 2.11 and Hive and the Hive Thrift server? 

Dean, the reason I asked about needing to specify the Hive and Hive-Thriftserver 
options is that at the end of the build I see the following:
“
[INFO] Spark Project SQL .. SUCCESS [02:06 min]
[INFO] Spark Project ML Library ... SUCCESS [02:23 min]
[INFO] Spark Project Tools  SUCCESS [ 13.305 s]
[INFO] Spark Project Hive . SUCCESS [01:55 min]
[INFO] Spark Project REPL . SUCCESS [ 40.488 s]
[INFO] Spark Project YARN . SUCCESS [ 38.793 s]
[INFO] Spark Project Assembly . SUCCESS [01:10 min]
[INFO] Spark Project External Twitter . SUCCESS [ 14.907 s]
[INFO] Spark Project External Flume Sink .. SUCCESS [ 21.748 s]
[INFO] Spark Project External Flume ... SUCCESS [ 31.754 s]
[INFO] Spark Project External MQTT  SUCCESS [ 17.921 s]
[INFO] Spark Project External ZeroMQ .. SUCCESS [ 18.037 s]
[INFO] Spark Project External Kafka ... SUCCESS [ 41.941 s]
[INFO] Spark Project Examples . SUCCESS [01:56 min]
[INFO] Spark Project External Kafka Assembly .. SUCCESS [ 24.806 s]
[INFO] Spark Project YARN Shuffle Service . SUCCESS [  5.204 s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 22:40 min
[INFO] Finished at: 2015-07-20T12:54:23-07:00
[INFO] Final Memory: 109M/2332M
[INFO]
“
Granted this may be something completely different which is why the next time I 
do a build, I’m going to capture the stderr/stdout to a file. 

Thx for the quick response. 



 On Jul 20, 2015, at 1:11 PM, Ted Yu yuzhih...@gmail.com wrote:
 
 In master (as well as 1.4.1) I don't see hive profile in pom.xml
 
 I do find hive-provided profile, though.
 
 FYI
 
 On Mon, Jul 20, 2015 at 1:05 PM, Dean Wampler deanwamp...@gmail.com 
 mailto:deanwamp...@gmail.com wrote:
 hadoop-2.6 is supported (look for profile XML in the pom.xml file).
 
 For Hive, add -Phive -Phive-thriftserver  (See 
 http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables 
 http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables) 
 for more details.
 
 dean
 
 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition 
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com/
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com http://polyglotprogramming.com/
 
 On Mon, Jul 20, 2015 at 2:55 PM, Michael Segel msegel_had...@hotmail.com 
 mailto:msegel_had...@hotmail.com wrote:
 Sorry, 
 
 Should have sent this to user… 
 
 However… it looks like the docs page may need some editing? 
 
 Thx
 
 -Mike
 
 
 Begin forwarded message:
 
 From: Michael Segel msegel_had...@hotmail.com 
 mailto:msegel_had...@hotmail.com
 Subject: Silly question about building Spark 1.4.1
 Date: July 20, 2015 at 12:26:40 PM MST
 To: d...@spark.apache.org mailto:d...@spark.apache.org
 
 Hi, 
 
 I’m looking at the online docs for building spark 1.4.1 … 
 
 http://spark.apache.org/docs/latest/building-spark.html 
 http://spark.apache.org/docs/latest/building-spark.html 
 
 I was interested in building spark for Scala 2.11 (latest scala) and also 
 for Hive and JDBC support. 
 
 The docs say:
 “
 To produce a Spark package compiled with Scala 2.11, use the -Dscala-2.11 
 property:
  dev/change-version-to-2.11.sh
 mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
 “ 
 So… 
 Is there a reason I shouldn’t build against hadoop-2.6 ? 
 
 If I want to add the Thrift and Hive support, is it possible? 
 Looking at the Scala build, it looks like hive support is being built? 
 (Looking at the stdout messages…)
 Should the docs be updated? Am I missing something? 
 (Dean W. can confirm, I am completely brain dead. ;-) 
 
 Thx
 
 -Mike
 PS. Yes I can probably download a prebuilt image, but I’m a glutton for 
 punishment. ;-) 
 
 
 
 



Fwd: Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
Sorry, 

Should have sent this to user… 

However… it looks like the docs page may need some editing? 

Thx

-Mike


 Begin forwarded message:
 
 From: Michael Segel msegel_had...@hotmail.com
 Subject: Silly question about building Spark 1.4.1
 Date: July 20, 2015 at 12:26:40 PM MST
 To: d...@spark.apache.org
 
 Hi, 
 
 I’m looking at the online docs for building spark 1.4.1 … 
 
 http://spark.apache.org/docs/latest/building-spark.html 
 http://spark.apache.org/docs/latest/building-spark.html 
 
 I was interested in building spark for Scala 2.11 (latest scala) and also for 
 Hive and JDBC support. 
 
 The docs say:
 “
 To produce a Spark package compiled with Scala 2.11, use the -Dscala-2.11 
 property:
 dev/change-version-to-2.11.sh
 mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
 “ 
 So… 
 Is there a reason I shouldn’t build against hadoop-2.6 ? 
 
 If I want to add the Thrift and Hive support, is it possible? 
 Looking at the Scala build, it looks like hive support is being built? 
 (Looking at the stdout messages…)
 Should the docs be updated? Am I missing something? 
 (Dean W. can confirm, I am completely brain dead. ;-) 
 
 Thx
 
 -Mike
 PS. Yes I can probably download a prebuilt image, but I’m a glutton for 
 punishment. ;-) 
 



Re: Spark Job Server with Yarn and Kerberos

2016-01-04 Thread Michael Segel
It’s been a while... but this isn’t a Spark issue. 

A Spark job on YARN runs as a regular job. 
What happens when you run a regular M/R job as that user? 

I don’t think we did anything special...



> On Jan 4, 2016, at 12:22 PM, Mike Wright  > wrote:
> 
> Has anyone used Spark Job Server on a "kerberized" cluster in YARN-Client 
> mode? When Job Server contacts the YARN resource manager, we see a "Cannot 
> impersonate root" error and am not sure what we have misconfigured.
> 
> Thanks.
> 
> ___
> 
> Mike Wright
> Principal Architect, Software Engineering
> S Capital IQ and SNL
> 
> 434-951-7816 p
> 434-244-4466 f
> 540-470-0119 m
> 
> mwri...@snl.com 
> 
> 



Re: stopping a process using an RDD

2016-01-04 Thread Michael Segel
Not really a good idea. 

It breaks the paradigm. 

If I understand the OP’s idea… they want to halt processing the RDD, but not 
the entire job. 
So when it hits a certain condition, it will stop that task yet continue on to 
the next RDD. (Assuming you have more RDDs or partitions than you have task 
’slots’)  So if you fail enough RDDs, your job fails meaning you don’t get any 
results. 

The best you could do is a NOOP.  That is… if your condition is met on that 
RDD, your M/R job will not output anything to the collection so no more data is 
being added to the result set. 

The whole paradigm is to process the entire RDD at a time. 

You may spin cycles, but that’s not a really bad thing. 
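
For illustration, a minimal sketch of the exception-based abort Daniel describes in the 
quoted message below; StopProcessing, shouldStop and transform are made-up names, and 
with spark.task.maxFailures > 1 the task is retried before the stage finally fails:

class StopProcessing(msg: String) extends RuntimeException(msg)

val result = try {
  Some(rdd.map { x =>
    if (shouldStop(x)) throw new StopProcessing(s"condition met at $x")
    transform(x)
  }.collect())
} catch {
  // The driver sees the task failure wrapped in a SparkException; inspect it
  // to tell a deliberate abort apart from a genuine error.
  case e: org.apache.spark.SparkException if e.getMessage != null &&
      e.getMessage.contains("StopProcessing") => None
}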

HTH

-Mike

> On Jan 4, 2016, at 6:45 AM, Daniel Darabos  
> wrote:
> 
> You can cause a failure by throwing an exception in the code running on the 
> executors. The task will be retried (if spark.task.maxFailures > 1), and then 
> the stage is failed. No further tasks are processed after that, and an 
> exception is thrown on the driver. You could catch the exception and see if 
> it was caused by your own special exception.
> 
> On Mon, Jan 4, 2016 at 1:05 PM, domibd  > wrote:
> Hello,
> 
> Is there a way to stop a process (like map-reduce) using an RDD when a
> condition is met?
> 
> (this could be used if the process does not always need to
>  explore all of the RDD)
> 
> thanks
> 
> Dominique
> 
> 
> 
> 
> 
> 
> 



Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Michael Segel
Doh! It would help if I used the email address to send to the list… 


Hi, 

Lets take a step back… 

Which version of Hive? 

Hive recently added transaction support so you have to know your isolation 
level. 

Also are you running spark as your execution engine, or are you talking about a 
spark app running w a hive context and then you drop the table from within a 
Hive shell while the spark app is still running? 

And you also have two different things happening… you’re mixing a DDL with a 
query.  How does Hive know you have another app reading from the table? 
I mean, what happens when you try a select * from foo; and in another shell try 
dropping foo?  And if you want to simulate an M/R job, add something like an 
order by 1 clause. 

HTH

-Mike
> On Jun 8, 2016, at 2:36 PM, Michael Segel <mse...@segel.com> wrote:
> 
> Hi, 
> 
> Lets take a step back… 
> 
> Which version of Hive? 
> 
> Hive recently added transaction support so you have to know your isolation 
> level. 
> 
> Also are you running spark as your execution engine, or are you talking about 
> a spark app running w a hive context and then you drop the table from within 
> a Hive shell while the spark app is still running? 
> 
> You also have two different things happening… you’re mixing a DDL with a 
> query.  How does hive know you have another app reading from the table? 
> I mean what happens when you try a select * from foo; and in another shell 
> try dropping foo?  and if you want to simulate a m/r job add something like 
> an order by 1 clause. 
> 
> HTH
> 
> -Mike



Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Michael Segel

> On Jun 8, 2016, at 3:35 PM, Eugene Koifman  wrote:
> 
> if you split “create table test.dummy as select * from oraclehadoop.dummy;”
> into create table statement, followed by insert into test.dummy as select… 
> you should see the behavior you expect with Hive.
> Drop statement will block while insert is running.
> 
> Eugene
> 

OK, assuming that is true… 

Then the DDL statement is blocked because Hive sees the table in use. 

If you can confirm this to be the case with Hive, and you can still drop the table 
while the Spark job is running, then you would 
have a bug, since Spark in the Hive context either doesn’t set any locks or improperly 
sets locks. 
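
For illustration, a hypothetical way to reproduce this from a HiveContext, splitting the 
CTAS as Eugene suggests; it assumes an existing SparkContext sc and just reuses the table 
names from this thread:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("CREATE TABLE test.dummy LIKE oraclehadoop.dummy")
hc.sql("INSERT INTO TABLE test.dummy SELECT * FROM oraclehadoop.dummy")
// While the INSERT runs, issue "DROP TABLE test.dummy" from a Hive shell.
// With Hive's transactional locking the drop should block; if the HiveContext
// takes no lock, the drop succeeds and you have the suspected bug.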

I would have to ask which version of Hive you built Spark against.  
That could be another factor.

HTH

-Mike




Secondary Indexing?

2016-05-30 Thread Michael Segel
I’m not sure where to post this since it’s a bit of a philosophical question in 
terms of design and vision for Spark. 

If we look at SparkSQL and performance… where does Secondary indexing fit in? 

The reason this is a bit awkward is that if you view Spark as querying RDDs, 
which are temporary, indexing doesn’t make sense until you consider your use 
case and how long ‘temporary’ is.
Then if you consider your RDD result set could be based on querying tables… and 
you could end up with an inverted table as an index… then indexing could make 
sense. 

Does it make sense to discuss this in user or dev email lists? Has anyone given 
this any thought in the past? 

Thx

-Mike





Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
And you have MapR supporting Apache Drill. 

So these are all alternatives to Spark, and it’s not necessarily an either/or 
scenario. You can have both. 

> On May 30, 2016, at 12:49 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Yep, Hortonworks supports Tez for one reason or another, which I am hopefully going 
> to test as the query engine for Hive. Though I think Spark will 
> be faster because of its in-memory support.
> 
> Also if you are independent then you better off dealing with Spark and Hive 
> without the need to support another stack like Tez.
> 
> Cloudera support Impala instead of Hive but it is not something I have used. .
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 30 May 2016 at 20:19, Michael Segel <msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>> wrote:
> Mich, 
> 
> Most people use vendor releases because they need to have the support. 
> Hortonworks is the vendor who has the most skin in the game when it comes to 
> Tez. 
> 
> If memory serves, Tez isn’t going to be M/R but a local execution engine? 
> Then LLAP is the in-memory piece to speed up Tez? 
> 
> HTH
> 
> -Mike
> 
>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> thanks I think the problem is that the TEZ user group is exceptionally 
>> quiet. Just sent an email to Hive user group to see anyone has managed to 
>> built a vendor independent version.
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 29 May 2016 at 21:23, Jörn Franke <jornfra...@gmail.com 
>> <mailto:jornfra...@gmail.com>> wrote:
>> Well I think it is different from MR. It has some optimizations which you do 
>> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
>> 
>> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
>> integrated in the Hortonworks distribution. 
>> 
>> 
>> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>>> Hi Jorn,
>>> 
>>> I started building apache-tez-0.8.2 but got few errors. Couple of guys from 
>>> TEZ user group kindly gave a hand but I could not go very far (or may be I 
>>> did not make enough efforts) making it work.
>>> 
>>> That TEZ user group is very quiet as well.
>>> 
>>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>>> in-memory capability.
>>> 
>>> It would be interesting to see what version of TEZ works as execution 
>>> engine with Hive.
>>> 
>>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>>> Hive etc as I am sure you already know.
>>> 
>>> Cheers,
>>> 
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>>> 
>>> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com 
>>> <mailto:jornfra...@gmail.com>> wrote:
>>> Very interesting do you plan also a test with TEZ?
>>> 
>>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>>> 
>>>> Basically took the original table imported using Sqoop and created and 
>>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>>> as follows:
>>>> 
>>>>

Re: HiveContext standalone => without a Hive metastore

2016-05-30 Thread Michael Segel
Going from memory… Derby is/was Cloudscape, which IBM acquired from Informix, who 
bought the company way back when.  (Since IBM released it under Apache 
licensing, Sun Microsystems took it and created JavaDB…) 

I believe that there is a networking function so that you can either bring it 
up in stand-alone mode or in network mode, which allows simultaneous network 
connections (multi-user). 

If not, you can always go with MySQL.

HTH

> On May 26, 2016, at 1:36 PM, Mich Talebzadeh  
> wrote:
> 
> Well make sure than you set up a reasonable RDBMS as metastore. Ours is 
> Oracle but you can get away with others. Check the supported list in
> 
> hduser@rhes564:: :/usr/lib/hive/scripts/metastore/upgrade> ltr
> total 40
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 postgres
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 mysql
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 mssql
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 derby
> drwxr-xr-x 3 hduser hadoop 4096 May 20 18:44 oracle
> 
> you have few good ones in the list.  In general the base tables (without 
> transactional support) are around 55  (Hive 2) and don't take much space 
> (depending on the volume of tables). I attached a E-R diagram.
> 
> HTH
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 26 May 2016 at 19:09, Gerard Maas  > wrote:
> Thanks a lot for the advice!. 
> 
> I found out why the standalone hiveContext would not work:  it was trying to 
> deploy a derby db and the user had no rights to create the dir where there db 
> is stored:
> 
> Caused by: java.sql.SQLException: Failed to create database 'metastore_db', 
> see the next exception for details.
> 
>at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> 
>at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
> 
>... 129 more
> 
> Caused by: java.sql.SQLException: Directory 
> /usr/share/spark-notebook/metastore_db cannot be created.
> 
> 
> 
> Now, the new issue is that we can't start more than 1 context at the same 
> time. I think we will need to setup a proper metastore.
> 
> 
> 
> -kind regards, Gerard.
> 
> 
> 
> 
> 
> On Thu, May 26, 2016 at 3:06 PM, Mich Talebzadeh  > wrote:
> To use HiveContext witch is basically an sql api within Spark without proper 
> hive set up does not make sense. It is a super set of Spark SQLContext
> 
> In addition simple things like registerTempTable may not work.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 26 May 2016 at 13:01, Silvio Fiorito  > wrote:
> Hi Gerard,
> 
>  
> 
> I’ve never had an issue using the HiveContext without a hive-site.xml 
> configured. However, one issue you may have is if multiple users are starting 
> the HiveContext from the same path, they’ll all be trying to store the 
> default Derby metastore in the same location. Also, if you want them to be 
> able to persist permanent table metadata for SparkSQL then you’ll want to set 
> up a true metastore.
> 
>  
> 
> The other thing it could be is Hive dependency collisions from the classpath, 
> but that shouldn’t be an issue since you said it’s standalone (not a Hadoop 
> distro right?).
> 
>  
> 
> Thanks,
> 
> Silvio
> 
>  
> 
> From: Gerard Maas >
> Date: Thursday, May 26, 2016 at 5:28 AM
> To: spark users >
> Subject: HiveContext standalone => without a Hive metastore
> 
>  
> 
> Hi,
> 
>  
> 
> I'm helping some folks setting up an analytics cluster with  Spark.
> 
> They want to use the HiveContext to enable the Window functions on 
> DataFrames(*) but they don't have any Hive installation, nor they need one at 
> the moment (if not necessary for this feature)
> 
>  
> 
> When we try to create a Hive context, we get the following error:
> 
>  
> 
> > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
> 
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> 
>at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
> 
>  
> 
> Is my HiveContext failing b/c it wants to connect to an 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
Mich, 

Most people use vendor releases because they need to have the support. 
Hortonworks is the vendor who has the most skin in the game when it comes to 
Tez. 

If memory serves, Tez isn’t going to be M/R but a local execution engine? Then 
LLAP is the in-memory piece to speed up Tez? 

HTH

-Mike

> On May 29, 2016, at 1:35 PM, Mich Talebzadeh  
> wrote:
> 
> thanks I think the problem is that the TEZ user group is exceptionally quiet. 
> Just sent an email to Hive user group to see anyone has managed to built a 
> vendor independent version.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 29 May 2016 at 21:23, Jörn Franke  > wrote:
> Well I think it is different from MR. It has some optimizations which you do 
> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
> 
> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
> integrated in the Hortonworks distribution. 
> 
> 
> On 29 May 2016, at 21:43, Mich Talebzadeh  > wrote:
> 
>> Hi Jorn,
>> 
>> I started building apache-tez-0.8.2 but got few errors. Couple of guys from 
>> TEZ user group kindly gave a hand but I could not go very far (or may be I 
>> did not make enough efforts) making it work.
>> 
>> That TEZ user group is very quiet as well.
>> 
>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>> in-memory capability.
>> 
>> It would be interesting to see what version of TEZ works as execution engine 
>> with Hive.
>> 
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>> Hive etc as I am sure you already know.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 20:19, Jörn Franke > > wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>> On 29 May 2016, at 13:40, Mich Talebzadeh > > wrote:
>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>> ​ 
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION..  Now that is 
>>> just an individual partition and there are 48 partitions. 
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This is by no means indicate that Spark is much better than MR but shows 
>>> that some very good results can ve achieved using Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 24 May 2016 at 08:03, Mich Talebzadeh >> > wrote:
>>> Hi,
>>> 
>>> We use Hive as the database and use Spark as an all purpose query tool.
>>> 
>>> Whether Hive is the write database for purpose or one is better off with 
>>> something like Phoenix on Hbase, well the answer is it depends and your 
>>> mileage varies. 
>>> 
>>> So fit for purpose.
>>> 
>>> Ideally what wants is to use the fastest  method to get the results. How 
>>> fast we confine it to our SLA agreements in production and that helps us 
>>> from unnecessary further work as we technologists like to play around.
>>> 
>>> So in short, we use Spark most of the time and use Hive as the backend 
>>> engine for data storage, mainly ORC tables.
>>> 
>>> We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a 
>>> combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but 
>>> at the moment it is one of my projects.
>>> 
>>> We do not use any vendor's products as it enables us to move away  from 
>>> being tied down after years of SAP, Oracle and MS dependency to yet another 
>>> vendor. Besides there is some politics going on 

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
ld import data via sqoop and put it on HDFS. It has some mechanisms to 
> handle the lack of reliability by jdbc. 
> 
> Then you can process the data via Spark. You could also use jdbc rdd but I do 
> not recommend to use it, because you do not want to pull data all the time 
> out of the database when you execute your application. Furthermore, you have 
> to handle connection interruptions, the multiple 
> serialization/deserialization efforts, if one executor crashes you have to 
> repull some or all of the data from the database etc
> 
> Within the cluster it does not make sense to me to pull data via jdbc from 
> hive. All the benefits such as data locality, reliability etc would be gone.
> 
> Hive supports different execution engines (TEZ, Spark), formats (Orc, 
> parquet) and further optimizations to make the analysis fast. It always 
> depends on your use case.
> 
> On 22 Jun 2016, at 05:47, Michael Segel <msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>> wrote:
> 
>> 
>> Sorry, I think you misunderstood. 
>> Spark can read from JDBC sources so to say using beeline as a way to access 
>> data is not a spark application isn’t really true.  Would you say the same 
>> if you were pulling data in to spark from Oracle or DB2? 
>> There are a couple of different design patterns and use cases where data 
>> could be stored in Hive yet your only access method is via a JDBC or 
>> Thift/Rest service.  Think also of compute / storage cluster 
>> implementations. 
>> 
>> WRT to #2, not exactly what I meant, by exposing the data… and there are 
>> limitations to the thift service…
>> 
>>> On Jun 21, 2016, at 5:44 PM, ayan guha <guha.a...@gmail.com 
>>> <mailto:guha.a...@gmail.com>> wrote:
>>> 
>>> 1. Yes, in the sense you control number of executors from spark application 
>>> config. 
>>> 2. Any IO will be done from executors (never ever on driver, unless you 
>>> explicitly call collect()). For example, connection to a DB happens one for 
>>> each worker (and used by local executors). Also, if you run a reduceByKey 
>>> job and write to hdfs, you will find a bunch of files were written from 
>>> various executors. What happens when you want to expose the data to world: 
>>> Spark Thrift Server (STS), which is a long running spark application (ie 
>>> spark context) which can serve data from RDDs. 
>>> 
>>> Suppose I have a data source… like a couple of hive tables and I access the 
>>> tables via beeline. (JDBC)  -  
>>> This is NOT a spark application, and there is no RDD created. Beeline is 
>>> just a jdbc client tool. You use beeline to connect to HS2 or STS. 
>>> 
>>> In this case… Hive generates a map/reduce job and then would stream the 
>>> result set back to the client node where the RDD result set would be built. 
>>>  -- 
>>> This is never true. When you connect Hive from spark, spark actually reads 
>>> hive metastore and streams data directly from HDFS. Hive MR jobs do not 
>>> play any role here, making spark faster than hive. 
>>> 
>>> HTH
>>> 
>>> Ayan
>>> 
>>> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <msegel_had...@hotmail.com 
>>> <mailto:msegel_had...@hotmail.com>> wrote:
>>> Ok, its at the end of the day and I’m trying to make sure I understand the 
>>> locale of where things are running.
>>> 
>>> I have an application where I have to query a bunch of sources, creating 
>>> some RDDs and then I need to join off the RDDs and some other lookup tables.
>>> 
>>> 
>>> Yarn has two modes… client and cluster.
>>> 
>>> I get it that in cluster mode… everything is running on the cluster.
>>> But in client mode, the driver is running on the edge node while the 
>>> workers are running on the cluster.
>>> 
>>> When I run a sparkSQL command that generates a new RDD, does the result set 
>>> live on the cluster with the workers, and gets referenced by the driver, or 
>>> does the result set get migrated to the driver running on the client? (I’m 
>>> pretty sure I know the answer, but its never safe to assume anything…)
>>> 
>>> The follow up questions:
>>> 
>>> 1) If I kill the  app running the driver on the edge node… will that cause 
>>> YARN to free up the cluster’s resources? (In cluster mode… that doesn’t 
>>> happen) What happens and how quickly?
>>> 
>>> 1a) If using the client mode… can I spin up and spin down t

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
A JDBC reliability problem? 

OK… a bit more explanation… 

Usually when you have to go back to a legacy system, it’s because the data set 
is usually metadata and is relatively small.  It’s not the sort of data that 
gets ingested into a data lake unless you’re also ingesting the metadata and 
are using HBase/MapR-DB, Cassandra or something like that. 

On top of all of this… when dealing with PII, you have to consider data at 
rest. This is also why you will see some things that make parts of the design a 
bit counterintuitive, and why you start to think of storage/compute models over 
traditional cluster design. 

A single-threaded JDBC connection to build a local RDD should be fine, and I 
would really be concerned if that were unreliable.  That would mean tools like 
Spark or Drill would have serious reliability issues when considering legacy 
system access. 
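
For illustration, a minimal sketch of pulling such a small lookup set into Spark over 
JDBC with a single connection; it assumes an existing sqlContext, and the URL, table and 
credentials are made up:

val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SVC")
  .option("dbtable", "SCHEMA.METADATA_TABLE")
  .option("user", "app_user")
  .option("password", "...")
  .load()          // no partitioning column, so a single connection does the read
jdbcDF.cache()     // keep the small lookup set around for local joins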

And of course I’m still trying to dig into YARN’s client vs cluster option. 


thx

> On Jun 22, 2016, at 9:36 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Thanks Mike for clarification.
> 
> I think there is another option to get data out of RDBMS through some form of 
> SELECT ALL COLUMNS TAB SEPARATED OR OTHER and put them in a flat file or 
> files. scp that file from the RDBMS directory to a private directory on HDFS 
> system  and push it into HDFS. That will by-pass the JDBC reliability problem 
> and I guess in this case one is in more control.
> 
> I do concur that there are security issues with this. For example that local 
> file system may have to have encryption etc  that will make it tedious. I 
> believe Jorn mentioned this somewhere.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 22 June 2016 at 15:59, Michael Segel <msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>> wrote:
> Hi, 
> 
> Just to clear a few things up… 
> 
> First I know its hard to describe some problems because they deal with client 
> confidential information. 
> (Also some basic ‘dead hooker’ thought problems to work through before facing 
> them at a client.) 
> The questions I pose here are very general and deal with some basic design 
> issues/consideration. 
> 
> 
> W.R.T JDBC / Beeline:
> There are many use cases where you don’t want to migrate some or all data to 
> HDFS.  This is why tools like Apache Drill exist. At the same time… there are 
> different cluster design patterns.  One such pattern is a storage/compute 
> model where you have multiple clusters acting either as compute clusters 
> which pull data from storage clusters. An example would be spinning up an EMR 
> cluster and running a M/R job where you read from S3 and output to S3.  Or 
> within your enterprise you have your Data Lake (Data Sewer) and then a 
> compute cluster for analytics. 
> 
> In addition, you have some very nasty design issues to deal with like 
> security. Yes, that’s the very dirty word nobody wants to deal with and in 
> most of these tools, security is an afterthought.  
> So you may not have direct access to the cluster or an edge node. You only 
> have access to a single port on a single machine through the firewall, which 
> is running beeline so you can pull data from your storage cluster. 
> 
> Its very possible that you have to pull data from the cluster thru beeline to 
> store the data within a spark job running on the cluster. (Oh the irony! ;-) 
> 
> Its important to understand that due to design constraints, options like 
> sqoop or running a query directly against Hive may not be possible and these 
> use cases do exist when dealing with PII information.
> 
> Its also important to realize that you may have to pull data from multiple 
> data sources for some not so obvious but important reasons… 
> So your spark app has to be able to generate an RDD from data in an RDBS 
> (Oracle, DB2, etc …) persist it for local lookups and then pull data from the 
> cluster and then output it either back to the cluster, another cluster, or 
> someplace else.  All of these design issues occur when you’re dealing with 
> large enterprises. 
> 
> 
> But I digress… 
> 
> The reason I started this thread was to get a better handle on where things 
> run when we make decisions to run either in client or cluster mode in YARN. 
> Issues like starting/stopping long running apps are an issue in production.  
> Tying up cluster resources while your application remains dormant waiting for 
> the next batch of events to c

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Michael Segel

Sorry, I think you misunderstood. 
Spark can read from JDBC sources, so saying that using beeline to access data 
is not a Spark application isn’t really true.  Would you say the same if you 
were pulling data into Spark from Oracle or DB2? 
There are a couple of different design patterns and use cases where data could 
be stored in Hive yet your only access method is via a JDBC or Thrift/REST 
service.  Think also of compute / storage cluster implementations. 

WRT #2, that’s not exactly what I meant by exposing the data… and there are 
limitations to the Thrift service…

> On Jun 21, 2016, at 5:44 PM, ayan guha <guha.a...@gmail.com> wrote:
> 
> 1. Yes, in the sense you control number of executors from spark application 
> config. 
> 2. Any IO will be done from executors (never ever on driver, unless you 
> explicitly call collect()). For example, connection to a DB happens one for 
> each worker (and used by local executors). Also, if you run a reduceByKey job 
> and write to hdfs, you will find a bunch of files were written from various 
> executors. What happens when you want to expose the data to world: Spark 
> Thrift Server (STS), which is a long running spark application (ie spark 
> context) which can serve data from RDDs. 
> 
> Suppose I have a data source… like a couple of hive tables and I access the 
> tables via beeline. (JDBC)  -  
> This is NOT a spark application, and there is no RDD created. Beeline is just 
> a jdbc client tool. You use beeline to connect to HS2 or STS. 
> 
> In this case… Hive generates a map/reduce job and then would stream the 
> result set back to the client node where the RDD result set would be built.  
> -- 
> This is never true. When you connect Hive from spark, spark actually reads 
> hive metastore and streams data directly from HDFS. Hive MR jobs do not play 
> any role here, making spark faster than hive. 
> 
> HTH
> 
> Ayan
> 
> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>> wrote:
> Ok, its at the end of the day and I’m trying to make sure I understand the 
> locale of where things are running.
> 
> I have an application where I have to query a bunch of sources, creating some 
> RDDs and then I need to join off the RDDs and some other lookup tables.
> 
> 
> Yarn has two modes… client and cluster.
> 
> I get it that in cluster mode… everything is running on the cluster.
> But in client mode, the driver is running on the edge node while the workers 
> are running on the cluster.
> 
> When I run a sparkSQL command that generates a new RDD, does the result set 
> live on the cluster with the workers, and gets referenced by the driver, or 
> does the result set get migrated to the driver running on the client? (I’m 
> pretty sure I know the answer, but its never safe to assume anything…)
> 
> The follow up questions:
> 
> 1) If I kill the  app running the driver on the edge node… will that cause 
> YARN to free up the cluster’s resources? (In cluster mode… that doesn’t 
> happen) What happens and how quickly?
> 
> 1a) If using the client mode… can I spin up and spin down the number of 
> executors on the cluster? (Assuming that when I kill an executor any portion 
> of the RDDs associated with that executor are gone, however the spark context 
> is still alive on the edge node? [again assuming that the spark context lives 
> with the driver.])
> 
> 2) Any I/O between my spark job and the outside world… (e.g. walking through 
> the data set and writing out a data set to a file) will occur on the edge 
> node where the driver is located?  (This may seem kinda silly, but what 
> happens when you want to expose the result set to the world… ? )
> 
> Now for something slightly different…
> 
> Suppose I have a data source… like a couple of hive tables and I access the 
> tables via beeline. (JDBC)  In this case… Hive generates a map/reduce job and 
> then would stream the result set back to the client node where the RDD result 
> set would be built.  I realize that I could run Hive on top of spark, but 
> that’s a separate issue. Here the RDD will reside on the client only.  (That 
> is I could in theory run this as a single spark instance.)
> If I were to run this on the cluster… then the result set would stream thru 
> the beeline gate way and would reside back on the cluster sitting in RDDs 
> within each executor?
> 
> I realize that these are silly questions but I need to make sure that I know 
> the flow of the data and where it ultimately resides.  There really is a 
> method to my madness, and if I could explain it… these questions really would 
> make sense. ;-)
> 
> TIA,
> 
> -Mike
> 
> 
> ---

Re: Union of multiple RDDs

2016-06-21 Thread Michael Segel
By repartition I think you mean coalesce() where you would get one parquet file 
per partition? 

And this would be a new immutable copy so that you would want to write this new 
RDD to a different HDFS directory? 

-Mike
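
Roughly what I have in mind, as a sketch only, using the 1.6 DataFrame API; the 
paths and the target partition count are invented:

    // Union many small inputs, collapse the partition explosion once,
    // then write a single immutable parquet copy to a new directory.
    val inputs = Seq("/data/in/part1", "/data/in/part2", "/data/in/part3")
    val combined = inputs.map(p => sqlContext.read.text(p)).reduce(_ unionAll _)
    combined.coalesce(16)                        // one output file per partition
      .write.parquet("/data/out/combined_parquet")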

> On Jun 21, 2016, at 8:06 AM, Eugene Morozov  
> wrote:
> 
> Apurva, 
> 
> I'd say you have to apply repartition just once to the RDD that is union of 
> all your files.
> And it has to be done right before you do anything else.
> 
> If something is not needed on your files, then the sooner you project, the 
> better.
> 
> Hope, this helps.
> 
> --
> Be well!
> Jean Morozov
> 
> On Tue, Jun 21, 2016 at 4:48 PM, Apurva Nandan  > wrote:
> Hello,
> 
> I am trying to combine several small text files (each file is approx hundreds 
> of MBs to 2-3 gigs) into one big parquet file. 
> 
> I am loading each one of them and trying to take a union, however this leads 
> to enormous amounts of partitions, as union keeps on adding the partitions of 
> the input RDDs together.
> 
> I also tried loading all the files via wildcard, but that behaves almost the 
> same as union i.e. generates a lot of partitions.
> 
> One of the approaches I thought of was to repartition the RDD generated after 
> each union and then continue the process, but I don't know how efficient that 
> is.
> 
> Has anyone came across this kind of thing before?
> 
> - Apurva 
> 
> 
> 



Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Michael Segel
OK, it’s the end of the day and I’m trying to make sure I understand where 
things are actually running. 

I have an application where I have to query a bunch of sources, creating some 
RDDs and then I need to join off the RDDs and some other lookup tables. 


Yarn has two modes… client and cluster. 

I get it that in cluster mode… everything is running on the cluster. 
But in client mode, the driver is running on the edge node while the workers 
are running on the cluster.

When I run a Spark SQL command that generates a new RDD, does the result set 
live on the cluster with the workers and get referenced by the driver, or does 
the result set get migrated to the driver running on the client? (I’m pretty 
sure I know the answer, but it’s never safe to assume anything…) 

The follow up questions:

1) If I kill the app running the driver on the edge node… will that cause YARN 
to free up the cluster’s resources? (In cluster mode… that doesn’t happen.) What 
happens and how quickly? 

1a) If using the client mode… can I spin up and spin down the number of 
executors on the cluster? (Assuming that when I kill an executor any portion of 
the RDDs associated with that executor are gone, however the spark context is 
still alive on the edge node? [again assuming that the spark context lives with 
the driver.]) 

2) Any I/O between my spark job and the outside world… (e.g. walking through 
the data set and writing out a data set to a file) will occur on the edge node 
where the driver is located?  (This may seem kinda silly, but what happens when 
you want to expose the result set to the world… ? ) 

Now for something slightly different… 

Suppose I have a data source… like a couple of hive tables and I access the 
tables via beeline. (JDBC)  In this case… Hive generates a map/reduce job and 
then would stream the result set back to the client node where the RDD result 
set would be built.  I realize that I could run Hive on top of spark, but 
that’s a separate issue. Here the RDD will reside on the client only.  (That is 
I could in theory run this as a single spark instance.) 
If I were to run this on the cluster… then the result set would stream through 
the beeline gateway and would reside back on the cluster, sitting in RDDs within 
each executor? 

I realize that these are silly questions but I need to make sure that I know 
the flow of the data and where it ultimately resides.  There really is a method 
to my madness, and if I could explain it… these questions really would make 
sense. ;-) 

TIA, 

-Mike


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
The only documentation on this… in terms of direction … (that I could find)

If your client is not close to the cluster (e.g. your PC) then you definitely 
want to go cluster to improve performance.
If your client is close to the cluster (e.g. an edge node) then you could go 
either client or cluster.  Note that by going client, more resources are going 
to be used on the edge node.

HTH

-Mike
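
In concrete terms the switch is just the deploy-mode flag on spark-submit. A 
sketch, with a made-up application class and jar:

    # client mode: the driver runs on the edge node where you launch the job
    spark-submit --master yarn --deploy-mode client  --class com.example.MyApp myapp.jar

    # cluster mode: the driver runs inside the YARN application master on the cluster
    spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar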

> On Jun 22, 2016, at 1:51 PM, Marcelo Vanzin  wrote:
> 
> On Wed, Jun 22, 2016 at 1:32 PM, Mich Talebzadeh
>  wrote:
>> Does it also depend on the number of Spark nodes involved in choosing which
>> way to go?
> 
> Not really.
> 
> -- 
> Marcelo
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
LOL… I hate YARN, but unfortunately I don’t get to make the call on which tools 
we’re going to use, I just get paid to make stuff work on the tools provided. 
;-) 

Testing is somewhat problematic.  You really have to test at some large enough 
fraction of scale. 
Fortunately this issue (YARN client vs cluster) is just in how you launch a 
job, so it’s really just a matter of benchmarking the time difference. 

But this gets to the question… what are the real differences between client and 
cluster modes? 

What are the pros/cons and use cases where one has advantages over the other?  
I couldn’t find anything on the web, so I’m here asking these ‘silly’ 
questions… ;-)

The issue of Mesos over YARN is that you now have to look at Myriad to help 
glue stuff together, or so I’m told. 
And you now may have to add an additional vendor support contract unless your 
Hadoop vendor is willing to support Mesos. 

But of course, YMMV and maybe someone else can add something of value than my 
$0.02 worth. ;-) 

-Mike

> On Jun 22, 2016, at 12:04 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> This is exactly the sort of topics that distinguish lab work from enterprise 
> practice :)
> 
> The question on YARN client versus YARN cluster mode. I am not sure how much 
> in real life it is going to make an impact if I choose one over the other?
> 
> These days I tell developers that it is perfectly valid to use Spark local 
> mode for their dev/unit testing. At least they know how to look after it with 
> the help of the Spark web GUI.
> 
> Also, has anyone tried using Mesos instead of YARN? What does it offer above 
> YARN?
> 
> Cheers
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 22 June 2016 at 19:04, Michael Segel <msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>> wrote:
> JDBC reliability problem? 
> 
> Ok… a bit more explanation… 
> 
> Usually when you have to go back to a legacy system, its because the data set 
> is usually metadata and is relatively small.  Its not the sort of data that 
> gets ingested in to a data lake unless you’re also ingesting the metadata and 
> are using HBase/MapRDB , Cassandra or something like that. 
> 
> On top of all of this… when dealing with PII, you have to consider data at 
> rest. This is also why you will see some things that make parts of the design 
> a bit counter intuitive and why you start to think of storage/compute models 
> over traditional cluster design. 
> 
> A single threaded JDBC connection to build a local RDD should be fine and I 
> would really be concerned if that was unreliable.  That would mean tools like 
> Spark or Drill would have serious reliability issues when considering legacy 
> system access. 
> 
> And of course I’m still trying to dig in to YARN’s client vs cluster option. 
> 
> 
> thx
> 
>> On Jun 22, 2016, at 9:36 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> Thanks Mike for clarification.
>> 
>> I think there is another option to get data out of RDBMS through some form 
>> of SELECT ALL COLUMNS TAB SEPARATED OR OTHER and put them in a flat file or 
>> files. scp that file from the RDBMS directory to a private directory on HDFS 
>> system  and push it into HDFS. That will by-pass the JDBC reliability 
>> problem and I guess in this case one is in more control.
>> 
>> I do concur that there are security issues with this. For example that local 
>> file system may have to have encryption etc  that will make it tedious. I 
>> believe Jorn mentioned this somewhere.
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 22 June 2016 at 15:59, Michael Segel <msegel_had...@hotmail.com 
>> <mailto:msegel_had...@hotmail.com>> wrote:
>> Hi, 
>> 
>> Just to clear a few things up… 
>> 
>> First I know its hard to describe some problems because they deal with 
>> client confidential information. 
>> (Also some basic ‘dead hooker’ thought problems to work through before 
>> facing them at a client.) 
>> The questions I

Re: Spark Thrift Server Concurrency

2016-06-23 Thread Michael Segel
Hi, 
There are a lot of moving parts and a lot of unknowns in your description, 
quite apart from the version differences. 

How many executors, how many cores? How much memory? 
Are you persisting (memory and disk) or just caching (memory only)? 

During the execution… same tables… are you seeing a lot of shuffling of data 
for some queries and not others? 

It sounds like an interesting problem… 
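
One thing worth ruling out (a sketch, not a diagnosis): if the single driver is 
scheduling those 20 queries FIFO, the fair scheduler is the documented knob for 
letting them interleave. The file path below is a placeholder:

    # spark-defaults.conf for the Thrift Server
    spark.scheduler.mode             FAIR
    spark.scheduler.allocation.file  /etc/spark/fairscheduler.xml   # optional pool definitions

Within a single pool the default is still FIFO, so as I understand it the real 
win comes from putting different sessions into separate pools.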

> On Jun 23, 2016, at 5:21 AM, Prabhu Joseph  wrote:
> 
> Hi All,
> 
>On submitting 20 parallel same SQL query to Spark Thrift Server, the query 
> execution time for some queries are less than a second and some are more than 
> 2seconds. The Spark Thrift Server logs shows all 20 queries are submitted at 
> same time 16/06/23 12:12:01 but the result schema are at different times.
> 
> 16/06/23 12:12:01 INFO SparkExecuteStatementOperation: Running query 'select 
> distinct val2 from philips1 where key>=1000 and key<=1500
> 
> 16/06/23 12:12:02 INFO SparkExecuteStatementOperation: Result Schema: 
> ArrayBuffer(val2#2110)
> 16/06/23 12:12:03 INFO SparkExecuteStatementOperation: Result Schema: 
> ArrayBuffer(val2#2182)
> 16/06/23 12:12:04 INFO SparkExecuteStatementOperation: Result Schema: 
> ArrayBuffer(val2#2344)
> 16/06/23 12:12:05 INFO SparkExecuteStatementOperation: Result Schema: 
> ArrayBuffer(val2#2362)
> 
> There are sufficient executors running on YARN. The concurrency is affected 
> by Single Driver. How to improve the concurrency and what are the best 
> practices.
> 
> Thanks,
> Prabhu Joseph



Re: temporary tables created by registerTempTable()

2016-02-15 Thread Michael Segel
I was just looking at that… 

Out of curiosity… if you make it a Hive Temp Table… who has access to the data? 

Just your app, or anyone with access to the same database? (Would you be able 
to share data across different JVMs?) 

(E.g., I have a reader that reads from source A and needs to publish the data 
to a bunch of minions (B).) 

Would this be an option? 

Thx

-Mike
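
My working mental model, as a sketch (Spark 1.x API; the query mirrors the 
example quoted underneath, and the output table name is invented):

    // The temp table is just a name registered in this one SQLContext/JVM;
    // no other application or JVM can see it, and nothing is copied to every node.
    val s = hiveContext.sql("SELECT amount_sold, time_id, channel_id FROM oraclehadoop.sales")
    s.registerTempTable("t_s")

    // To share the data with other JVMs/apps it has to be materialized,
    // e.g. written out as a real table in the metastore:
    s.write.saveAsTable("t_s_materialized")

So my question is really whether a Hive temp table buys anything beyond that 
context-local scope.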

> On Feb 15, 2016, at 7:54 AM, Mich Talebzadeh 
>  wrote:
> 
>> Hi,
>> 
>>  
>> It is my understanding that the registered temporary tables created by 
>> registerTempTable() used in Spark shell built on ORC files?
>> 
>> For example the following Data Frame just creates a logical abstraction
>> 
>> scala> var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM 
>> oraclehadoop.sales")
>> s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID: 
>> timestamp, CHANNEL_ID: bigint] 
>> 
>> Then I registar this data frame as temporary table using registerTempTable() 
>> call
>> 
>> s.registerTempTable("t_s")
>> 
>> Also I believe that s.registerTempTable("t_s") creates an in-memory table 
>> that is scoped to the cluster in which it was created. The data is stored 
>> using Hive's ORC format and this tempTable is stored in memory on all nodes 
>> of the cluster?  In other words every node in the cluster has a copy of 
>> tempTable in its memory?
>> 
>> Thanks,
>> 
>>  
>> -- 
>> Dr Mich Talebzadeh
>> 
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> 
>> http://talebzadehmich.wordpress.com
>> 
>> 
>  
>  
> -- 
> Dr Mich Talebzadeh
> 
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
> 
> 



Re: Sqoop on Spark

2016-04-06 Thread Michael Segel
I don’t think it’s necessarily a bad idea.

Sqoop is an ugly tool and it requires you to make some assumptions as a way to 
gain parallelism. (Not that most of the assumptions are not valid for most of 
the use cases…) 

Depending on what you want to do… your data may not be persisted on HDFS.  
There are use cases where your cluster is used for compute and not storage.

I’d say that spending time re-inventing the wheel can be a good thing. 
It would be a good idea for many to rethink their ingestion process so that 
they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing that term 
from Dean Wampler. ;-) 

Just saying. ;-) 

-Mike
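
The parallelism assumptions look the same whichever tool you use. A sketch of 
the Spark side (1.6 JDBC options; every name and number here is a placeholder):

    // You still have to know a numeric, reasonably well-distributed split column
    // and its range up front: exactly the assumption Sqoop forces on you.
    val orders = sqlContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SVC")
      .option("dbtable", "SALES.ORDERS")
      .option("partitionColumn", "ORDER_ID")
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "16")          // 16 concurrent range queries against the source
      .option("user", "app_user")
      .option("password", "*****")
      .load()

If ORDER_ID is skewed, a few of those 16 ranges end up doing most of the work, 
which is the same failure mode you hit with sqoop’s --split-by.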

> On Apr 5, 2016, at 10:44 PM, Jörn Franke  wrote:
> 
> I do not think you can be more resource efficient. In the end you have to 
> store the data anyway on HDFS . You have a lot of development effort for 
> doing something like sqoop. Especially with error handling. 
> You may create a ticket with the Sqoop guys to support Spark as an execution 
> engine and maybe it is less effort to plug it in there.
> Maybe if your cluster is loaded then you may want to add more machines or 
> improve the existing programs.
> 
> On 06 Apr 2016, at 07:33, ayan guha  > wrote:
> 
>> One of the reason in my mind is to avoid Map-Reduce application completely 
>> during ingestion, if possible. Also, I can then use Spark stand alone 
>> cluster to ingest, even if my hadoop cluster is heavily loaded. What you 
>> guys think?
>> 
>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke > > wrote:
>> Why do you want to reimplement something which is already there?
>> 
>> On 06 Apr 2016, at 06:47, ayan guha > > wrote:
>> 
>>> Hi
>>> 
>>> Thanks for reply. My use case is query ~40 tables from Oracle (using index 
>>> and incremental only) and add data to existing Hive tables. Also, it would 
>>> be good to have an option to create Hive table, driven by job specific 
>>> configuration. 
>>> 
>>> What do you think?
>>> 
>>> Best
>>> Ayan
>>> 
>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro >> > wrote:
>>> Hi,
>>> 
>>> It depends on your use case using sqoop.
>>> What's it like?
>>> 
>>> // maropu
>>> 
>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha >> > wrote:
>>> Hi All
>>> 
>>> Asking opinion: is it possible/advisable to use spark to replace what sqoop 
>>> does? Any existing project done in similar lines?
>>> 
>>> -- 
>>> Best Regards,
>>> Ayan Guha
>>> 
>>> 
>>> 
>>> -- 
>>> ---
>>> Takeshi Yamamuro
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards,
>>> Ayan Guha
>> 
>> 
>> 
>> -- 
>> Best Regards,
>> Ayan Guha



Fwd: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Segel

Sorry for duplicate(s), I forgot to switch my email address. 

> Begin forwarded message:
> 
> From: Michael Segel <mse...@segel.com>
> Subject: Re: Can i have a hive context and sql context in the same app ?
> Date: April 12, 2016 at 4:05:26 PM MST
> To: Michael Armbrust <mich...@databricks.com>
> Cc: Natu Lauchande <nlaucha...@gmail.com>, "user@spark.apache.org" 
> <user@spark.apache.org>
> 
> Reading from multiple sources within the same application? 
> 
> How would you connect to Hive for some data and then reach out to lets say 
> Oracle or DB2 for some other data that you may want but isn’t available on 
> your cluster? 
> 
> 
>> On Apr 12, 2016, at 10:52 AM, Michael Armbrust <mich...@databricks.com 
>> <mailto:mich...@databricks.com>> wrote:
>> 
>> You can, but I'm not sure why you would want to.  If you want to isolate 
>> different users just use hiveContext.newSession().
>> 
>> On Tue, Apr 12, 2016 at 1:48 AM, Natu Lauchande <nlaucha...@gmail.com 
>> <mailto:nlaucha...@gmail.com>> wrote:
>> Hi,
>> 
>> Is it possible to have both a sqlContext and a hiveContext in the same 
>> application ?
>> 
>> If yes would there be any performance pernalties of doing so.
>> 
>> Regards,
>> Natu
>> 
> 
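
To answer my own question above with a sketch: nothing stops one application 
from mixing a HiveContext query with a JDBC read and joining the two (Spark 1.x 
API; the tables, columns and the DB2 endpoint are all made up):

    val hiveSide = hiveContext.sql("SELECT cust_id, total FROM warehouse.orders")
    val dbSide = hiveContext.read.format("jdbc")
      .option("url", "jdbc:db2://dbhost:50000/CRM")
      .option("dbtable", "CRM.CUSTOMERS")
      .option("user", "app_user")
      .option("password", "*****")
      .load()
    // The lookup data from the RDBMS stays as a DataFrame alongside the Hive data.
    val joined = hiveSide.join(dbSide, hiveSide("cust_id") === dbSide("CUST_ID"))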



Silly question...

2016-04-12 Thread Michael Segel
Hi, 
This is probably a silly question on my part… 

I’m looking at the latest (spark 1.6.1 release) and would like to do a build w 
Hive and JDBC support. 

From the documentation, I see two things that make me scratch my head.

1) Scala 2.11 
"Spark does not yet support its JDBC component for Scala 2.11.”

So if we want to use JDBC, don’t use Scala 2.11.x (in this case its 2.11.8)

2) Hive Support
"To enable Hive integration for Spark SQL along with its JDBC server and CLI, 
add the -Phive and Phive-thriftserver profiles to your existing build options. 
By default Spark will build with Hive 0.13.1 bindings.”

So if we’re looking at a later release of Hive… lets say 1.1.x … still use the 
-Phive and Phive-thriftserver . Is there anything else we should consider? 

Just asking because I’ve noticed that this part of the documentation hasn’t 
changed much over the past releases. 

Thanks in Advance, 

-Mike
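
For reference, the build line I’m describing would look roughly like this 
(taken from the 1.6 build documentation; pick the -Phadoop profile that matches 
your cluster):

    # default Scala 2.10 build with Hive + JDBC/thrift support
    ./build/mvn -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -DskipTests clean package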



Re: Sqoop on Spark

2016-04-11 Thread Michael Segel
Depending on the Oracle release… 

You could use webHDFS to gain access to the cluster and see the CSV file as an 
external table. 

However, you would need to have an application that will read each block of the 
file in parallel. This works for loading into the RDBMS itself.  Actually, you 
could use sqoop in reverse to push data to the RDBMS, provided that the block 
file is splittable.  This is a classic M/R problem. 

But I don’t think this is what the OP wants to do. They want to pull data from 
the RDBMs. If you could drop the table’s underlying file and can read directly 
from it… you can do a very simple bulk load/unload process. However you need to 
know the file’s format. 

Not sure what IBM or Oracle has done to tie their RDBMs to Big Data. 

As I and other posters to this thread have alluded to… this would be a block 
bulk load/unload tool. 


> On Apr 10, 2016, at 11:31 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> 
> I am not 100% sure, but you could export to CSV in Oracle using external 
> tables.
> 
> Oracle has also the Hadoop Loader, which seems to support Avro. However, I 
> think you need to buy the Big Data solution.
> 
> On 10 Apr 2016, at 16:12, Mich Talebzadeh <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>> wrote:
> 
>> Yes I meant MR.
>> 
>> Again one cannot beat the RDBMS export utility. I was specifically referring 
>> to Oracle in above case that does not provide any specific text bases export 
>> except the binary one Exp, data pump etc).
>> 
>> In case of SAPO ASE, Sybase IQ, and MSSQL, one can use BCP (bulk copy) that 
>> can be parallelised either through range partitioning or simple round robin 
>> partitioning that can be used to get data out to file in parallel. Then once 
>> get data into Hive table through import etc.
>> 
>> In general if the source table is very large you can used either SAP 
>> Replication Server (SRS) or Oracle Golden Gate to get data to Hive. Both 
>> these replication tools provide connectors to Hive and they do a good job. 
>> If one has something like Oracle in Prod then there is likely a Golden Gate 
>> there. For bulk setting of Hive tables and data migration, replication 
>> server is good option.
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 10 April 2016 at 14:24, Michael Segel <msegel_had...@hotmail.com 
>> <mailto:msegel_had...@hotmail.com>> wrote:
>> Sqoop doesn’t use MapR… unless you meant to say M/R (Map Reduce) 
>> 
>> The largest problem with sqoop is that in order to gain parallelism you need 
>> to know how your underlying table is partitioned and to do multiple range 
>> queries. This may not be known, or your data may or may not be equally 
>> distributed across the ranges.  
>> 
>> If you’re bringing over the entire table, you may find dropping it and then 
>> moving it to HDFS and then doing a bulk load to be more efficient.
>> (This is less flexible than sqoop, but also stresses the database servers 
>> less. ) 
>> 
>> Again, YMMV
>> 
>> 
>>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>> Well unless you have plenty of memory, you are going to have certain issues 
>>> with Spark.
>>> 
>>> I tried to load a billion rows table from oracle through spark using JDBC 
>>> and ended up with "Caused by: java.lang.OutOfMemoryError: Java heap space" 
>>> error.
>>> 
>>> Sqoop uses MapR and does it in serial mode which takes time and you can 
>>> also tell it to create Hive table. However, it will import data into Hive 
>>> table.
>>> 
>>> In any case the mechanism of data import is through JDBC, Spark uses memory 
>>> and DAG, whereas Sqoop relies on MapR.
>>> 
>>> There is of course another alternative.
>>> 
>>> Assuming that your Oracle table has a primary Key say "ID" (it would be 
>>> easier if it was a monotonically increasing number) or already partitioned.
>>> 
>>> You can create views based on the range of ID or for each partition. You 
>>> can then SELECT COLUMNS  co1, col2, coln from view and spool it t

Re: Sqoop on Spark

2016-04-10 Thread Michael Segel
There is an additional layer that coordinates all of it.
> I know Oracle has a similar technology I've used it and had to supply the 
> JDBC driver.
> 
> Teradata Connector is for batch data copy, QueryGrid is for interactive data 
> movement.
> 
> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com 
> <mailto:java8...@hotmail.com>> wrote:
> If they do that, they must provide a customized input format, instead of 
> through JDBC.
> 
> Yong
> 
> Date: Wed, 6 Apr 2016 23:56:54 +0100
> Subject: Re: Sqoop on Spark
> From: mich.talebza...@gmail.com <mailto:mich.talebza...@gmail.com>
> To: mohaj...@gmail.com <mailto:mohaj...@gmail.com>
> CC: jornfra...@gmail.com <mailto:jornfra...@gmail.com>; 
> msegel_had...@hotmail.com <mailto:msegel_had...@hotmail.com>; 
> guha.a...@gmail.com <mailto:guha.a...@gmail.com>; linguin@gmail.com 
> <mailto:linguin@gmail.com>; user@spark.apache.org 
> <mailto:user@spark.apache.org>
> 
> 
> SAP Sybase IQ does that and I believe SAP Hana as well.
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 
> 
> On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com 
> <mailto:mohaj...@gmail.com>> wrote:
> For some MPP relational stores (not operational) it maybe feasible to run 
> Spark jobs and also have data locality. I know QueryGrid (Teradata) and 
> PolyBase (microsoft) use data locality to move data between their MPP and 
> Hadoop. 
> I would guess (have no idea) someone like IBM already is doing that for 
> Spark, maybe a bit off topic!
> 
> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com 
> <mailto:jornfra...@gmail.com>> wrote:
> Well I am not sure, but using a database as a storage, such as relational 
> databases or certain nosql databases (eg MongoDB) for Spark is generally a 
> bad idea - no data locality, it cannot handle real big data volumes for 
> compute and you may potentially overload an operational database. 
> And if your job fails for whatever reason (eg scheduling ) then you have to 
> pull everything out again. Sqoop and HDFS seems to me the more elegant 
> solution together with spark. These "assumption" on parallelism have to be 
> anyway made with any solution.
> Of course you can always redo things, but why - what benefit do you expect? A 
> real big data platform has to support anyway many different tools otherwise 
> people doing analytics will be limited. 
> 
> On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>> wrote:
> 
> I don’t think its necessarily a bad idea.
> 
> Sqoop is an ugly tool and it requires you to make some assumptions as a way 
> to gain parallelism. (Not that most of the assumptions are not valid for most 
> of the use cases…) 
> 
> Depending on what you want to do… your data may not be persisted on HDFS.  
> There are use cases where your cluster is used for compute and not storage.
> 
> I’d say that spending time re-inventing the wheel can be a good thing. 
> It would be a good idea for many to rethink their ingestion process so that 
> they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing that term 
> from Dean Wampler. ;-) 
> 
> Just saying. ;-) 
> 
> -Mike
> 
> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com 
> <mailto:jornfra...@gmail.com>> wrote:
> 
> I do not think you can be more resource efficient. In the end you have to 
> store the data anyway on HDFS . You have a lot of development effort for 
> doing something like sqoop. Especially with error handling. 
> You may create a ticket with the Sqoop guys to support Spark as an execution 
> engine and maybe it is less effort to plug it in there.
> Maybe if your cluster is loaded then you may want to add more machines or 
> improve the existing programs.
> 
> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com 
> <mailto:guha.a...@gmail.com>> wrote:
> 
> One of the reason in my mind is to avoid Map-Reduce application completely 
> during ingestion, if possible. Also, I can then use Spark stand alone cluster 
> to ingest, even if my hadoop cluster is heavily loaded. What you guys think?
> 
> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com 
> <mailto:jornfra...@gmail.com>> wrote:
> Why do you want to reimplement something which is already there?
> 
> On 06 Ap

Re: Spark SQL Json Parse

2016-03-03 Thread Michael Segel
Why do you want to write out NULL if the column has no data? 
Just insert the fields that you have. 


> On Mar 3, 2016, at 9:10 AM, barisak  wrote:
> 
> Hi,
> 
> I have a problem with Json Parser. I am using spark streaming with
> hiveContext for keeping json format tweets. The flume collects tweets and
> sink to hdfs path. My spark streaming job checks the hdfs path and convert
> coming json tweets and insert them to hive table.
> 
> My problem is that ;
> 
> Some of tweets have location name information or in reply to information
> etc. but some of tweets have not any information for that columns.
> 
> In Spark streaming job, I try to make generic json handler for example if
> the tweet has no location information spark can write null value to hive
> table for location column.
> 
> Is there any way to make this?
> 
> I am using Spark 1.3.0 version
> 
> Thanks.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Json-Parse-tp26391.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unable to Limit UI to localhost interface

2016-03-30 Thread Michael Segel
It sounds like when you start up Spark, it’s binding to 0.0.0.0, which means it 
will listen on all interfaces.  
You should be able to limit which interface it uses.  

The weird thing is that if you are specifying the IP Address and Port, Spark 
shouldn’t be listening on all of the interfaces for that port. 
(This could be a UI bug? ) 

The other issue… you need to put a firewall in front of your cluster/machine. 
This is probably a best practice issue. 



> On Mar 30, 2016, at 12:25 AM, Akhil Das  wrote:
> 
> In your case, you will be able to see the webui (unless restricted with 
> iptables) but you won't be able to submit jobs to that machine from a remote 
> machine since the spark master is spark://127.0.0.1:7077 
> 
> 
> Thanks
> Best Regards
> 
> On Tue, Mar 29, 2016 at 8:12 PM, David O'Gwynn  > wrote:
> /etc/hosts
> 
> 127.0.0.1 localhost
> 
> conf/slaves 
> 127.0.0.1
> 
> 
> On Mon, Mar 28, 2016 at 5:36 PM, Mich Talebzadeh  > wrote:
> in your /etc/hosts what do you have for localhost
> 
> 127.0.0.1 localhost.localdomain localhost
> 
> conf/slave should have one entry in your case
> 
> cat slaves
> # A Spark Worker will be started on each of the machines listed below.
> localhost
> ...
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 28 March 2016 at 15:32, David O'Gwynn  > wrote:
> Greetings to all,
> 
> I've search around the mailing list, but it would seem that (nearly?) 
> everyone has the opposite problem as mine. I made a stab at looking in the 
> source for an answer, but I figured I might as well see if anyone else has 
> run into the same problem as I.
> 
> I'm trying to limit my Master/Worker UI to run only on localhost. As it 
> stands, I have the following two environment variables set in my spark-env.sh:
> 
> SPARK_LOCAL_IP=127.0.0.1
> SPARK_MASTER_IP=127.0.0.1
> 
> and my slaves file contains one line: 127.0.0.1
> 
> The problem is that when I run "start-all.sh", I can nmap my box's public 
> interface and get the following:
> 
> PORT STATE SERVICE
> 22/tcp   open  ssh
> 8080/tcp open  http-proxy
> 8081/tcp open  blackice-icecap
> 
> Furthermore, I can go to my box's public IP at port 8080 in my browser and 
> get the master node's UI. The UI even reports that the URL/REST URLs to be 
> 127.0.0.1 :
> 
> Spark Master at spark://127.0.0.1:7077 
> URL: spark://127.0.0.1:7077 
> REST URL: spark://127.0.0.1:6066  (cluster mode)
> 
> I'd rather not have spark available in any way to the outside world without 
> an explicit SSH tunnel.
> 
> There are variables to do with setting the Web UI port, but I'm not concerned 
> with the port, only the network interface to which the Web UI binds.
> 
> Any help would be greatly appreciated.
> 
> 
> 
> 



Fwd: Spark and N-tier architecture

2016-03-29 Thread Michael Segel


> Begin forwarded message:
> 
> From: Michael Segel <mse...@segel.com>
> Subject: Re: Spark and N-tier architecture
> Date: March 29, 2016 at 4:16:44 PM MST
> To: Alexander Pivovarov <apivova...@gmail.com>
> Cc: Mich Talebzadeh <mich.talebza...@gmail.com>, Ashok Kumar 
> <ashok34...@yahoo.com>, User <user@spark.apache.org>
> 
> So… 
> 
> Is spark-jobserver an official part of spark or something else? 
> 
> From what I can find via a quick Google … this isn’t part of the core spark 
> distribution.
> 
>> On Mar 29, 2016, at 3:50 PM, Alexander Pivovarov <apivova...@gmail.com 
>> <mailto:apivova...@gmail.com>> wrote:
>> 
>> https://github.com/spark-jobserver/spark-jobserver 
>> <https://github.com/spark-jobserver/spark-jobserver>



Re: Silly question...

2016-04-13 Thread Michael Segel
Mich

Are you building your own releases from the source? 
Which version of Scala? 

Again, the builds seem to be ok and working, but I don’t want to hit some 
‘gotcha’ if I could avoid it. 


> On Apr 13, 2016, at 7:15 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Hi,
> 
> I am not sure this helps.
> 
> we use Spark 1.6 and Hive 2. I also use JDBC (beeline for Hive)  plus Oracle 
> and Sybase. They all work fine.
> 
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 12 April 2016 at 23:42, Michael Segel <msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>> wrote:
> Hi, 
> This is probably a silly question on my part… 
> 
> I’m looking at the latest (spark 1.6.1 release) and would like to do a build 
> w Hive and JDBC support. 
> 
> From the documentation, I see two things that make me scratch my head.
> 
> 1) Scala 2.11 
> "Spark does not yet support its JDBC component for Scala 2.11.”
> 
> So if we want to use JDBC, don’t use Scala 2.11.x (in this case its 2.11.8)
> 
> 2) Hive Support
> "To enable Hive integration for Spark SQL along with its JDBC server and CLI, 
> add the -Phive and Phive-thriftserver profiles to your existing build 
> options. By default Spark will build with Hive 0.13.1 bindings.”
> 
> So if we’re looking at a later release of Hive… lets say 1.1.x … still use 
> the -Phive and Phive-thriftserver . Is there anything else we should 
> consider? 
> 
> Just asking because I’ve noticed that this part of the documentation hasn’t 
> changed much over the past releases. 
> 
> Thanks in Advance, 
> 
> -Mike
> 
> 



Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Michael Segel
I don’t.

I believe that there have been a couple of hack-a-thons, like one done in 
Chicago a few years back using public transportation data.

The first question is what sort of data do you get from the city? 

I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y).   Or 
they could provide more information. Like last stop, distance to next stop, avg 
current velocity… 

Then there is the frequency of the updates. Every second? Every 3 seconds? 5 or 
6 seconds…

This will determine how much work you have to do. 

Maybe they provide the routes of the busses via a different API call since its 
relatively static.

This will drive your solution more than the underlying technology. 

Oh, and while I focused on buses, there are also rail and other modes of public 
transportation like light rail, trains, etc.… 

HTH

-Mike


> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> 
> wrote:
> 
> 
> Do you know any good examples how to use Spark streaming in tracking public 
> transportation systems ?
> 
> Or Storm or some other tool example ?
> 
> Regards
> Esa Heikkinen
> 
> 28.4.2016, 3:16, Michael Segel kirjoitti:
>> Uhm… 
>> I think you need to clarify a couple of things…
>> 
>> First there is this thing called analog signal processing…. Is that 
>> continuous enough for you? 
>> 
>> But more to the point, Spark Streaming does micro batching so if you’re 
>> processing a continuous stream of tick data, you will have more than 50K of 
>> tics per second while there are markets open and trading.  Even at 50K a 
>> second, that would mean 1 every .02 ms or 50 ticks a ms. 
>> 
>> And you don’t want to wait until you have a batch to start processing, but 
>> you want to process when the data hits the queue and pull it from the queue 
>> as quickly as possible. 
>> 
>> Spark streaming will be able to pull batches in as little as 500ms. So if 
>> you pull a batch at t0 and immediately have a tick in your queue, you won’t 
>> process that data until t0+500ms. And said batch would contain 25,000 
>> entries. 
>> 
>> Depending on what you are doing… that 500ms delay can be enough to be fatal 
>> to your trading process. 
>> 
>> If you don’t like stock data, there are other examples mainly when pulling 
>> data from real time embedded systems. 
>> 
>> 
>> If you go back and read what I said, if your data flow is >> (much slower) 
>> than 500ms, and / or the time to process is >> 500ms ( much longer )  you 
>> could use spark streaming.  If not… and there are applications which require 
>> that type of speed…  then you shouldn’t use spark streaming. 
>> 
>> If you do have that constraint, then you can look at systems like 
>> storm/flink/samza / whatever where you have a continuous queue and listener 
>> and no micro batch delays.
>> Then for each bolt (storm) you can have a spark context for processing the 
>> data. (Depending on what sort of processing you want to do.) 
>> 
>> To put this in perspective… if you’re using spark streaming / akka / storm 
>> /etc to handle real time requests from the web, 500ms added delay can be a 
>> long time. 
>> 
>> Choose the right tool. 
>> 
>> For the OP’s problem. Sure Tracking public transportation could be done 
>> using spark streaming. It could also be done using half a dozen other tools 
>> because the rate of data generation is much slower than 500ms. 
>> 
>> HTH
>> 
>> 
>>> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>> couple of things.
>>> 
>>> There is no such thing as Continuous Data Streaming as there is no such 
>>> thing as Continuous Availability.
>>> 
>>> There is such thing as Discrete Data Streaming and  High Availability  but 
>>> they reduce the finite unavailability to minimum. In terms of business 
>>> needs a 5 SIGMA is good enough and acceptable. Even the candles set to a 
>>> predefined time interval say 2, 4, 15 seconds overlap. No FX savvy trader 
>>> makes a sell or buy decision on the basis of 2 seconds candlestick
>>> 
>>> The calculation itself in measurements is subject to finite error as 
>>> defined by their Confidence Level (CL) using Standard Deviation function.
>>> 
>>> OK so far I have never noticed a tool that requires that details of 
>>> granularity. Those stuff from Flink etc is in practical term is of little 
>>> value and does not make commercial sense.
>>> 
>

Re: Spark support for Complex Event Processing (CEP)

2016-04-29 Thread Michael Segel
If you’re getting the logs, then it really isn’t CEP unless you consider the 
event to be the log from the bus. 
This doesn’t sound like there is a time constraint. 

Your bus schedule is fairly fixed and changes infrequently. 
Your bus stops are relatively fixed points. (Within a couple of meters) 

So then you’re taking bus A, which is scheduled to drive route 123, and you 
want to compare its nearest location to the bus stop at time T and see how close 
it is to the scheduled route. 


Or am I missing something? 

-Mike
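
If it really is just “compare the last estimate against the actual arrival,” 
the batch version is a couple of joins. A rough sketch over DataFrames; the 
column names follow the log descriptions quoted below, and everything else 
(including an already-derived arrivals table from LOG A) is assumed:

    import org.apache.spark.sql.functions._

    // logC: (timestamp, bus_number, bus_stop_number, estimated_arrival)
    // arrivals: (bus_number, bus_stop_number, actual_arrival), derived from LOG A
    // by matching coordinates against the bus-stop areas in the metadata.
    val lastEstimate = logC
      .groupBy("bus_number", "bus_stop_number")
      .agg(max("timestamp").as("timestamp"))
      .join(logC, Seq("bus_number", "bus_stop_number", "timestamp"))  // keep the latest estimate row
    val errors = lastEstimate
      .join(arrivals, Seq("bus_number", "bus_stop_number"))
      .withColumn("error_secs",
        unix_timestamp(col("actual_arrival")) - unix_timestamp(col("estimated_arrival")))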

> On Apr 29, 2016, at 3:54 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> 
> wrote:
> 
> 
> Hi
> 
> I try to explain my case ..
> 
> Situation is not so simple in my logs and solution. There also many types of 
> logs and there are from many sources.
> They are as csv-format and header line includes names of the columns.
> 
> This is simplified description of input logs.
> 
> LOG A's: bus coordinate logs (every bus has own log):
> - timestamp
> - bus number
> - coordinates
> 
> LOG B: bus login/logout (to/from line) message log:
> - timestamp
> - bus number
> - line number
> 
> LOG C:  log from central computers:
> - timestamp
> - bus number
> - bus stop number
> - estimated arrival time to bus stop
> 
> LOG A are updated every 30 seconds (i have also another system by 1 seconds 
> interval). LOG B are updated when bus starts from terminal bus stop and 
> arrives to final bus stop in a line. LOG C is updated when central computer 
> sends new arrival time estimation to bus stop.
> 
> I also need metadata for logs (and analyzer). For example coordinates for bus 
> stop areas.
> 
> Main purpose of analyzing is to check an accuracy (error) of the estimated 
> arrival time to bus stops.
> 
> Because there are many buses and lines, it is too time-comsuming to check all 
> of them. So i check only specific lines with specific bus stops. There are 
> many buses (logged to lines) coming to one bus stop and i am interested about 
> only certain bus.
> 
> To do that, i have to read log partly not in time order (upstream) by 
> sequence:
> 1. From LOG C is searched bus number
> 2. From LOG A is searched when the bus has leaved from terminal bus stop
> 3. From LOG B is searched when bus has sent a login to the line
> 4. From LOG A is searched when the bus has entered to bus stop
> 5. From LOG C is searched a last estimated arrival time to the bus stop and 
> calculates error between real and estimated value
> 
> In my understanding (almost) all log file analyzers reads all data (lines) in 
> time order from log files. My need is only for specific part of log (lines). 
> To achieve that, my solution is to read logs in an arbitrary order (with 
> given time window).
> 
> I know this solution is not suitable for all cases (for example for very fast 
> analyzing and very big data). This solution is suitable for very complex 
> (targeted) analyzing. It can be too slow and memory-consuming, but well done 
> pre-processing of log data can help a lot.
> 
> ---
> Esa Heikkinen
> 
> 28.4.2016, 14:44, Michael Segel kirjoitti:
>> I don’t.
>> 
>> I believe that there have been a  couple of hack-a-thons like one done in 
>> Chicago a few years back using public transportation data.
>> 
>> The first question is what sort of data do you get from the city? 
>> 
>> I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y).   
>> Or they could provide more information. Like last stop, distance to next 
>> stop, avg current velocity… 
>> 
>> Then there is the frequency of the updates. Every second? Every 3 seconds? 5 
>> or 6 seconds…
>> 
>> This will determine how much work you have to do. 
>> 
>> Maybe they provide the routes of the busses via a different API call since 
>> its relatively static.
>> 
>> This will drive your solution more than the underlying technology. 
>> 
>> Oh and whileI focused on bus, there are also rail and other modes of public 
>> transportation like light rail, trains, etc … 
>> 
>> HTH
>> 
>> -Mike
>> 
>> 
>>> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen <esa.heikki...@student.tut.fi 
>>> <mailto:esa.heikki...@student.tut.fi>> wrote:
>>> 
>>> 
>>> Do you know any good examples how to use Spark streaming in tracking public 
>>> transportation systems ?
>>> 
>>> Or Storm or some other tool example ?
>>> 
>>> Regards
>>> Esa Heikkinen
>>> 
>>> 28.4.2016, 3:16, Michael Segel kirjoitti:
>>>> Uhm… 
>>>> I think you need to clarify a couple of thing

Fwd: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Michael Segel

Doh! 
Wrong email account again! 

> Begin forwarded message:
> 
> From: Michael Segel <michael_se...@hotmail.com>
> Subject: Re: Spark support for Complex Event Processing (CEP)
> Date: April 27, 2016 at 7:16:55 PM CDT
> To: Mich Talebzadeh <mich.talebza...@gmail.com>
> Cc: Esa Heikkinen <esa.heikki...@student.tut.fi>, "user@spark" 
> <user@spark.apache.org>
> 
> Uhm… 
> I think you need to clarify a couple of things…
> 
> First there is this thing called analog signal processing…. Is that 
> continuous enough for you? 
> 
> But more to the point, Spark Streaming does micro batching so if you’re 
> processing a continuous stream of tick data, you will have more than 50K of 
> tics per second while there are markets open and trading.  Even at 50K a 
> second, that would mean 1 every .02 ms or 50 ticks a ms. 
> 
> And you don’t want to wait until you have a batch to start processing, but 
> you want to process when the data hits the queue and pull it from the queue 
> as quickly as possible. 
> 
> Spark streaming will be able to pull batches in as little as 500ms. So if you 
> pull a batch at t0 and immediately have a tick in your queue, you won’t 
> process that data until t0+500ms. And said batch would contain 25,000 
> entries. 
> 
> Depending on what you are doing… that 500ms delay can be enough to be fatal 
> to your trading process. 
> 
> If you don’t like stock data, there are other examples mainly when pulling 
> data from real time embedded systems. 
> 
> 
> If you go back and read what I said, if your data flow is >> (much slower) 
> than 500ms, and / or the time to process is >> 500ms ( much longer )  you 
> could use spark streaming.  If not… and there are applications which require 
> that type of speed…  then you shouldn’t use spark streaming. 
> 
> If you do have that constraint, then you can look at systems like 
> storm/flink/samza / whatever where you have a continuous queue and listener 
> and no micro batch delays.
> Then for each bolt (storm) you can have a spark context for processing the 
> data. (Depending on what sort of processing you want to do.) 
> 
> To put this in perspective… if you’re using spark streaming / akka / storm 
> /etc to handle real time requests from the web, 500ms added delay can be a 
> long time. 
> 
> Choose the right tool. 
> 
> For the OP’s problem. Sure Tracking public transportation could be done using 
> spark streaming. It could also be done using half a dozen other tools because 
> the rate of data generation is much slower than 500ms. 
> 
> HTH
> 
> 
>> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> couple of things.
>> 
>> There is no such thing as Continuous Data Streaming as there is no such 
>> thing as Continuous Availability.
>> 
>> There is such thing as Discrete Data Streaming and  High Availability  but 
>> they reduce the finite unavailability to minimum. In terms of business needs 
>> a 5 SIGMA is good enough and acceptable. Even the candles set to a 
>> predefined time interval say 2, 4, 15 seconds overlap. No FX savvy trader 
>> makes a sell or buy decision on the basis of 2 seconds candlestick
>> 
>> The calculation itself in measurements is subject to finite error as defined 
>> by their Confidence Level (CL) using Standard Deviation function.
>> 
>> OK so far I have never noticed a tool that requires that details of 
>> granularity. Those stuff from Flink etc is in practical term is of little 
>> value and does not make commercial sense.
>> 
>> Now with regard to your needs, Spark micro batching is perfectly adequate.
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 27 April 2016 at 22:10, Esa Heikkinen <esa.heikki...@student.tut.fi 
>> <mailto:esa.heikki...@student.tut.fi>> wrote:
>> 
>> Hi
>> 
>> Thanks for the answer.
>> 
>> I have developed a log file analyzer for RTPIS (Real Time Passenger 
>> Information System) system, where buses drive lines and the system try to 
>> estimate the arrival times to the bus stops. There are many different log 
>> files (and events) and analyzing situation can be very complex. Also spatial 
>> data can be incl

Fwd: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Michael Segel
Sorry sent from wrong email address. 

> Begin forwarded message:
> 
> From: Michael Segel <michael_se...@hotmail.com>
> Subject: Re: Spark support for Complex Event Processing (CEP)
> Date: April 27, 2016 at 7:51:14 AM CDT
> To: Mich Talebzadeh <mich.talebza...@gmail.com>
> Cc: Esa Heikkinen <esa.heikki...@student.tut.fi>, "user @spark" 
> <user@spark.apache.org>
> 
> Spark and CEP? It depends… 
> 
> Ok, I know that’s not the answer you want to hear, but its a bit more 
> complicated… 
> 
> If you consider Spark Streaming, you have some issues. 
> Spark Streaming isn’t a Real Time solution because it is a micro batch 
> solution. The smallest Window is 500ms.  This means that if your compute time 
> is >> 500ms and/or  your event flow is >> 500ms this could work.
> (e.g. 'real time' image processing on a system that is capturing 60FPS 
> because the processing time is >> 500ms. ) 
> 
> So Spark Streaming wouldn’t be the best solution…. 
> 
> However, you can combine spark with other technologies like Storm, Akka, etc 
> .. where you have continuous streaming. 
> So you could instantiate a spark context per worker in storm… 
> 
> I think if there are no class collisions between Akka and Spark, you could 
> use Akka, which may have a better potential for communication between 
> workers. 
> So here you can handle CEP events. 
> 
> HTH
> 
>> On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> please see my other reply
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 27 April 2016 at 10:40, Esa Heikkinen <esa.heikki...@student.tut.fi 
>> <mailto:esa.heikki...@student.tut.fi>> wrote:
>> Hi
>> 
>> I have followed with interest the discussion about CEP and Spark. It is 
>> quite close to my research, which is a complex analyzing for log files and 
>> "history" data  (not actually for real time streams).
>> 
>> I have few questions:
>> 
>> 1) Is CEP only for (real time) stream data and not for "history" data?
>> 
>> 2) Is it possible to search "backward" (upstream) by CEP with given time 
>> window? If a start time of the time window is earlier than the current 
>> stream time.
>> 
>> 3) Do you know any good tools or softwares for "CEP's" using for log data ?
>> 
>> 4) Do you know any good (scientific) papers i should read about CEP ?
>> 
>> 
>> Regards
>> PhD student at Tampere University of Technology, Finland, www.tut.fi 
>> <http://www.tut.fi/>
>> Esa Heikkinen
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> <mailto:user-h...@spark.apache.org>
>> 
>> 
> 
> The opinions expressed here are mine, while they may reflect a cognitive 
> thought, that is purely accidental. 
> Use at your own risk. 
> Michael Segel
> michael_segel (AT) hotmail.com <http://hotmail.com/>
> 
> 
> 
> 
> 




Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Michael Segel
Look, you said that you didn’t have continuous data, but you do have continuous 
data. I just used an analog signal which can be converted, so that you end up 
with contiguous digital sampling.  

The point is that you have to consider that micro batches are still batched and 
you’re adding latency. 
Even at 500ms, if you’re dealing with a high velocity low latency stream, that 
delay can kill you. 

Time is relative.  Which is why Spark Streaming isn’t good enough for *all* 
streaming. It wasn’t designed to be a contiguous stream. 
And by contiguous I mean that at any given time, there will be an inbound 
message in the queue. 
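
To make the micro-batch point concrete, here is a minimal sketch (Scala, Spark 1.x 
streaming API; the host, port and app name are made up). The batch interval is fixed 
when the StreamingContext is created, so every event waits for its batch to close 
before any processing starts:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// Minimal sketch: the batch interval is baked into the context up front.
val conf = new SparkConf().setAppName("MicroBatchLatencySketch")
val ssc = new StreamingContext(conf, Milliseconds(500))   // 500ms micro batches

// Hypothetical source: a socket feed of events.
val events = ssc.socketTextStream("localhost", 9999)
events.count().print()    // even a trivial count only surfaces once per 500ms batch

ssc.start()
ssc.awaitTermination()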


> On Apr 28, 2016, at 3:57 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Also the point about
> 
> "First there is this thing called analog signal processing…. Is that 
> continuous enough for you? "
> 
> I agree  that analog signal processing like a sine wave,  an AM radio signal 
> – is truly continuous. However,  here we are talking about digital data which 
> will always be sent as bytes and typically with bytes grouped into messages . 
> In other words when we are sending data it is never truly continuous.  We are 
> sending discrete messages.
> 
> HTH,
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
> On 28 April 2016 at 17:22, Mich Talebzadeh <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>> wrote:
> In a commerical (C)EP like say StreamBase, or for example its competitor 
> Apama, the arrival of an input event **immediately** triggers further 
> downstream processing.
> 
> This is admittedly an asynchronous approach, not a synchronous clock-driven 
> micro-batch approach like Spark's.
> 
> I suppose if one wants to split hairs / be philosophical, the clock rate of 
> the microprocessor chip underlies everything.  But I don't think that is 
> quite the point.
> 
> The point is that an asynchronous event-driven approach is as continuous / 
> immediate as **the underlying computer hardware will ever allow**. It is not 
> limited by an architectural software clock.
> 
> So it is asynchronous vs synchronous that is the key issue, not just the 
> exact speed of the software clock in the synchronous approach.
> 
> It is also indeed true that latencies down to the single-digit microsecond 
> level can sometimes matter in financial trading, but rarely.
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
> On 28 April 2016 at 12:44, Michael Segel <msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>> wrote:
> I don’t.
> 
> I believe that there have been a  couple of hack-a-thons like one done in 
> Chicago a few years back using public transportation data.
> 
> The first question is what sort of data do you get from the city? 
> 
> I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y).   Or 
> they could provide more information. Like last stop, distance to next stop, 
> avg current velocity… 
> 
> Then there is the frequency of the updates. Every second? Every 3 seconds? 5 
> or 6 seconds…
> 
> This will determine how much work you have to do. 
> 
> Maybe they provide the routes of the busses via a different API call since 
> it’s relatively static.
> 
> This will drive your solution more than the underlying technology. 
> 
> Oh and while I focused on buses, there are also rail and other modes of public 
> transportation like light rail, trains, etc … 
> 
> HTH
> 
> -Mike
> 
> 
>> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen <esa.heikki...@student.tut.fi 
>> <mailto:esa.heikki...@student.tut.fi>> wrote:
>> 
>> 
>> Do you know any good examples how to use Spark streaming in tracking public 
>> transportation systems ?
>> 
>> Or Storm or some other tool example ?
>> 
>> Regards
>> Esa Heikkinen
>> 
>> 28.4.2016, 3:16, Michael Segel kirjoitti:
>>> Uhm… 
>>> I think you need to clarify a couple of things…
>>> 
>>> First there is this thing called analog signal processing…. Is that 
>>> continuous enough for you? 
>>> 
>>> But 

Re: Silly Question on my part...

2016-05-17 Thread Michael Segel
Thanks for the response. 

That’s what I thought, but I didn’t want to assume anything. 
(You know what happens when you ass u me … :-) 


Not sure about Tachyon though.  It’s a thought, but I’m very conservative when 
it comes to design choices. 


> On May 16, 2016, at 5:21 PM, John Trengrove <john.trengr...@servian.com.au> 
> wrote:
> 
> If you are wanting to share RDDs it might be a good idea to check out Tachyon 
> / Alluxio.
> 
> For the Thrift server, I believe the datasets are located in your Spark 
> cluster as RDDs and you just communicate with it via the Thrift JDBC 
> Distributed Query Engine connector.
> 
> 2016-05-17 5:12 GMT+10:00 Michael Segel <msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>>:
> For one use case.. we were considering using the thrift server as a way to 
> allow multiple clients access shared RDDs.
> 
> Within the Thrift Context, we create an RDD and expose it as a hive table.
> 
> The question  is… where does the RDD exist. On the Thrift service node 
> itself, or is that just a reference to the RDD which is contained with 
> contexts on the cluster?
> 
> 
> Thx
> 
> -Mike
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 
> 



Silly Question on my part...

2016-05-16 Thread Michael Segel
For one use case.. we were considering using the thrift server as a way to 
allow multiple clients access shared RDDs. 

Within the Thrift Context, we create an RDD and expose it as a hive table. 

The question  is… where does the RDD exist. On the Thrift service node itself, 
or is that just a reference to the RDD which is contained with contexts on the 
cluster? 


Thx

-Mike
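
For what it’s worth, the pattern I’m describing looks roughly like the sketch below 
(Spark 1.x APIs; the path and table name are made up). The cached DataFrame lives in 
the executors of whichever application owns this context; the Thrift service node 
only holds the driver-side handle:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sc = new SparkContext(new SparkConf().setAppName("SharedRddThrift"))
val hiveContext = new HiveContext(sc)

// Build and cache the data set inside this application's context...
val shared = hiveContext.read.parquet("/data/shared/events")   // hypothetical path
shared.cache()
shared.registerTempTable("shared_events")                      // ...and expose it as a table

// ...then start the thrift service against that same context so JDBC clients see it.
HiveThriftServer2.startWithContext(hiveContext)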


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SPARK - DataFrame for BulkLoad

2016-05-18 Thread Michael Segel
Yes, but he’s using Phoenix, which may not work cleanly with your HBase spark 
module. 
The key issue here may be Phoenix, which is separate from HBase. 
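
For reference, a DataFrame save through the phoenix-spark connector usually looks 
something like the sketch below (the table name and ZooKeeper URL are assumptions, 
and the phoenix-spark jar has to be on the classpath). As far as I understand, this 
path issues Phoenix UPSERTs rather than building HFiles, so it isn’t a classic HBase 
bulk load:

import org.apache.spark.sql.SaveMode

// Sketch only: stand-in for the 100M-row DataFrame produced by the job upstream.
val df = sqlContext.table("staging_rows")

df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "MY_PHOENIX_TABLE")        // assumed Phoenix table name
  .option("zkUrl", "zkhost1,zkhost2:2181")    // assumed ZooKeeper quorum
  .save()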


> On May 18, 2016, at 5:36 AM, Ted Yu  wrote:
> 
> Please see HBASE-14150
> 
> The hbase-spark module would be available in the upcoming hbase 2.0 release.
> 
> On Tue, May 17, 2016 at 11:48 PM, Takeshi Yamamuro  > wrote:
> Hi,
> 
> Have you checked this?
> http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E
>  
> 
> 
> // maropu
> 
> On Wed, May 18, 2016 at 1:14 PM, Mohanraj Ragupathiraj  > wrote:
> I have 100 million records to be inserted to a HBase table (PHOENIX) as a 
> result of a Spark Job. I would like to know if i convert it to a Dataframe 
> and save it, will it do Bulk load (or) it is not the efficient way to write 
> data to a HBase table
> 
> -- 
> Thanks and Regards
> Mohan
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
> 



Re: removing header from csv file

2016-05-03 Thread Michael Segel
Hi, 
Another silly question… 

Don’t you want to use the header line to help create a schema for the RDD? 

Thx

-Mike
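
What I had in mind is roughly the sketch below (assumes the databricks spark-csv 
package is on the classpath, Spark 1.x; the input path is made up): let the header 
name the columns instead of just dropping it.

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line supplies the column names
  .option("inferSchema", "true")   // sample the data to guess column types
  .load("/data/incoming/*.csv")    // hypothetical path

df.printSchema()                   // schema built from the header, no manual dropHeader needed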

> On May 3, 2016, at 8:09 AM, Mathieu Longtin  wrote:
> 
> This only works if the files are "unsplittable". For example gzip files, each 
> partition is one file (if you have more partitions than files), so the first 
> line of each partition is the header.
> 
> Spark-csv extensions reads the very first line of the RDD, assumes it's the 
> header, and then filters every occurrence of that line. Something like this 
> (python code here, but Scala should be very similar)
> 
> header = data.first()
> data = data.filter(lambda line: line != header)
> 
> Since I had lots of small CSV files, and not all of them have the same exact 
> header, I use the following:
> 
> file_list = sc.parallelize(list_of_csv)
> data = 
> file_list.flatMap(function_that_reads_csvs_and_extracts_the_colums_I_want)
> 
> 
> 
> 
> On Tue, May 3, 2016 at 3:23 AM Abhishek Anand  > wrote:
> You can use this function to remove the header from your dataset(applicable 
> to RDD)
> 
> def dropHeader(data: RDD[String]): RDD[String] = {
> data.mapPartitionsWithIndex((idx, lines) => {
>   if (idx == 0) {
> lines.drop(1)
>   }
>   lines
> })
> }
> 
> 
> Abhi 
> 
> On Wed, Apr 27, 2016 at 12:55 PM, Marco Mistroni  > wrote:
> If u r using Scala api you can do
> Myrdd.zipwithindex.filter(_._2 >0).map(_._1)
> 
> Maybe a little bit complicated but will do the trick
> As per spark CSV, you will get back a data frame which you can reconduct to 
> rdd. .
> Hth
> Marco
> 
> On 27 Apr 2016 6:59 am, "nihed mbarek"  > wrote:
> You can add a filter with string that you are sure available only in the 
> header 
> 
> Le mercredi 27 avril 2016, Divya Gehlot  > a écrit :
> yes you can remove the headers by removing the first row 
> 
> can first() or head() to do that 
> 
> 
> Thanks,
> Divya 
> 
> On 27 April 2016 at 13:24, Ashutosh Kumar > wrote:
> I see there is a library spark-csv which can be used for removing header and 
> processing of csv files. But it seems it works with sqlcontext only. Is there 
> a way to remove header from csv files without sqlcontext ? 
> 
> Thanks
> Ashutosh
> 
> 
> 
> -- 
> 
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com 
> 
>  
> 
> 
> -- 
> Mathieu Longtin
> 1-514-803-8977



Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Michael Segel
Silly question? 

If you change the predicate to 
( s.date >= ‘2016-01-03’ OR s.date IS NULL ) 
AND 
(d.date >= ‘2016-01-03’ OR d.date IS NULL) 

What do you get? 

Sorry if the syntax isn’t 100% correct. The idea is to not drop null values 
from the query. 
I would imagine that this shouldn’t kill performance since it’s most likely a 
post join filter on the result set? 
(Or is that just a Hive thing?) 

-Mike
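
Spelled out against the tables from this thread, the rewrite I’m suggesting would 
look something like this (same caveat as above, the syntax isn’t guaranteed):

sqlContext.sql("""
  SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc,
         s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend
  FROM swig_pin_promo_lt s
  FULL OUTER JOIN dps_pin_promo_lt d
    ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)
  WHERE (s.date >= '2016-01-03' OR s.date IS NULL)
    AND (d.date >= '2016-01-03' OR d.date IS NULL)
""").count()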

> On May 3, 2016, at 12:42 PM, Davies Liu  wrote:
> 
> Bingo, the two predicate s.date >= '2016-01-03' AND d.date >=
> '2016-01-03' is the root cause,
> which will filter out all the nulls from outer join, will have same
> result as inner join.
> 
> In Spark 2.0, we turn these join into inner join actually.
> 
> On Tue, May 3, 2016 at 9:50 AM, Cesar Flores  wrote:
>> Hi
>> 
>> Have you tried the joins without the where clause? When you use them you are
>> filtering all the rows with null columns in those fields. In other words you
>> are doing a inner join in all your queries.
>> 
>> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta 
>> wrote:
>>> 
>>> Hi Kevin,
>>> 
>>> Having given it a first look I do think that you have hit something here
>>> and this does not look quite fine. I have to work on the multiple AND
>>> conditions in ON and see whether that is causing any issues.
>>> 
>>> Regards,
>>> Gourav Sengupta
>>> 
>>> On Tue, May 3, 2016 at 8:28 AM, Kevin Peng  wrote:
 
 Davies,
 
 Here is the code that I am typing into the spark-shell along with the
 results (my question is at the bottom):
 
 val dps =
 sqlContext.read.format("com.databricks.spark.csv").option("header",
 "true").load("file:///home/ltu/dps_csv/")
 val swig =
 sqlContext.read.format("com.databricks.spark.csv").option("header",
 "true").load("file:///home/ltu/swig_csv/")
 
 dps.count
 res0: Long = 42694
 
 swig.count
 res1: Long = 42034
 
 
 dps.registerTempTable("dps_pin_promo_lt")
 swig.registerTempTable("swig_pin_promo_lt")
 
 sqlContext.sql("select * from dps_pin_promo_lt where date >
 '2016-01-03'").count
 res4: Long = 42666
 
 sqlContext.sql("select * from swig_pin_promo_lt where date >
 '2016-01-03'").count
 res5: Long = 34131
 
 sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt
 where date > '2016-01-03'").count
 res6: Long = 42533
 
 sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt
 where date > '2016-01-03'").count
 res7: Long = 34131
 
 
 sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
 AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
 d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
 dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad 
 =
 d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
 res9: Long = 23809
 
 
 sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
 AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
 d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
 dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad 
 =
 d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
 res10: Long = 23809
 
 
 sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
 AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
 d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
 dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad 
 =
 d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
 res11: Long = 23809
 
 
 sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
 AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
 d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
 dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad 
 =
 d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
 res12: Long = 23809
 
 
 
 From my results above, we notice that the counts of distinct values based
 on the join criteria and filter criteria for each individual table is
 located at res6 and res7.  My question is why is the outer join producing
 less rows than the smallest table; if there are no matches it should still
 bring in that row as part of the outer join.  For the full and right outer
 join I am expecting to see a minimum of res6 rows, but I get less, is there
 something specific that I am missing here?  I am expecting that the full
 outer join would give me the union of the two table sets so I am expecting

Re: inter spark application communication

2016-04-18 Thread Michael Segel
have you thought about Akka? 

What are you trying to send? Why do you want them to talk to one another? 

> On Apr 18, 2016, at 12:04 PM, Soumitra Johri  
> wrote:
> 
> Hi,
> 
> I have two applications : App1 and App2. 
> On a single cluster I have to spawn 5 instances os App1 and 1 instance of 
> App2.
> 
> What would be the best way to send data from the 5 App1 instances to the 
> single App2 instance ?
> 
> Right now I am using Kafka to send data from one spark application to the 
> spark application  but the setup doesn't seem right and I hope there is a 
> better way to do this.
> 
> Warm Regards
> Soumitra


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: inter spark application communication

2016-04-18 Thread Michael Segel
Hi, 

Putting this back in User email list… 

Ok, 

So it’s not inter-job communication, but taking the output of Job 1 and using it 
as input to job 2. 
(Chaining of jobs.) 

So you don’t need anything fancy… 
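
A plain hand-off would look something like the sketch below (paths and column names 
are made up): each App1 instance persists its counts, and the single App2 instance 
reads them back and aggregates. No broker in between.

// --- in each App1 instance ---
val counts = sqlContext.createDataFrame(Seq(("host-a", 42L), ("host-b", 7L)))
  .toDF("key", "count")                                    // stand-in for the real computation
counts.write.mode("append").parquet("/data/handoff/app1_counts")

// --- in the single App2 instance ---
val fromApp1 = sqlContext.read.parquet("/data/handoff/app1_counts")
fromApp1.groupBy("key").sum("count").show()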


> On Apr 18, 2016, at 1:47 PM, Soumitra Johri <soumitra.siddha...@gmail.com> 
> wrote:
> 
> Yes , I want to chain spark applications. 
> On Mon, Apr 18, 2016 at 4:46 PM Michael Segel <msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>> wrote:
> Yes, but I’m confused. Are you chaining your spark jobs? So you run job one 
> and its output is the input to job 2? 
> 
>> On Apr 18, 2016, at 1:21 PM, Soumitra Johri <soumitra.siddha...@gmail.com 
>> <mailto:soumitra.siddha...@gmail.com>> wrote:
>> 
>> By Akka you mean the Akka Actors ?
>> 
>> I am sending basically sending counts which are further aggregated on the 
>> second  application.
>> 
>> 
>> Warm Regards
>> Soumitra
>> 
>> On Mon, Apr 18, 2016 at 3:54 PM, Michael Segel <msegel_had...@hotmail.com 
>> <mailto:msegel_had...@hotmail.com>> wrote:
>> have you thought about Akka?
>> 
>> What are you trying to send? Why do you want them to talk to one another?
>> 
>> > On Apr 18, 2016, at 12:04 PM, Soumitra Johri <soumitra.siddha...@gmail.com 
>> > <mailto:soumitra.siddha...@gmail.com>> wrote:
>> >
>> > Hi,
>> >
>> > I have two applications : App1 and App2.
>> > On a single cluster I have to spawn 5 instances os App1 and 1 instance of 
>> > App2.
>> >
>> > What would be the best way to send data from the 5 App1 instances to the 
>> > single App2 instance ?
>> >
>> > Right now I am using Kafka to send data from one spark application to the 
>> > spark application  but the setup doesn't seem right and I hope there is a 
>> > better way to do this.
>> >
>> > Warm Regards
>> > Soumitra
>> 
>> 
> 



Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Michael Segel
Perhaps this is a silly question on my part…. 

Why do you want to start up HDFS on a single node?

You only mention one Windows machine in your description of your cluster. 
If this is a learning experience, why not run Hadoop in a VM? (MapR and I think 
the other vendors make Linux images that can run in a VM.) 

HTH

-Mike

> On Apr 18, 2016, at 10:27 AM, Jörn Franke  wrote:
> 
> I think the easiest would be to use a Hadoop Windows distribution, such as 
> Hortonworks. However, the Linux version of Hortonworks is a little bit more 
> advanced.
> 
> On 18 Apr 2016, at 14:13, My List  > wrote:
> 
>> Deepak,
>> 
>> The following could be a very dumb questions so pardon me for the same.
>> 1) When I download the binary for Spark with a version of Hadoop(Hadoop 2.6) 
>> does it not come in the zip or tar file?
>> 2) If it does not come along,Is there a Apache Hadoop for windows, is it in 
>> binary format or will have to build it?
>> 3) Is there a basic tutorial for Hadoop on windows for the basic needs of 
>> Spark.
>> 
>> Thanks in Advance !
>> 
>> On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma > > wrote:
>> Once you download hadoop and format the namenode , you can use start-dfs.sh 
>> to start hdfs.
>> Then use 'jps' to sss if datanode/namenode services are up and running.
>> 
>> Thanks
>> Deepak
>> 
>> On Mon, Apr 18, 2016 at 5:18 PM, My List > > wrote:
>> Hi ,
>> 
>> I am a newbie on Spark.I wanted to know how to start and verify if HDFS has 
>> started on Spark stand alone.
>> 
>> Env - 
>> Windows 7 - 64 bit
>> Spark 1.4.1 With Hadoop 2.6
>> 
>> Using Scala Shell - spark-shell
>> 
>> 
>> -- 
>> Thanks,
>> Harry
>> 
>> 
>> 
>> -- 
>> Thanks
>> Deepak
>> www.bigdatabig.com 
>> www.keosha.net 
>> 
>> 
>> -- 
>> Thanks,
>> Harmeet



Re: Spark SQL Transaction

2016-04-21 Thread Michael Segel
Hi, 

Sometimes terms get muddled over time.

If you’re not using transactions, then each database statement is atomic and is 
itself a transaction. 
So unless you have some explicit ‘Begin Work’ at the start…. your statements 
should be atomic and there will be no ‘redo’ or ‘commit’ or ‘rollback’. 

I don’t see anything in Spark’s documentation about transactions, so the 
statements should be atomic.  (I’m not a guru here so I could be missing 
something in Spark) 

If you’re seeing the connection drop unexpectedly and then a rollback, could 
this be a setting or configuration of the database? 
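
For reference, the write path being discussed is just the JDBC writer, along the 
lines of the sketch below (connection details and table names are made up). Nothing 
in this code opens an explicit transaction; whether the driver or server wraps the 
batch in one is a driver/database setting. (If I recall correctly, newer Spark 
releases also expose an isolationLevel option on the JDBC writer, with NONE skipping 
transactions, but check whether your version has it.)

import java.util.Properties

val props = new Properties()
props.setProperty("user", "app_user")        // illustrative credentials
props.setProperty("password", "secret")

// Stand-in for whatever DataFrame the job produced upstream.
val df = sqlContext.table("staging_results")
df.write
  .mode("append")
  .jdbc("jdbc:sqlserver://dbhost:1433;databaseName=sales", "dbo.target_table", props)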


> On Apr 19, 2016, at 1:18 PM, Andrés Ivaldi  wrote:
> 
> Hello, is possible to execute a SQL write without Transaction? we dont need 
> transactions to save our data and this adds an overhead to the SQLServer.
> 
> Regards.
> 
> -- 
> Ing. Ivaldi Andres


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Michael Segel
How many partitions in your data set. 

Per the Spark DataFrameWriter Java doc:
“
Saves the content of the DataFrame to an external database table via JDBC. In the case 
the table already exists in the external database, behavior of this function depends on 
the save mode, specified by the mode function (default to throwing an exception).
Don't create too many partitions in parallel on a large cluster; otherwise 
Spark might crash your external database systems.
”

This implies one connection per partition writing in parallel. So you could be 
swamping your database. 
Which database are you using? 

Also, how many hops? 
Network latency could also impact performance too… 
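
One cheap experiment along those lines (a sketch; the names and numbers are 
illustrative, not a recommendation): cap the number of concurrent connections by 
shrinking the partition count just before the write, and see whether the socket-write 
profile changes.

import java.util.Properties

val connectionProps = new Properties()
connectionProps.setProperty("user", "app_user")      // illustrative credentials
connectionProps.setProperty("password", "secret")
val url = "jdbc:postgresql://dbhost:5432/analytics"  // hypothetical target database

// Stand-in for the ~60 million row DataFrame from the original post.
val df = sqlContext.table("rows_to_load")

// Fewer partitions => fewer simultaneous JDBC connections hitting the database.
df.coalesce(8)
  .write
  .mode("append")
  .jdbc(url, "target_table", connectionProps)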

> On Apr 19, 2016, at 3:14 PM, Jonathan Gray  wrote:
> 
> Hi,
> 
> I'm trying to write ~60 million rows from a DataFrame to a database using 
> JDBC using Spark 1.6.1, something similar to df.write().jdbc(...)
> 
> The write seems to not be performing well.  Profiling the application with a 
> master of local[*] it appears there is not much socket write activity and 
> also not much CPU.
> 
> I would expect there to be an almost continuous block of socket write 
> activity showing up somewhere in the profile.
> 
> I can see that the top hot method involves 
> apache.spark.unsafe.platform.CopyMemory all from calls within 
> JdbcUtils.savePartition(...).  However, the CPU doesn't seem particularly 
> stressed so I'm guessing this isn't the cause of the problem.
> 
> Is there any best practices or has anyone come across a case like this before 
> where a write to a database seems to perform poorly?
> 
> Thanks,
> Jon



Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-13 Thread Michael Segel
I believe that there is one JVM for the Thrift Service and that there is only 
one context for the service. 

This would allow you to share RDDs across multiple jobs, however… not so great 
for security.

HTH… 


> On Jul 10, 2016, at 10:05 PM, Takeshi Yamamuro  > wrote:
> 
> Hi,
> 
> ISTM multiple sparkcontexts are not recommended in spark.
> See: https://issues.apache.org/jira/browse/SPARK-2243 
> 
> 
> // maropu
> 
> 
> On Mon, Jul 11, 2016 at 12:01 PM, ayan guha  > wrote:
> Hi
> 
> Can you try using JDBC interpreter with STS? We are using Zeppelin+STS on 
> YARN for few months now without much issue. 
> 
> On Mon, Jul 11, 2016 at 12:48 PM, Chanh Le  > wrote:
> Hi everybody,
> We are using Spark to query big data and currently we’re using Zeppelin to 
> provide a UI for technical users.
> Now we also need to provide a UI for business users so we use Oracle BI tools 
> and set up a Spark Thrift Server (STS) for it.
> 
> When I run both Zeppelin and STS throw error:
> 
> INFO [2016-07-11 09:40:21,905] ({pool-2-thread-4} 
> SchedulerFactory.java[jobStarted]:131) - Job remoteInterpretJob_1468204821905 
> started by scheduler org.apache.zeppelin.spark.SparkInterpreter835015739
>  INFO [2016-07-11 09:40:21,911] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Changing view acls to: giaosudau
>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Changing modify acls to: giaosudau
>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - SecurityManager: authentication disabled; ui acls disabled; users with view 
> permissions: Set(giaosudau); users with modify permissions: Set(giaosudau)
>  INFO [2016-07-11 09:40:21,918] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Starting HTTP Server
>  INFO [2016-07-11 09:40:21,919] ({pool-2-thread-4} Server.java[doStart]:272) 
> - jetty-8.y.z-SNAPSHOT
>  INFO [2016-07-11 09:40:21,920] ({pool-2-thread-4} 
> AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0:54818 
> 
>  INFO [2016-07-11 09:40:21,922] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Successfully started service 'HTTP class server' on port 54818.
>  INFO [2016-07-11 09:40:22,408] ({pool-2-thread-4} 
> SparkInterpreter.java[createSparkContext]:233) - -- Create new 
> SparkContext local[*] ---
>  WARN [2016-07-11 09:40:22,411] ({pool-2-thread-4} 
> Logging.scala[logWarning]:70) - Another SparkContext is being constructed (or 
> threw an exception in its constructor).  This may indicate an error, since 
> only one SparkContext may be running in this JVM (see SPARK-2243). The other 
> SparkContext was created at:
> 
> Is that mean I need to setup allow multiple context? Because It’s only test 
> in local with local mode If I deploy on mesos cluster what would happened?
> 
> Need you guys suggests some solutions for that. Thanks.
> 
> Chanh
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro



Re: Spark Thrift Server performance

2016-07-13 Thread Michael Segel
Hey, silly question? 

If you’re running a load balancer, are you trying to reuse the RDDs between 
jobs? 

TIA
-Mike

> On Jul 13, 2016, at 9:08 AM, ayan guha  > wrote:
> 
> My 2 cents:
> 
> Yes, we are running multiple STS (we are running on different nodes, but you 
> can run on same node, different ports). Using Ambari, it is really convenient 
> to manage. 
> 
> We have set up a nginx load balancer as well pointing to both services and 
> all our external BI tools connect to the load balancer. 
> 
> STS works as an YARN Client application, where STS is the driver. 
> 
> 
> 
> On Wed, Jul 13, 2016 at 5:33 PM, Mich Talebzadeh  > wrote:
> Hi,
> 
> I need some feedback on the performance of the Spark Thrift Server (STS) 
> 
> As far I can ascertain one can start STS passing the usual spark parameters
> 
> ${SPARK_HOME}/sbin/start-thriftserver.sh \
> --master spark://50.140.197.217:7077 
>  \
> --hiveconf hive.server2.thrift.port=10055 \
> --packages  \
> --driver-memory 2G \
> --num-executors 2 \
> --executor-memory 2G \
> --conf "spark.scheduler.mode=FAIR" \
> --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps" \
> --jars  \
> --conf "spark.ui.port=12345" 
> 
> 
>   And accessing it via beeline JDBC client
> 
> beeline -u jdbc:hive2://rhes564:10055 -n hduser -p
> 
> Now the questions I have
> 
> What is the limit on the number of users accessing the thrift server.
> Clearly the thrift server can start with resource configuration. In a simple 
> way does STS act as a gateway to Spark (meaning Spark apps can use their own 
> resources) or one is limited to resource that STS offers?
> Can one start multiple thrift servers
> As far as I can see STS is equivalent to Spark SQL accessing Hive DW. Indeed 
> this is what it says:
> 
> Connecting to jdbc:hive2://rhes564:10055
> Connected to: Spark SQL (version 1.6.1)
> Driver: Spark Project Core (version 1.6.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 1.6.1 by Apache Hive
> 0: jdbc:hive2://rhes564:10055>
> 
> Thanks
> 
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha



Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Michael Segel
I think you need to learn the basics of how to build a ‘data lake/pond/sewer’ 
first. 

The short answer is yes. 
The longer answer is that you need to think more about translating a relational 
model into a hierarchical model, something that I seriously doubt has been 
taught in schools in a very long time.  

Then there’s more to the design, including indexing. 
Do you want to stick with SQL or do you want to hand code the work to allow for 
indexing / secondary indexing to help with the filtering since Spark SQL 
doesn’t really handle indexing? Note that you could actually still use an index 
table (narrow/thin inverted table) and join against the base table to get 
better performance. 

There’s more to this, but you get the idea.

HTH

-Mike
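
To make the thin index-table idea concrete, here is a rough sketch (all table, path 
and column names are invented): keep a narrow table of (filter key, row key), filter 
that first, then join back to the wide base table so only the matching rows get 
touched.

import sqlContext.implicits._

val base  = sqlContext.read.parquet("/lake/orders_base")         // wide rows keyed by order_id
val index = sqlContext.read.parquet("/lake/orders_by_customer")  // narrow (customer_id, order_id)

val wanted = index.filter($"customer_id" === "C-1001").select("order_id")
val result = wanted.join(base, Seq("order_id"))                  // narrow filter drives the join
result.show()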

> On Jul 6, 2016, at 2:25 PM, dabuki  wrote:
> 
> I was thinking about to replace a legacy batch job with Spark, but I'm not
> sure if Spark is suited for this use case. Before I start the proof of
> concept, I wanted to ask for opinions.
> 
> The legacy job works as follows: A file (100k - 1 mio entries) is iterated.
> Every row contains a (book) order with an id and for each row approx. 15
> processing steps have to be performed that involve access to multiple
> database tables. In total approx. 25 tables (each containing 10k-700k
> entries) have to be scanned using the book's id and the retrieved data is
> joined together. 
> 
> As I'm new to Spark I'm not sure if I can leverage Spark's processing model
> for this use case.
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-suited-for-replacing-a-batch-job-using-many-database-tables-tp27300.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Michael Segel
Sorry, 
I was assuming that you wanted to build the data lake in Hadoop rather than 
just reading from DB2. (Data Lakes need to be built correctly. ) 

So, slightly different answer.

Yes, you can do this… 

You will end up with an immutable copy of the data that you would read in 
serially. Then you will probably need to repartition the data, depending on 
size and how much parallelization you want.  And then run the batch processing. 

But I have to ask why? 
Are you having issues with DB2? 
Are your batch jobs interfering with your transactional work? 

You will have a hit up front as you read the data from DB2, but then, depending 
on how you use the data… you may be faster overall. 

Please don’t misunderstand, Spark is a viable solution, however… there’s a bit 
of heavy lifting that has to occur (e.g. building and maintaining a spark 
cluster) and there are alternatives out there that work. 

Performance of the DB2 tables will vary based on indexing, assuming you have 
the appropriate indexes in place. 

You could also look at Apache Drill too. 

HTH 
-Mike
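
As a rough sketch of the up-front read I mentioned (every name, bound and URL below 
is an assumption about your DB2 setup): splitting the JDBC read over a numeric column 
gets you parallel pulls instead of one long serial read, and caching gives you the 
immutable copy to run the batch steps against.

import java.util.Properties

val props = new Properties()
props.setProperty("user", "db2user")         // illustrative credentials
props.setProperty("password", "secret")

val orders = sqlContext.read.jdbc(
  "jdbc:db2://db2host:50000/SALES",          // hypothetical DB2 URL
  "APPSCHEMA.ORDERS",                        // one of the ~25 tables
  columnName = "ORDER_ID",                   // numeric column to split on
  lowerBound = 1L,
  upperBound = 700000L,
  numPartitions = 8,
  connectionProperties = props)

orders.cache()                               // immutable snapshot for the batch steps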



> On Jul 6, 2016, at 3:24 PM, Andreas Bauer <dabuks...@gmail.com> wrote:
> 
> Thanks for the advice. I have to retrieve the basic data from the DB2 tables 
> but afterwards I'm pretty free to transform the data as needed. 
> 
> 
> 
> On 6. Juli 2016 um 22:12:26 MESZ, Michael Segel <msegel_had...@hotmail.com> 
> wrote:
>> I think you need to learn the basics of how to build a ‘data 
>> lake/pond/sewer’ first. 
>> 
>> The short answer is yes. 
>> The longer answer is that you need to think more about translating a 
>> relational model in to a hierarchical model, something that I seriously 
>> doubt has been taught in schools in a very long time. 
>> 
>> Then there’s more to the design, including indexing. 
>> Do you want to stick with SQL or do you want to hand code the work to allow 
>> for indexing / secondary indexing to help with the filtering since Spark SQL 
>> doesn’t really handle indexing. Note that you could actually still use an 
>> index table (narrow/thin inverted table) and join against the base table to 
>> get better performance. 
>> 
>> There’s more to this, but you get the idea.
>> 
>> HTH
>> 
>> -Mike
>> 
>> > On Jul 6, 2016, at 2:25 PM, dabuki wrote:
>> > 
>> > I was thinking about to replace a legacy batch job with Spark, but I'm not
>> > sure if Spark is suited for this use case. Before I start the proof of
>> > concept, I wanted to ask for opinions.
>> > 
>> > The legacy job works as follows: A file (100k - 1 mio entries) is iterated.
>> > Every row contains a (book) order with an id and for each row approx. 15
>> > processing steps have to be performed that involve access to multiple
>> > database tables. In total approx. 25 tables (each containing 10k-700k
>> > entries) have to be scanned using the book's id and the retrieved data is
>> > joined together. 
>> > 
>> > As I'm new to Spark I'm not sure if I can leverage Spark's processing model
>> > for this use case.
>> > 
>> > 
>> > 
>> > 
>> > 
>> > --
>> > View this message in context: 
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-suited-for-replacing-a-batch-job-using-many-database-tables-tp27300.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> > 
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> > 
>> > 
>> 
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Michael Segel
Did the OP say he was running a stand alone cluster of Spark, or on Yarn? 


> On Jul 5, 2016, at 10:22 AM, Mich Talebzadeh  
> wrote:
> 
> Hi Jakub,
> 
> Any reason why you are running in standalone mode, given that your are 
> familiar with YARN?
> 
> In theory your settings are correct. I checked your environment tab settings 
> and they look correct.
> 
> I assume you have checked this link
> 
> http://spark.apache.org/docs/latest/spark-standalone.html 
> 
> 
> BTW is this issue confined to ML or any other Spark application exhibits the 
> same behaviour in standalone mode?
> 
> 
> HTH
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 5 July 2016 at 11:17, Jacek Laskowski  > wrote:
> Hi Jakub,
> 
> You're correct - spark.masterspark://master.clust:7077 - proves your 
> point. You're running Spark Standalone that was set in 
> conf/spark-defaults.conf perhaps.
> 
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/ 
> Mastering Apache Spark http://bit.ly/mastering-apache-spark 
> 
> Follow me at https://twitter.com/jaceklaskowski 
> 
> 
> On Tue, Jul 5, 2016 at 12:04 PM, Jakub Stransky  > wrote:
> Hello,
> 
> I am convinced that we are not running in local mode:
> 
> Runtime Information
> 
> NameValue
> Java Home/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
> Java Version1.7.0_65 (Oracle Corporation)
> Scala Versionversion 2.10.5
> Spark Properties
> 
> NameValue
> spark.app.id app-20160704121044-0003
> spark.app.name DemoApp
> spark.driver.extraClassPath/home/sparkuser/sqljdbc4.jar
> spark.driver.host10.2.0.4
> spark.driver.memory4g
> spark.driver.port59493
> spark.executor.extraClassPath/usr/local/spark-1.6.1/sqljdbc4.jar
> spark.executor.id driver
> spark.executor.memory12g
> spark.externalBlockStore.folderName
> spark-5630dd34-4267-462e-882e-b382832bb500
> spark.jarsfile:/home/sparkuser/SparkPOC.jar
> spark.masterspark://master.clust:7077
> spark.scheduler.modeFIFO
> spark.submit.deployModeclient
> System Properties
> 
> NameValue
> SPARK_SUBMITtrue
> awt.toolkitsun.awt.X11.XToolkit
> file.encodingUTF-8
> file.encoding.pkgsun.io 
> file.separator/
> java.awt.graphicsenvsun.awt.X11GraphicsEnvironment
> java.awt.printerjobsun.print.PSPrinterJob
> java.class.version51.0
> java.endorsed.dirs
> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/endorsed
> java.ext.dirs
> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/ext:/usr/java/packages/lib/ext
> java.home/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
> java.io.tmpdir/tmp
> java.library.path
> /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> java.runtime.name OpenJDK Runtime Environment
> java.runtime.version1.7.0_65-mockbuild_2014_07_16_06_06-b00
> java.specification.name Java Platform 
> API Specification
> java.specification.vendorOracle Corporation
> java.specification.version1.7
> java.vendorOracle Corporation
> java.vendor.urlhttp://java.oracle.com/ 
> java.vendor.url.bughttp://bugreport.sun.com/bugreport/ 
> 
> java.version1.7.0_65
> java.vm.info mixed mode
> java.vm.name OpenJDK 64-Bit Server VM
> java.vm.specification.name Java 
> Virtual Machine Specification
> java.vm.specification.vendorOracle Corporation
> java.vm.specification.version1.7
> java.vm.vendorOracle Corporation
> java.vm.version24.65-b04
> line.separator
> os.archamd64
> os.name Linux
> os.version2.6.32-431.29.2.el6.x86_64
> path.separator:
> sun.arch.data.model64
> sun.boot.class.path
> 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
I don’t think that it would be a good comparison. 

If memory serves, Tez w LLAP is going to be running a separate engine that is 
constantly running, no? 

Spark?  That runs under hive… 

Unless you’re suggesting that the spark context is constantly running as part 
of the hiveserver2? 

> On May 23, 2016, at 6:51 PM, Jörn Franke  wrote:
> 
> 
> Hi Mich,
> 
> I think these comparisons are useful. One interesting aspect could be 
> hardware scalability in this context. Additionally different type of 
> computations. Furthermore, one could compare Spark and Tez+llap as execution 
> engines. I have the gut feeling that  each one can be justified by different 
> use cases.
> Nevertheless, there should be always a disclaimer for such comparisons, 
> because Spark and Hive are not good for a lot of concurrent lookups of single 
> rows. They are not good for frequently write small amounts of data (eg sensor 
> data). Here hbase could be more interesting. Other use cases can justify 
> graph databases, such as Titan, or text analytics/ data matching using Solr 
> on Hadoop.
> Finally, even if you have a lot of data you need to think if you always have 
> to process everything. For instance, I have found valid use cases in practice 
> where we decided to evaluate 10 machine learning models in parallel on only a 
> sample of data and only evaluate the "winning" model of the total of data.
> 
> As always it depends :) 
> 
> Best regards
> 
> P.s.: at least Hortonworks has in their distribution spark 1.5 with hive 1.2 
> and spark 1.6 with hive 1.2. Maybe they have somewhere described how to 
> manage bringing both together. You may check also Apache Bigtop (vendor 
> neutral distribution) on how they managed to bring both together.
> 
> On 23 May 2016, at 01:42, Mich Talebzadeh  > wrote:
> 
>> Hi,
>>  
>> I have done a number of extensive tests using Spark-shell with Hive DB and 
>> ORC tables.
>>  
>> Now one issue that we typically face is and I quote:
>>  
>> Spark is fast as it uses Memory and DAG. Great but when we save data it is 
>> not fast enough
>> 
>> OK but there is a solution now. If you use Spark with Hive and you are on a 
>> descent version of Hive >= 0.14, then you can also deploy Spark as execution 
>> engine for Hive. That will make your application run pretty fast as you no 
>> longer rely on the old Map-Reduce for Hive engine. In a nutshell what you 
>> are gaining speed in both querying and storage.
>>  
>> I have made some comparisons on this set-up and I am sure some of you will 
>> find it useful.
>>  
>> The version of Spark I use for Spark queries (Spark as query tool) is 1.6.
>> The version of Hive I use in Hive 2
>> The version of Spark I use as Hive execution engine is 1.3.1 It works and 
>> frankly Spark 1.3.1 as an execution engine is adequate (until we sort out 
>> the Hadoop libraries mismatch).
>>  
>> An example I am using Hive on Spark engine to find the min and max of IDs 
>> for a table with 1 billion rows:
>>  
>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id), 
>> stddev(id) from oraclehadoop.dummy;
>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>  
>>  
>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>  
>> INFO  : Completed compiling 
>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); 
>> Time taken: 1.911 seconds
>> INFO  : Executing 
>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): 
>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>> INFO  : Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>> INFO  : Total jobs = 1
>> INFO  : Launching Job 1 out of 1
>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>  
>> Query Hive on Spark job[0] stages:
>> 0
>> 1
>> Status: Running (Hive on Spark job[0])
>> Job Progress Format
>> CurrentTime StageId_StageAttemptId: 
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
>> [StageCost]
>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1
>> INFO  :
>> Query Hive on Spark job[0] stages:
>> INFO  : 0
>> INFO  : 1
>> INFO  :
>> Status: Running (Hive on Spark job[0])
>> INFO  : Job Progress Format
>> CurrentTime StageId_StageAttemptId: 
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
>> [StageCost]
>> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1
>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
Just a clarification. 

Tez is ‘vendor’ independent.  ;-) 

Yeah… I know…  Anyone can support it.  Only Hortonworks has stacked the deck in 
their favor. 

Drill could be in the same boat, although there are now more committers who are not 
working for MapR. I’m not sure who outside of HW is supporting Tez. 

But I digress. 

Here in the Spark user list, I have to ask how do you run hive on spark? Is the 
execution engine … the spark context always running? (Client mode I assume) 
Are the executors always running?   Can you run multiple queries from multiple 
users in parallel? 

These are some of the questions that should be asked and answered when 
considering how viable spark is going to be as the engine under Hive… 

Thx

-Mike

> On May 29, 2016, at 3:35 PM, Mich Talebzadeh  
> wrote:
> 
> thanks I think the problem is that the TEZ user group is exceptionally quiet. 
> Just sent an email to Hive user group to see anyone has managed to built a 
> vendor independent version.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 29 May 2016 at 21:23, Jörn Franke  > wrote:
> Well I think it is different from MR. It has some optimizations which you do 
> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
> 
> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
> integrated in the Hortonworks distribution. 
> 
> 
> On 29 May 2016, at 21:43, Mich Talebzadeh  > wrote:
> 
>> Hi Jorn,
>> 
>> I started building apache-tez-0.8.2 but got few errors. Couple of guys from 
>> TEZ user group kindly gave a hand but I could not go very far (or may be I 
>> did not make enough efforts) making it work.
>> 
>> That TEZ user group is very quiet as well.
>> 
>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>> in-memory capability.
>> 
>> It would be interesting to see what version of TEZ works as execution engine 
>> with Hive.
>> 
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>> Hive etc as I am sure you already know.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 20:19, Jörn Franke > > wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>> On 29 May 2016, at 13:40, Mich Talebzadeh > > wrote:
>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>> ​ 
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION..  Now that is 
>>> just an individual partition and there are 48 partitions.
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This is by no means indicate that Spark is much better than MR but shows 
>>> that some very good results can ve achieved using Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 24 May 2016 at 08:03, Mich Talebzadeh >> > wrote:
>>> Hi,
>>> 
>>> We use Hive as the database and use Spark as an all purpose query tool.
>>> 
>>> Whether Hive is the write database for purpose or one is better off with 
>>> something like Phoenix on Hbase, well the answer is it depends and your 
>>> mileage varies. 
>>> 
>>> So fit for purpose.
>>> 
>>> Ideally what wants is to use the fastest  method to get the results. How 
>>> fast we confine it to our SLA agreements in production and that helps us 
>>> from unnecessary further work as we technologists like to play around.
>>> 
>>> So in short, we use Spark most of the time and use Hive as the backend 
>>> engine for data storage, 

Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Michael Segel
Hi, 

I’m not sure I understand your initial question… 

Depending on the compression algo, you may or may not be able to split the 
file. 
So if it’s not splittable, you have a single long-running thread. 

My guess is that you end up with a single, very large partition. 
If so, if you repartition, you may end up seeing better performance in the 
join. 

I see that you’re using a hive context. 

Have you tried to manually do this using just data frames and compare the DAG 
to the SQL DAG? 

HTH

-Mike
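
A quick way to test the single-partition theory against the tables in this thread 
(the 48 is just an illustrative number): check the partition count of the staging 
side, repartition it, and compare the DAG and timings with the original run.

import org.apache.spark.sql.functions.desc

val s2 = hiveContext.table("sales2").select("PROD_ID")
val s  = hiveContext.table("sales_staging").select("PROD_ID")

println(s"staging partitions: ${s.rdd.partitions.length}")    // 1 partition suggests an unsplittable read

val joined = s2.join(s.repartition(48), "prod_id")            // spread the text-table side out first
joined.sort(desc("prod_id")).take(5).foreach(println)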

> On Jun 29, 2016, at 9:14 AM, Mich Talebzadeh  
> wrote:
> 
> Hi all,
> 
> It finished in 2 hours 18 minutes!
> 
> Started at
> [29/06/2016 10:25:27.27]
> [148]
> [148]
> [148]
> [148]
> [148]
> Finished at
> [29/06/2016 12:43:33.33]
> 
> I need to dig in more. 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 29 June 2016 at 10:42, Mich Talebzadeh  > wrote:
> Focusing on Spark job, as I mentioned before Spark is running in local mode 
> with 8GB of memory for both the driver and executor memory.
> 
> However, I still see this enormous Duration time which indicates something is 
> wrong badly!
> 
> Also I got rid of groupBy
> 
>   val s2 = HiveContext.table("sales2").select("PROD_ID")
>   val s = HiveContext.table("sales_staging").select("PROD_ID")
>   val rs = s2.join(s,"prod_id").sort(desc("prod_id")).take(5).foreach(println)
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 29 June 2016 at 10:18, Jörn Franke  > wrote:
> 
> I think the TEZ engine is much more maintained with respect to optimizations 
> related to Orc , hive , vectorizing, querying than the mr engine. It will be 
> definitely better to use it.
> Mr is also deprecated in hive 2.0.
> For me it does not make sense to use mr with hive larger than 1.1.
> 
> As I said, order by might be inefficient to use (not sure if this has 
> changed). You may want to use sort by.
> 
> That being said there are many optimizations methods.
> 
> On 29 Jun 2016, at 00:27, Mich Talebzadeh  > wrote:
> 
>> That is a good point.
>> 
>> The ORC table property is as follows
>> 
>> TBLPROPERTIES ( "orc.compress"="SNAPPY",
>> "orc.stripe.size"="268435456",
>> "orc.row.index.stride"="1")
>> 
>> which puts each stripe at 256MB
>> 
>> Just to clarify this is spark running on Hive tables. I don't think the use 
>> of TEZ, MR or Spark as execution engines is going to make any difference?
>> 
>> This is the same query with Hive on MR
>> 
>> select a.prod_id from sales2 a, sales_staging b where a.prod_id = b.prod_id 
>> order by a.prod_id;
>> 
>> 2016-06-28 23:23:51,203 Stage-1 map = 0%,  reduce = 0%
>> 2016-06-28 23:23:59,480 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 7.32 
>> sec
>> 2016-06-28 23:24:08,771 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 
>> 18.21 sec
>> 2016-06-28 23:24:11,860 Stage-1 map = 58%,  reduce = 0%, Cumulative CPU 
>> 22.34 sec
>> 2016-06-28 23:24:18,021 Stage-1 map = 62%,  reduce = 0%, Cumulative CPU 
>> 30.33 sec
>> 2016-06-28 23:24:21,101 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 
>> 33.45 sec
>> 2016-06-28 23:24:24,181 Stage-1 map = 66%,  reduce = 0%, Cumulative CPU 37.5 
>> sec
>> 2016-06-28 23:24:27,270 Stage-1 map = 69%,  reduce = 0%, Cumulative CPU 42.0 
>> sec
>> 2016-06-28 23:24:30,349 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
>> 45.62 sec
>> 2016-06-28 23:24:33,441 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 
>> 49.69 sec
>> 2016-06-28 23:24:36,521 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 
>> 52.92 sec
>> 2016-06-28 23:24:39,605 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 
>> 56.78 sec
>> 2016-06-28 

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread Michael Segel
Why couldn’t you use the spark thrift server?


On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca 
> wrote:

answer for Gourav Sengupta

I want to use same spark application because i want to work as a FIFO 
scheduler. My problem is that i have many jobs(not so big) and if i run an 
application for every job my cluster will split resources as a FAIR 
scheduler(it's what i observe, maybe i'm wrong) and exist the possibility to 
create bottleneck effect. The start time isn't a problem for me, because it 
isn't a real-time application.

I need a business solution, that's the reason why i can't use code from github.

Thanks!

2017-02-07 19:55 GMT+02:00 Gourav Sengupta 
>:
Hi,

May I ask the reason for using the same spark application? Is it because of the 
time it takes in order to start a spark context?

On another note you may want to look at the number of contributors in a github 
repo before choosing a solution.


Regards,
Gourav

On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski 
> wrote:
Spark jobserver or Livy server are the best options for pure technical API.
If you want to publish business API you will probably have to build you own app 
like the one I wrote a year ago https://github.com/elppc/akka-spark-experiments
It combines Akka actors and a shared Spark context to serve concurrent 
subsecond jobs


2017-02-07 15:28 GMT+01:00 ayan guha 
>:
I think you are loking for livy or spark  jobserver

On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca 
> wrote:

I want to run different jobs on demand with same spark context, but i don't 
know how exactly i can do this.

I try to get current context, but seems it create a new spark context(with new 
executors).

I call spark-submit to add new jobs.

I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance), with yarn 
as resource manager.

My code:

val sparkContext = SparkContext.getOrCreate()
val content = 1 to 4
val result = sparkContext.parallelize(content, 5)
result.map(value => value.toString).foreach(loop)

def loop(x: String): Unit = {
   for (a <- 1 to 3000) {

   }
}


spark-submit:

spark-submit --executor-cores 1 \
 --executor-memory 1g \
 --driver-memory 1g \
 --master yarn \
 --deploy-mode cluster \
 --conf spark.dynamicAllocation.enabled=true \
 --conf spark.shuffle.service.enabled=true \
 --conf spark.dynamicAllocation.minExecutors=1 \
 --conf spark.dynamicAllocation.maxExecutors=3 \
 --conf spark.dynamicAllocation.initialExecutors=3 \
 --conf spark.executor.instances=3 \


If i run twice spark-submit it create 6 executors, but i want to run all this 
jobs on same spark application.

How can achieve adding jobs to an existing spark application?

I don't understand why SparkContext.getOrCreate() don't get existing spark 
context.


Thanks,

Cosmin P.

--
Best Regards,
Ayan Guha






Re: Spark/Parquet/Statistics question

2017-01-17 Thread Michael Segel
Hi, 
Lexicographically speaking, min/max should work because Strings support a 
comparator. So anything which supports an ordering comparison (<, >, <=, >=, 
==, …) can also support min and max functions as well. 

I guess the question is if Spark does support this, and if not, why? 
Yes, it makes sense. 



> On Jan 17, 2017, at 9:17 AM, Jörn Franke  wrote:
> 
> Hallo,
> 
> I am not sure what you mean by min/max for strings. I do not know if this 
> makes sense. What the ORC format has is bloom filters for strings etc. - are 
> you referring to this? 
> 
> In order to apply min/max filters Spark needs to read the meta data of the 
> file. If the filter is applied or not - this you can see from the number of 
> bytes read.
> 
> 
> Best regards
> 
>> On 17 Jan 2017, at 15:28, djiang  wrote:
>> 
>> Hi, 
>> 
>> I have been looking into how Spark stores statistics (min/max) in Parquet as
>> well as how it uses the info for query optimization.
>> I have got a few questions.
>> First setup: Spark 2.1.0, the following sets up a Dataframe of 1000 rows,
>> with a long type and a string type column.
>> They are sorted by different columns, though.
>> 
>> scala> spark.sql("select id, cast(id as string) text from
>> range(1000)").sort("id").write.parquet("/secret/spark21-sortById")
>> scala> spark.sql("select id, cast(id as string) text from
>> range(1000)").sort("Text").write.parquet("/secret/spark21-sortByText")
>> 
>> I added some code to parquet-tools to print out stats and examine the
>> generated parquet files:
>> 
>> hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar meta
>> /secret/spark21-sortById/part-0-39f7ac12-6038-46ee-b5c3-d7a5a06e4425.snappy.parquet
>>  
>> file:   
>> file:/secret/spark21-sortById/part-0-39f7ac12-6038-46ee-b5c3-d7a5a06e4425.snappy.parquet
>>  
>> creator: parquet-mr version 1.8.1 (build
>> 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf) 
>> extra:   org.apache.spark.sql.parquet.row.metadata =
>> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"text","type":"string","nullable":false,"metadata":{}}]}
>>  
>> 
>> file schema: spark_schema 
>> 
>> id:  REQUIRED INT64 R:0 D:0
>> text:REQUIRED BINARY O:UTF8 R:0 D:0
>> 
>> row group 1: RC:5 TS:133 OFFSET:4 
>> 
>> id:   INT64 SNAPPY DO:0 FPO:4 SZ:71/81/1.14 VC:5
>> ENC:PLAIN,BIT_PACKED STA:[min: 0, max: 4, num_nulls: 0]
>> text: BINARY SNAPPY DO:0 FPO:75 SZ:53/52/0.98 VC:5
>> ENC:PLAIN,BIT_PACKED
>> 
>> hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar meta
>> /secret/spark21-sortByText/part-0-3d7eac74-5ca0-44a0-b8a6-d67cc38a2bde.snappy.parquet
>>  
>> file:   
>> file:/secret/spark21-sortByText/part-0-3d7eac74-5ca0-44a0-b8a6-d67cc38a2bde.snappy.parquet
>>  
>> creator: parquet-mr version 1.8.1 (build
>> 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf) 
>> extra:   org.apache.spark.sql.parquet.row.metadata =
>> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"text","type":"string","nullable":false,"metadata":{}}]}
>>  
>> 
>> file schema: spark_schema 
>> 
>> id:  REQUIRED INT64 R:0 D:0
>> text:REQUIRED BINARY O:UTF8 R:0 D:0
>> 
>> row group 1: RC:5 TS:140 OFFSET:4 
>> 
>> id:   INT64 SNAPPY DO:0 FPO:4 SZ:71/81/1.14 VC:5
>> ENC:PLAIN,BIT_PACKED STA:[min: 0, max: 101, num_nulls: 0]
>> text: BINARY SNAPPY DO:0 FPO:75 SZ:60/59/0.98 VC:5
>> ENC:PLAIN,BIT_PACKED
>> 
>> So the question is why does Spark, particularly 2.1.0, only generate min/max
>> for numeric columns, but not string (BINARY) fields, even if the string
>> field is included in the sort? Maybe I missed a configuration?
>> 
>> The second issue, is how can I confirm Spark is utilizing the min/max?
>> scala> sc.setLogLevel("INFO")
>> scala> spark.sql("select * from parquet.`/secret/spark21-sortById` where
>> id=4").show
>> I got many lines like this:
>> 17/01/17 09:23:35 INFO FilterCompat: Filtering using predicate:
>> and(noteq(id, null), eq(id, 4))
>> 17/01/17 09:23:35 INFO FileScanRDD: Reading File path:
>> file:///secret/spark21-sortById/part-0-39f7ac12-6038-46ee-b5c3-d7a5a06e4425.snappy.parquet,
>> range: 0-558, partition values: [empty row]
>> ...
>> 17/01/17 09:23:35 INFO FilterCompat: Filtering using predicate:
>> and(noteq(id, null), eq(id, 4))
>> 17/01/17 09:23:35 INFO FileScanRDD: Reading File path:
>> file:///secret/spark21-sortById/part-00193-39f7ac12-6038-46ee-b5c3-d7a5a06e4425.snappy.parquet,
>> range: 0-574, partition values: [empty row]
>> ...
>> 
>> The question is it looks like Spark is scanning every file, even if from the
>> min/max, 

Quick but probably silly question...

2017-01-17 Thread Michael Segel
Hi, 
While the Parquet file is immutable and the data sets are immutable, how does 
Spark SQL handle updates or deletes? 
I mean if I read in a file using SQL into an RDD, mutate it, e.g. delete a row, 
and then persist it, I now have two files. If I reread the table back in … will 
I see duplicates or not? 

The larger issue is how to handle mutable data in a multi-user / multi-tenant 
situation and using Parquet as the storage. 

Would this be the right tool? 

W.R.T ORC files, mutation is handled by Tez. 

Thanks in Advance, 

-Mike
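
For what it's worth, a minimal sketch (not from the thread; the path and column are 
illustrative assumptions) of the pattern being asked about. Since Parquet files are 
immutable, a delete is really a rewrite: writing the filtered result to a new location 
(or overwriting) leaves a single consistent copy, while appending the mutated copy next 
to the original would give you both sets of rows on re-read.

import org.apache.spark.sql.SaveMode
import spark.implicits._

// Read, "delete" a row by filtering it out, and persist the result.
val events  = spark.read.parquet("/data/events")    // hypothetical input path
val mutated = events.filter($"id" =!= 42)           // drop one row

// Writing the result to a fresh directory (or overwriting) keeps one consistent copy.
mutated.write.mode(SaveMode.Overwrite).parquet("/data/events_v2")

// Appending the mutated copy into the original directory instead would keep the old
// files as well, so re-reading that directory would show the "deleted" row again.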



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Ok… so what’s the tricky part? 
Spark Streaming isn’t real time so if you don’t mind a slight delay in 
processing… it would work.

The drawback is that you now have a long running Spark Job (assuming under 
YARN) and that could become a problem in terms of security and resources. 
(How well does Yarn handle long running jobs these days in a secured Cluster? 
Steve L. may have some insight… ) 

Raw HDFS would become a problem because Apache HDFS is still a WORM (write once, 
read many) file system. (Do you want to write your own compaction code? Or use Hive 1.x+?)

HBase? Depending on your admin… stability could be a problem. 
Cassandra? That would be a separate cluster and that in itself could be a 
problem… 

YMMV so you need to address the pros/cons of each tool specific to your 
environment and skill level. 

HTH

-Mike

> On Sep 29, 2016, at 8:54 AM, Ali Akhtar  wrote:
> 
> I have a somewhat tricky use case, and I'm looking for ideas.
> 
> I have 5-6 Kafka producers, reading various APIs, and writing their raw data 
> into Kafka.
> 
> I need to:
> 
> - Do ETL on the data, and standardize it.
> 
> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
> ElasticSearch / Postgres)
> 
> - Query this data to generate reports / analytics (There will be a web UI 
> which will be the front-end to the data, and will show the reports)
> 
> Java is being used as the backend language for everything (backend of the web 
> UI, as well as the ETL layer)
> 
> I'm considering:
> 
> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive 
> raw data from Kafka, standardize & store it)
> 
> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, and 
> to allow queries
> 
> - In the backend of the web UI, I could either use Spark to run queries 
> across the data (mostly filters), or directly run queries against Cassandra / 
> HBase
> 
> I'd appreciate some thoughts / suggestions on which of these alternatives I 
> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
> persistent data store to use, and how to query that data store in the backend 
> of the web UI, for displaying the reports).
> 
> 
> Thanks.



Re: Treadting NaN fields in Spark

2016-09-29 Thread Michael Segel

On Sep 29, 2016, at 10:29 AM, Mich Talebzadeh 
<mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>> wrote:

Good points :) would it take "-" as a negative number, -123456?

Yeah… you have to go down a level and start to remember that you’re dealing 
with a stream or buffer of bytes below any casting.

At this moment in time this is what the code does


  1.  The CSV is imported into HDFS as is. No cleaning is done for rogue columns 
at the shell level.
  2.  The Spark program does the following filtering:
  3.  val rs = df2.filter($"Open" !== "-").filter($"Volume".cast("Integer") > 0)

So my first line of defence is to check for !== "-", which is a dash commonly 
used for "not available". The next filter is for the volume column > 0 (there were 
trades on this stock), otherwise the calculation could skew the results. Note 
that a filter combining AND with !== will not work.


You can’t rely on the ‘-‘ to represent NaN or NULL.

The issue is that you’re going from a loose typing to a stronger typing (String 
to Double).
So pretty much any byte buffer could be interpreted as a String, but iff the 
String value is too long to be a Double, you will fail the NaN test. (Or it's a 
NULL value/string.)
As to filtering… you would probably want to filter on volume being == 0. (It's 
possible to actually have a negative volume.)
Or you could set the opening, low, high to the close if the volume is 0, 
regardless of the values in those columns.

Note: This would be a transformation of the data and should be done during 
ingestion so you’re doing it only once.

Or you could just remove the rows since no trades occurred and then either 
reflect it in your graph as gaps or let the graph interpolate it out.
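
A small sketch of that ingestion-time transformation (not from the thread; column names 
follow the df2 used here): map the "-" placeholder to null, cast the price column to 
Double, and drop the zero-volume rows.

import org.apache.spark.sql.functions.{col, when}

// "-" becomes null, the cast then yields a proper numeric column; rows with no
// trades are dropped rather than being allowed to skew later calculations.
val cleaned = df2
  .withColumn("Open", when(col("Open") === "-", null).otherwise(col("Open")).cast("double"))
  .filter(col("Volume").cast("long") > 0)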


scala> val rs = df2.filter($"Open" !== "-" && $"Volume".cast("Integer") > 0)
:40: error: value && is not a member of String
   val rs = df2.filter($"Open" !== "-" && $"Volume".cast("Integer") > 0)

Will throw an error.

But this equality === works!

scala> val rs = df2.filter($"Open" === "-" && $"Volume".cast("Integer") > 0)
rs: org.apache.spark.sql.Dataset[columns] = [Stock: string, Ticker: string ... 
6 more fields]


Another alternative is to check for all digits here

 scala> def isAllPostiveNumber (price: String) = price forall Character.isDigit
isAllPostiveNumber: (price: String)Boolean

Not really a good idea. You're walking through each byte in a stream and checking 
to see if it's a digit. What if it's a NULL string? What do you set the value to?
This doesn't scale well…

Again why not remove the rows where the volume of trades is 0?

Returns Boolean true or false. But it does not work unless someone tells me what 
is wrong with this below!

scala> val rs = df2.filter(isAllPostiveNumber("Open") => true)

scala> val rs = df2.filter(isAllPostiveNumber("Open") => true)
:1: error: not a legal formal parameter.
Note: Tuples cannot be directly destructured in method or function parameters.
  Either create a single parameter accepting the Tuple1,
  or consider a pattern matching anonymous function: `{ case (param1, 
param1) => ... }
val rs = df2.filter(isAllPostiveNumber("Open") => true)


Thanks











Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 29 September 2016 at 13:45, Michael Segel 
<msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>> wrote:
Hi,

Just a few thoughts so take it for what its worth…

Databases have static schemas and will reject a row’s column on insert.

In your case… you have one data set where you have a column which is supposed 
to be a number but you have it as a string.
You want to convert this to a double in your final data set.


It looks like your problem is that your original data set that you ingested 
used a ‘-‘ (dash) to represent missing data, rather than a NULL value.
In fact, looking at the rows… you seem to have a stock that didn’t trade for a 
given day. (All have Volume as 0. ) Why do you need this?  Wouldn’t you want to 
represent this as null or no row for a given date?

The reason your ‘-‘ check failed with isnan() is that ‘-‘ actually could be 
represented as a number.

If you replaced the ‘-‘ with a String that is wider than the width of a double 
… the isnan should flag the row.

(I 

Fwd: tod...@yahoo-inc.com is no longer with Yahoo! (was: Re: Treadting NaN fields in Spark)

2016-09-29 Thread Michael Segel
Hi,
Hate to be a pain… but could someone remove this email address (see below) from 
the spark mailing list(s)
It seems that ‘Elvis’ has left the building and forgot to change his mail 
subscriptions…

Begin forwarded message:

From: Yahoo! No Reply 
>
Subject: tod...@yahoo-inc.com is no longer with 
Yahoo! (was: Re: Treadting NaN fields in Spark)
Date: September 29, 2016 at 10:56:10 AM CDT
To: >


This is an automatically generated message.

tod...@yahoo-inc.com is no longer with Yahoo! Inc.

Your message will not be forwarded.

If you have a sales inquiry, please email 
yahoosa...@yahoo-inc.com and someone will 
follow up with you shortly.

If you require assistance with a legal matter, please send a message to 
legal-noti...@yahoo-inc.com

Thank you!



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Spark standalone is not Yarn… or secure for that matter… ;-)

> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <c...@koeninger.org> wrote:
> 
> Spark streaming helps with aggregation because
> 
> A. raw kafka consumers have no built in framework for shuffling
> amongst nodes, short of writing into an intermediate topic (I'm not
> touching Kafka Streams here, I don't have experience), and
> 
> B. it deals with batches, so you can transactionally decide to commit
> or rollback your aggregate data and your offsets.  Otherwise your
> offsets and data store can get out of sync, leading to lost /
> duplicate data.
> 
> Regarding long running spark jobs, I have streaming jobs in the
> standalone manager that have been running for 6 months or more.
> 
> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
> <msegel_had...@hotmail.com> wrote:
>> Ok… so what’s the tricky part?
>> Spark Streaming isn’t real time so if you don’t mind a slight delay in 
>> processing… it would work.
>> 
>> The drawback is that you now have a long running Spark Job (assuming under 
>> YARN) and that could become a problem in terms of security and resources.
>> (How well does Yarn handle long running jobs these days in a secured 
>> Cluster? Steve L. may have some insight… )
>> 
>> Raw HDFS would become a problem because Apache HDFS is still a worm. (Do you 
>> want to write your own compaction code? Or use Hive 1.x+?)
>> 
>> HBase? Depending on your admin… stability could be a problem.
>> Cassandra? That would be a separate cluster and that in itself could be a 
>> problem…
>> 
>> YMMV so you need to address the pros/cons of each tool specific to your 
>> environment and skill level.
>> 
>> HTH
>> 
>> -Mike
>> 
>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>> 
>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>> 
>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw 
>>> data into Kafka.
>>> 
>>> I need to:
>>> 
>>> - Do ETL on the data, and standardize it.
>>> 
>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
>>> ElasticSearch / Postgres)
>>> 
>>> - Query this data to generate reports / analytics (There will be a web UI 
>>> which will be the front-end to the data, and will show the reports)
>>> 
>>> Java is being used as the backend language for everything (backend of the 
>>> web UI, as well as the ETL layer)
>>> 
>>> I'm considering:
>>> 
>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive 
>>> raw data from Kafka, standardize & store it)
>>> 
>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, 
>>> and to allow queries
>>> 
>>> - In the backend of the web UI, I could either use Spark to run queries 
>>> across the data (mostly filters), or directly run queries against Cassandra 
>>> / HBase
>>> 
>>> I'd appreciate some thoughts / suggestions on which of these alternatives I 
>>> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
>>> persistent data store to use, and how to query that data store in the 
>>> backend of the web UI, for displaying the reports).
>>> 
>>> 
>>> Thanks.
>> 



Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
OP mentioned HBase or HDFS as persisted storage. Therefore they have to be 
running YARN if they are considering spark. 
(Assuming that you’re not trying to do a storage / compute model and use 
standalone spark outside your cluster. You can, but you have more moving 
parts…) 

I never said anything about putting something on a public network. I mentioned 
running a secured cluster.
You don’t deal with PII or other regulated data, do you? 


If you read my original post, you are correct we don’t have a lot, if any real 
information. 
Based on what the OP said, there are design considerations since every tool he 
mentioned has pluses and minuses and the problem isn’t really that challenging 
unless you have something extraordinary like high velocity or some other 
constraint that makes this challenging. 

BTW, depending on scale and velocity… your relational engines may become 
problematic. 
HTH

-Mike


> On Sep 29, 2016, at 1:51 PM, Cody Koeninger <c...@koeninger.org> wrote:
> 
> The OP didn't say anything about Yarn, and why are you contemplating
> putting Kafka or Spark on public networks to begin with?
> 
> Gwen's right, absent any actual requirements this is kind of pointless.
> 
> On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel
> <msegel_had...@hotmail.com> wrote:
>> Spark standalone is not Yarn… or secure for that matter… ;-)
>> 
>>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>> 
>>> Spark streaming helps with aggregation because
>>> 
>>> A. raw kafka consumers have no built in framework for shuffling
>>> amongst nodes, short of writing into an intermediate topic (I'm not
>>> touching Kafka Streams here, I don't have experience), and
>>> 
>>> B. it deals with batches, so you can transactionally decide to commit
>>> or rollback your aggregate data and your offsets.  Otherwise your
>>> offsets and data store can get out of sync, leading to lost /
>>> duplicate data.
>>> 
>>> Regarding long running spark jobs, I have streaming jobs in the
>>> standalone manager that have been running for 6 months or more.
>>> 
>>> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
>>> <msegel_had...@hotmail.com> wrote:
>>>> Ok… so what’s the tricky part?
>>>> Spark Streaming isn’t real time so if you don’t mind a slight delay in 
>>>> processing… it would work.
>>>> 
>>>> The drawback is that you now have a long running Spark Job (assuming under 
>>>> YARN) and that could become a problem in terms of security and resources.
>>>> (How well does Yarn handle long running jobs these days in a secured 
>>>> Cluster? Steve L. may have some insight… )
>>>> 
>>>> Raw HDFS would become a problem because Apache HDFS is still a worm. (Do 
>>>> you want to write your own compaction code? Or use Hive 1.x+?)
>>>> 
>>>> HBase? Depending on your admin… stability could be a problem.
>>>> Cassandra? That would be a separate cluster and that in itself could be a 
>>>> problem…
>>>> 
>>>> YMMV so you need to address the pros/cons of each tool specific to your 
>>>> environment and skill level.
>>>> 
>>>> HTH
>>>> 
>>>> -Mike
>>>> 
>>>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>> 
>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>> 
>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw 
>>>>> data into Kafka.
>>>>> 
>>>>> I need to:
>>>>> 
>>>>> - Do ETL on the data, and standardize it.
>>>>> 
>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
>>>>> ElasticSearch / Postgres)
>>>>> 
>>>>> - Query this data to generate reports / analytics (There will be a web UI 
>>>>> which will be the front-end to the data, and will show the reports)
>>>>> 
>>>>> Java is being used as the backend language for everything (backend of the 
>>>>> web UI, as well as the ETL layer)
>>>>> 
>>>>> I'm considering:
>>>>> 
>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer 
>>>>> (receive raw data from Kafka, standardize & store it)
>>>>> 
>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, 
>>>>> and to allow queries
>>>>> 
>>>>> - In the backend of the web UI, I could either use Spark to run queries 
>>>>> across the data (mostly filters), or directly run queries against 
>>>>> Cassandra / HBase
>>>>> 
>>>>> I'd appreciate some thoughts / suggestions on which of these alternatives 
>>>>> I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
>>>>> persistent data store to use, and how to query that data store in the 
>>>>> backend of the web UI, for displaying the reports).
>>>>> 
>>>>> 
>>>>> Thanks.
>>>> 
>> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Michael Segel
I’ve asked this question a couple of times from a friend who didn’t know the 
answer… so I thought I would try here. 


Suppose we launch a job on a cluster (YARN) and we have set up the containers 
to be 3GB in size.


What does that 3GB represent? 

I mean what happens if we end up using 2-3GB of off heap storage via tungsten? 
What will Spark do? 
Will it try to honor the container’s limits and throw an exception or will it 
allow my job to grab that amount of memory and exceed YARN’s expectations since 
its off heap? 

Thx

-Mike



Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
Thanks for the response Sean. 

But how does YARN know about the off-heap memory usage? 
That’s the piece that I’m missing.

Thx again, 

-Mike

> On Sep 21, 2016, at 10:09 PM, Sean Owen <so...@cloudera.com> wrote:
> 
> No, Xmx only controls the maximum size of on-heap allocated memory.
> The JVM doesn't manage/limit off-heap (how could it? it doesn't know
> when it can be released).
> 
> The answer is that YARN will kill the process because it's using more
> memory than it asked for. A JVM is always going to use a little
> off-heap memory by itself, so setting a max heap size of 2GB means the
> JVM process may use a bit more than 2GB of memory. With an off-heap
> intensive app like Spark it can be a lot more.
> 
> There's a built-in 10% overhead, so that if you ask for a 3GB executor
> it will ask for 3.3GB from YARN. You can increase the overhead.
> 
> On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>> All off-heap memory is still managed by the JVM process. If you limit the
>> memory of this process then you limit the memory. I think the memory of the
>> JVM process could be limited via the xms/xmx parameter of the JVM. This can
>> be configured via spark options for yarn (be aware that they are different
>> in cluster and client mode), but i recommend to use the spark options for
>> the off heap maximum.
>> 
>> https://spark.apache.org/docs/latest/running-on-yarn.html
>> 
>> 
>> On 21 Sep 2016, at 22:02, Michael Segel <msegel_had...@hotmail.com> wrote:
>> 
>> I’ve asked this question a couple of times from a friend who didn’t know
>> the answer… so I thought I would try here.
>> 
>> 
>> Suppose we launch a job on a cluster (YARN) and we have set up the
>> containers to be 3GB in size.
>> 
>> 
>> What does that 3GB represent?
>> 
>> I mean what happens if we end up using 2-3GB of off heap storage via
>> tungsten?
>> What will Spark do?
>> Will it try to honor the container’s limits and throw an exception or will
>> it allow my job to grab that amount of memory and exceed YARN’s
>> expectations since its off heap?
>> 
>> Thx
>> 
>> -Mike
>> 
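
To make the knobs in this exchange concrete, here is a sketch (the values are illustrative 
assumptions, and these settings are normally passed as --conf options at submit time rather 
than hard-coded):

import org.apache.spark.SparkConf

// Executor heap plus the YARN overhead allowance must fit inside the container.
val conf = new SparkConf()
  .set("spark.executor.memory", "3g")                 // on-heap size of each executor JVM
  .set("spark.yarn.executor.memoryOverhead", "1024")  // extra MB per container for off-heap and JVM overhead
                                                      //   (defaults to 10% of executor memory, minimum 384 MB)
  .set("spark.memory.offHeap.enabled", "true")        // let Tungsten allocate off-heap explicitly
  .set("spark.memory.offHeap.size", "1g")             // cap for that explicit off-heap allocation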



Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
I would disagree. 

While you can tune the system to not oversubscribe, I would rather have it hit 
swap than fail. Especially on long running jobs. 

If we look at oversubscription on Hadoop clusters which are not running HBase… 
they survive. It's when you have things like HBase that don't handle swap well… 
or you don't allocate enough swap that things go boom. 

Also consider that you could move swap to something that is faster than 
spinning rust. 


> On Sep 22, 2016, at 12:44 PM, Sean Owen <so...@cloudera.com> wrote:
> 
> I don't think I'd enable swap on a cluster. You'd rather processes
> fail than grind everything to a halt. You'd buy more memory or
> optimize memory before trading it for I/O.
> 
> On Thu, Sep 22, 2016 at 6:29 PM, Michael Segel
> <msegel_had...@hotmail.com> wrote:
>> Ok… gotcha… wasn’t sure that YARN just looked at the heap size allocation 
>> and ignored the off heap.
>> 
>> WRT over all OS memory… this would be one reason why I’d keep a decent 
>> amount of swap around. (Maybe even putting it on a fast device like an .m2 
>> or PCIe flash drive….



Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
Ok… gotcha… wasn’t sure that YARN just looked at the heap size allocation and 
ignored the off heap. 

WRT overall OS memory… this would be one reason why I'd keep a decent amount 
of swap around. (Maybe even putting it on a fast device like an M.2 or PCIe 
flash drive….) 


> On Sep 22, 2016, at 9:56 AM, Sean Owen <so...@cloudera.com> wrote:
> 
> It's looking at the whole process's memory usage, and doesn't care
> whether the memory is used by the heap or not within the JVM. Of
> course, allocating memory off-heap still counts against you at the OS
> level.
> 
> On Thu, Sep 22, 2016 at 3:54 PM, Michael Segel
> <msegel_had...@hotmail.com> wrote:
>> Thanks for the response Sean.
>> 
>> But how does YARN know about the off-heap memory usage?
>> That’s the piece that I’m missing.
>> 
>> Thx again,
>> 
>> -Mike
>> 
>>> On Sep 21, 2016, at 10:09 PM, Sean Owen <so...@cloudera.com> wrote:
>>> 
>>> No, Xmx only controls the maximum size of on-heap allocated memory.
>>> The JVM doesn't manage/limit off-heap (how could it? it doesn't know
>>> when it can be released).
>>> 
>>> The answer is that YARN will kill the process because it's using more
>>> memory than it asked for. A JVM is always going to use a little
>>> off-heap memory by itself, so setting a max heap size of 2GB means the
>>> JVM process may use a bit more than 2GB of memory. With an off-heap
>>> intensive app like Spark it can be a lot more.
>>> 
>>> There's a built-in 10% overhead, so that if you ask for a 3GB executor
>>> it will ask for 3.3GB from YARN. You can increase the overhead.
>>> 
>>> On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> All off-heap memory is still managed by the JVM process. If you limit the
>>>> memory of this process then you limit the memory. I think the memory of the
>>>> JVM process could be limited via the xms/xmx parameter of the JVM. This can
>>>> be configured via spark options for yarn (be aware that they are different
>>>> in cluster and client mode), but i recommend to use the spark options for
>>>> the off heap maximum.
>>>> 
>>>> https://spark.apache.org/docs/latest/running-on-yarn.html
>>>> 
>>>> 
>>>> On 21 Sep 2016, at 22:02, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>> 
>>>> I’ve asked this question a couple of times from a friend who didn’t 
>>>> know
>>>> the answer… so I thought I would try here.
>>>> 
>>>> 
>>>> Suppose we launch a job on a cluster (YARN) and we have set up the
>>>> containers to be 3GB in size.
>>>> 
>>>> 
>>>> What does that 3GB represent?
>>>> 
>>>> I mean what happens if we end up using 2-3GB of off heap storage via
>>>> tungsten?
>>>> What will Spark do?
>>>> Will it try to honor the container’s limits and throw an exception or 
>>>> will
>>>> it allow my job to grab that amount of memory and exceed YARN’s
>>>> expectations since its off heap?
>>>> 
>>>> Thx
>>>> 
>>>> -Mike
>>>> 
>> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Treadting NaN fields in Spark

2016-09-29 Thread Michael Segel
Hi,

Just a few thoughts so take it for what its worth…

Databases have static schemas and will reject a row’s column on insert.

In your case… you have one data set where you have a column which is supposed 
to be a number but you have it as a string.
You want to convert this to a double in your final data set.


It looks like your problem is that your original data set that you ingested 
used a ‘-‘ (dash) to represent missing data, rather than a NULL value.
In fact, looking at the rows… you seem to have a stock that didn’t trade for a 
given day. (All have Volume as 0. ) Why do you need this?  Wouldn’t you want to 
represent this as null or no row for a given date?

The reason your ‘-‘ check failed with isnan() is that ‘-‘ actually could be 
represented as a number.

If you replaced the ‘-‘ with a String that is wider than the width of a double 
… the isnan should flag the row.

(I still need more coffee, so I could be wrong) ;-)

HTH

-Mike

On Sep 28, 2016, at 5:56 AM, Mich Talebzadeh 
> wrote:


This is an issue in most databases. Specifically if a field is NaN (NaN, 
standing for "not a number", is a numeric data type value representing an 
undefined or unrepresentable value, especially in floating-point calculations).

There is a method called isnan() in Spark that is supposed to handle this 
scenario. However, it does not return correct values! For example I defined 
column "Open" as String (it should be Float) and it has the following 7 rogue 
entries out of 1272 rows in a CSV:

df2.filter( $"OPen" === 
"-").select((changeToDate("TradeDate").as("TradeDate")), 'Open, 'High, 'Low, 
'Close, 'Volume).show

+--+++---+-+--+
| TradeDate|Open|High|Low|Close|Volume|
+--+++---+-+--+
|2011-12-23|   -|   -|  -|40.56| 0|
|2011-04-21|   -|   -|  -|45.85| 0|
|2010-12-30|   -|   -|  -|38.10| 0|
|2010-12-23|   -|   -|  -|38.36| 0|
|2008-04-30|   -|   -|  -|32.39| 0|
|2008-04-29|   -|   -|  -|33.05| 0|
|2008-04-28|   -|   -|  -|32.60| 0|
+--+++---+-+--+

However, the following does not work!

 df2.filter(isnan($"Open")).show
+-+--+-+++---+-+--+
|Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
+-+--+-+++---+-+--+
+-+--+-+++---+-+--+

Any suggestions?

Thanks


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.






Re: building runnable distribution from source

2016-09-29 Thread Michael Segel
You may want to replace the 2.4 with a later release.

On Sep 29, 2016, at 3:08 AM, AssafMendelson 
> wrote:

Hi,
I am trying to compile the latest branch of spark in order to try out some code 
I wanted to contribute.

I was looking at the instructions to build from 
http://spark.apache.org/docs/latest/building-spark.html
So at first I did:
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
This worked without a problem and compiled.

I then did
./dev/make-distribution.sh --name custom-spark --tgz -e -Psparkr -Phadoop-2.4 
-Phive -Phive-thriftserver -Pyarn
Which failed.
(I added the -e because the first run, without it, suggested adding this to get 
more information).
If I look at the compilation itself, it provides no messages for Spark Project 
Core:

[INFO] Building Spark Project Core 2.1.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] 
[INFO] Building Spark Project YARN Shuffle Service 2.1.0-SNAPSHOT
[INFO] ---

However, when I reach the summary I find that core has failed to compile.
Below are the messages from the end of the compilation, but I can't find any 
direct error.
I tried to Google this but found no solution. Could anyone point me to how to 
fix this?


[INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @ 
spark-core_2.11 ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 74 source files to 
/home/mendea3/git/spark/core/target/scala-2.11/classes
[INFO]
[INFO] --- exec-maven-plugin:1.4.0:exec (sparkr-pkg) @ spark-core_2.11 ---
Cannot find 'R_HOME'. Please specify 'R_HOME' or make sure R is properly 
installed.
[INFO] 
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [  4.165 s]
[INFO] Spark Project Tags . SUCCESS [  5.163 s]
[INFO] Spark Project Sketch ... SUCCESS [  7.393 s]
[INFO] Spark Project Networking ... SUCCESS [ 18.929 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 10.528 s]
[INFO] Spark Project Unsafe ... SUCCESS [ 14.453 s]
[INFO] Spark Project Launcher . SUCCESS [ 15.198 s]
[INFO] Spark Project Core . FAILURE [ 57.641 s]
[INFO] Spark Project ML Local Library . SUCCESS [ 10.561 s]
[INFO] Spark Project GraphX ... SKIPPED
[INFO] Spark Project Streaming  SKIPPED
[INFO] Spark Project Catalyst . SKIPPED
[INFO] Spark Project SQL .. SKIPPED
[INFO] Spark Project ML Library ... SKIPPED
[INFO] Spark Project Tools  SUCCESS [  4.188 s]
[INFO] Spark Project Hive . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project YARN Shuffle Service . SUCCESS [ 16.128 s]
[INFO] Spark Project YARN . SKIPPED
[INFO] Spark Project Hive Thrift Server ... SKIPPED
[INFO] Spark Project Assembly . SKIPPED
[INFO] Spark Project External Flume Sink .. SUCCESS [  9.855 s]
[INFO] Spark Project External Flume ... SKIPPED
[INFO] Spark Project External Flume Assembly .. SKIPPED
[INFO] Spark Integration for Kafka 0.8  SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Project External Kafka Assembly .. SKIPPED
[INFO] Spark Integration for Kafka 0.10 ... SKIPPED
[INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
[INFO] Spark Project Java 8 Tests . SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 01:52 min (Wall Clock)
[INFO] Finished at: 2016-09-29T10:48:57+03:00
[INFO] Final Memory: 49M/771M
[INFO] 
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:exec 
(sparkr-pkg) on project spark-core_2.11: Command execution failed. Process 
exited with an error: 1 (Exit value: 1) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
org.codehaus.mojo:exec-maven-plugin:1.4.0:exec (sparkr-pkg) on project 
spark-core_2.11: Command execution failed.
at 

Re: Spark Hive Rejection

2016-09-29 Thread Michael Segel
Correct me if I’m wrong, but isn’t Hive schema-on-read and not schema-on-write?
So you shouldn’t fail on write.


On Sep 29, 2016, at 1:25 AM, Mostafa Alaa Mohamed 
> wrote:

Dears,
I want to ask:
• What will happen if there are rejected rows when inserting a 
dataframe into Hive?
o   Rejection will be, for example, when the table requires an integer column and 
the dataframe includes a string.
o   Duplicate rejection restrictions from the table itself?
• How can we specify the rejection directory?
If not available, do you recommend opening a Jira issue?

Best Regards,
Mostafa Alaa Mohamed,
Technical Expert Big Data,
M: +971506450787
Email: mohamedamost...@etisalat.ae


The content of this email together with any attachments, statements and 
opinions expressed herein contains information that is private and confidential 
are intended for the named addressee(s) only. If you are not the addressee of 
this email you may not copy, forward, disclose or otherwise use it or any part 
of it in any form whatsoever. If you have received this message in error please 
notify postmas...@etisalat.ae by email 
immediately and delete the message without making any copies.



Re: sanboxing spark executors

2016-11-08 Thread Michael Segel
Not that easy of a problem to solve… 

Can you impersonate the user who provided the code? 

I mean if Joe provides the lambda function, then it runs as Joe so it has joe’s 
permissions. 

Steve is right, you’d have to get down to your cluster’s security and 
authenticate the user before accepting the lambda code. You may also want to 
run with a restricted subset of permissions. 
(e.g. Joe is an admin, but he wants it to run as if its an untrusted user… this 
gets a bit more interesting.) 

And this begs the question… 

How are you sharing your RDDs across multiple users?  This too opens up a 
security question or two… 



> On Nov 4, 2016, at 6:13 PM, blazespinnaker  wrote:
> 
> In particular, we need to make sure the RDDs execute the lambda functions
> securely as they are provided by user code.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/sanboxing-spark-executors-tp28014p28024.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 



How sensitive is Spark to Swap?

2016-11-07 Thread Michael Segel
This may seem like a silly question, but it really isn’t. 
In terms of Map/Reduce, it’s possible to oversubscribe the cluster because 
there is a lack of sensitivity if the servers swap memory to disk. 

In terms of HBase, which is very sensitive, swap doesn’t just kill performance, 
but also can kill HBase. (I’m sure one can tune it to be less sensitive…) 

But I have to ask how sensitive is Spark? 
Considering we can cache to disk (local disk), it would imply that it would be less 
sensitive. 
Yet we see some posters facing oversubscription and hitting OOME. 

Thoughts? 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Streaming backpressure weird behavior/bug

2016-11-07 Thread Michael Segel

Spark inherits its security from the underlying mechanisms in either YARN or 
MESOS (whichever environment you are launching your cluster/jobs)

That said… there is limited support from Ranger.  There are three parts to this…

1) Ranger being called when the job is launched…

2) Ranger being called when data is being read from disk (HDFS) or HBase, 
however… once the application has the data… its fair game.

Now if Ranger were woven into a thrift server (which would be a one-off) then 
you would have more security if you were planning on providing the data to 
multiple users and applications…


Does that help?

On Nov 7, 2016, at 3:41 AM, Mudit Kumar 
> wrote:

Hi,

Do ranger provide security to spark?If yes,then in what capacity.

Thanks,
Mudit



Re: Save a spark RDD to disk

2016-11-09 Thread Michael Segel
Can you increase the number of partitions and also increase the number of 
executors?
(This should improve the parallelization but you may become disk i/o bound)

On Nov 8, 2016, at 4:08 PM, Elf Of Lothlorein 
> wrote:

Hi
I am trying to save an RDD to disk and I am using saveAsNewAPIHadoopFile for 
that. I am seeing that it takes almost 20 mins for about 900 GB of data. Is 
there any parameter that I can tune to make this saving faster?
I am running about 45 executors with 5 cores each on 5 Spark worker nodes and 
using Spark on YARN for this.
Thanks for your help.
C



Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Michael Segel
Oozie, a product only a mad Russian would love. ;-)

Just say no to hive. Go from Flat to Parquet.
(This sounds easy, but there’s some work that has to occur…)

Sorry for being cryptic, Mich’s question is pretty much generic for anyone 
building a data lake so it ends up overlapping with some work that I have to do…

-Mike
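
A minimal sketch of that "flat file to Parquet" step (paths and options are illustrative 
assumptions): read the CSV landing directory the ETL tool drops files into, then persist 
it as Parquet for querying.

// One pass per daily landing directory; the schema is inferred or supplied per table.
val raw = spark.read
  .option("header", "true")        // assuming the extracts carry a header row
  .option("inferSchema", "true")   // or supply an explicit schema per table
  .csv("hdfs:///landing/rdbms/2016-11-09/")

raw.write.mode("overwrite").parquet("hdfs:///warehouse/rdbms/2016-11-09/")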

On Nov 9, 2016, at 4:16 PM, Mich Talebzadeh 
> wrote:

Thanks guys,

Sounds like we let Informatica get the data out of the RDBMS and create mappings to 
flat files that will be delivered to a directory visible to the HDFS host. Then 
push the CSV files into HDFS. Then there are a number of options to work on:


  1.  run cron or oozie to get data out of HDFS (or build external Hive table 
on that directory) and do insert/select into Hive managed table
  2.  alternatively use a spark job to get CSV data into RDD and then create 
tempTable and do insert/select from tempTable to Hive table. Bear in mind that 
we need a spark job tailored to each table schema

I believe the above is feasible?





Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 9 November 2016 at 21:26, Jörn Franke 
> wrote:
Basically you mention the options. However, there are several ways how 
informatica can extract (or store) from/to rdbms. If the native option is not 
available then you need to go via JDBC as you have described.
Alternatively (but only if it is worth it) you can schedule fetching of the 
files via oozie and use it to convert the csv into orc/ parquet etc.
If this is a common use case in the company you can extend Informatica with 
Java classes that, for instance, convert the data directly into Parquet or ORC. 
However, it is some effort.

On 9 Nov 2016, at 14:56, Mich Talebzadeh 
> wrote:

Hi,

I am exploring the idea of flexibility with importing multiple RDBMS tables 
using Informatica that customer has into HDFS.

I don't want to use connectivity tools from Informatica to Hive etc.

So this is what I have in mind


  1.  If possible get the tables' data out using Informatica and use the Informatica UI 
to convert the RDBMS data into some form of CSV/TSV file (Can Informatica do it? 
I guess yes).
  2.  Put the flat files on an edge node where an HDFS node can see them.
  3.  Assuming that a directory can be created by Informatica daily, periodically run 
a cron that ingests that data from the directories into equivalent daily HDFS 
directories.
  4.  Once the data is in HDFS one can use Spark CSV, Hive etc. to query the data.

The problem I have is to see if someone has done such a thing before. 
Specifically, can Informatica create target flat files in normal directories?

Any other generic alternative?

Thanks

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.






Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
It's a quasi-columnar store.
Sort of a hybrid approach.


On Oct 17, 2016, at 4:30 PM, Mich Talebzadeh 
<mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>> wrote:

I assume that HBase is more of a columnar data store by virtue of it storing 
column data together.

Many interpretations of this are all over the place. However, it is not columnar in 
the sense of a column-based (as opposed to row-based) implementation of the 
relational model.



Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 17 October 2016 at 22:14, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:
An OLTP use case scenario does not necessarily mean the traditional OLTP. See also 
apache hawk etc.; they can indeed fit some use cases well, others less so.

On 17 Oct 2016, at 23:02, Michael Segel 
<msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>> wrote:

You really don’t want to do OLTP on a distributed NoSQL engine.
Remember Big Data isn’t relational; it’s more of a hierarchical model or record 
model. Think IMS or Pick (Dick Pick’s Revelation, U2, Universe, etc …)


On Oct 17, 2016, at 3:45 PM, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:

It has some implication because it imposes the SQL model on Hbase. Internally 
it translates the SQL queries into custom Hbase processors. Keep also in mind 
for what Hbase need a proper key design and how Phoenix designs those keys to 
get the best performance out of it. I think for oltp it is a workable model and 
I think they plan to offer Phoenix as a default interface as part of Hbase 
anyway.
For OLAP it depends.


On 17 Oct 2016, at 22:34, ayan guha 
<guha.a...@gmail.com<mailto:guha.a...@gmail.com>> wrote:


Hi

Any reason not to recommend Phoenix? I haven't used it myself so I am curious about 
the pros and cons of using it.

On 18 Oct 2016 03:17, "Michael Segel" 
<msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>> wrote:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, it’s probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… it’s not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface in to the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
<bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in a ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This is 
also goes for the REST Endpoints. 3rd party services will hit ours to update 
our data with no need to read from our data. And, when we want to update their 
data, we will hit theirs to update their data using a triggered job.

This al boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:

please keep also in mind that Tableau Server has the capabilities to store data 
in-memory and refresh only when needed the in-memory data. This means you can 
import it from any source and let your 

Re: spark with kerberos

2016-10-18 Thread Michael Segel
(Sorry sent reply via wrong account.. )

Steve,

Kinda hijacking the thread, but I promise its still on topic to OP’s issue.. ;-)

Usually you will end up having a local Kerberos set up per cluster.
So your machine accounts (hive, yarn, hbase, etc …) are going to be local  to 
the cluster.

So you will have to set up some sort of realm trusts between the clusters.

If you’re going to be setting up security (Kerberos … ick! shivers… ;-) you’re 
going to want to keep the machine accounts isolated to the cluster.
And the OP said that he didn’t control the other cluster which makes me believe 
that they are separate.


I would also think that you would have trouble with the credential… isn’t it 
tied to a user at a specific machine?
(Its been a while since I looked at this and I drank heavily to forget 
Kerberos… so I may be a bit fuzzy here.)

Thx

-Mike
On Oct 18, 2016, at 2:59 PM, Steve Loughran 
<ste...@hortonworks.com<mailto:ste...@hortonworks.com>> wrote:


On 17 Oct 2016, at 22:11, Michael Segel 
<michael_se...@hotmail.com<mailto:michael_se...@hotmail.com>> wrote:

@Steve you are going to have to explain what you mean by ‘turn Kerberos on’.

Taken one way… it could mean making cluster B secure and running Kerberos and 
then you’d have to create some sort of trust between B and C,



I'd imagined making cluster B a kerberized cluster.

I don't think you need to go near trust relations though —ideally you'd just 
want the same accounts everywhere if you can, if not, the main thing is that 
the user submitting the job can get a credential for  that far NN at job 
submission time, and that credential is propagated all the way to the executors.


Did you mean turn on kerberos on the nodes in Cluster B so that each node 
becomes a trusted client that can connect to C

OR

Did you mean to turn on kerberos on the master node (eg edge node) where the 
data persists if you collect() it so its off the cluster on to a single machine 
and then push it from there so that only that machine has to have kerberos 
running and is a trusted server to Cluster C?


Note: In option 3, I hope I said it correctly, but I believe that you would be 
collecting the data to a client (edge node) before pushing it out to the 
secured cluster.





Does that make sense?

On Oct 14, 2016, at 1:32 PM, Steve Loughran 
<ste...@hortonworks.com<mailto:ste...@hortonworks.com>> wrote:


On 13 Oct 2016, at 10:50, dbolshak 
<bolshakov.de...@gmail.com<mailto:bolshakov.de...@gmail.com>> wrote:

Hello community,

We've a challenge and no ideas how to solve it.

The problem,

Say we have the following environment:
1. `cluster A`, the cluster does not use kerberos and we use it as a source
of data, important thing is - we don't manage this cluster.
2. `cluster B`, small cluster where our spark application is running and
performing some logic. (we manage this cluster and it does not have
kerberos).
3. `cluster C`, the cluster uses kerberos and we use it to keep results of
our spark application, we manage this cluster

Our requrements and conditions that are not mentioned yet:
1. All clusters are in a single data center, but in the different
subnetworks.
2. We cannot turn on kerberos on `cluster A`
3. We cannot turn off kerberos on `cluster C`
4. We can turn on/off kerberos on `cluster B`, currently it's turned off.
5. Spark app is built on top of RDD and does not depend on spark-sql.

Does anybody know how to write data using RDD api to remote cluster which is
running with Kerberos?

If you want to talk to the secure cluster, C, from code running in cluster B, 
you'll need to turn kerberos on there. Maybe, maybe, you could just get away 
with kerberos being turned off, but you, the user, launching the application 
while logged in to kerberos yourself and so trusted by Cluster C.

one of the problems you are likely to hit with Spark here is that it's only 
going to collect the tokens you need to talk to HDFS at the time you launch the 
application, and by default, it only knows about the cluster FS. You will need 
to tell spark about the other filesystem at launch time, so it will know to 
authenticate with it as you, then collect the tokens needed for the application 
itself to work with kerberos.

spark.yarn.access.namenodes=hdfs://cluster-c:8080

-Steve

ps: https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/






Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
This may be a silly mistake on my part…

Doing an example using Chicago’s Crime data.. (There’s a lot of it going 
around. ;-)

The goal is to read a file containing a JSON record that describes the crime 
data.csv for ingestion into a data frame, then I want to output to a Parquet 
file.
(Pretty simple right?)

I ran this both in Zeppelin and in the Spark-Shell (2.01)

// Setup of environment
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic 
example").config("spark.some.config.option", "some-value").getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

// Load the JSON from file:
val df = spark.read.json(“~/datasets/Chicago_Crimes.json")
df.show()


The output
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
++
| _corrupt_record|
++
| {|
| "metadata": {|
| "source": "CSV_...|
| "table": "Chica...|
| "compression": ...|
| },|
| "columns": [{|
| "col_name": "Id",|
| "data_type": "I...|
| }, {|
| "col_name": "Ca...|
| "data_type": "B...|
| }, {|

I checked the JSON file against a JSONLint tool (two actually)
My JSON record is valid w no errors. (see below)

So what’s happening?  What am I missing?
The goal is to create an ingestion schema for each source. From this I can 
build the schema for the Parquet file or other data target.

Thx

-Mike

My JSON record:
{
"metadata": {
"source": "CSV_FILE",
"table": "Chicago_Crime",
"compression": "SNAPPY"
},
"columns": [{
"col_name": "Id",
"data_type": "INT64"
}, {
"col_name": "Case_No.",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Date",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Block",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "IUCR",
"data_type": "INT32"
}, {
"col_name": "Primary_Type",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Description",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Location_Description",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Arrest",
"data_type": "BOOLEAN"
}, {
"col_name": "Domestic",
"data_type": "BOOLEAN"
}, {
"col_name": "Beat",
"data_type": "BYTE_ARRAYI"
}, {
"col_name": "District",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Ward",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Community",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "FBI_Code",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "X_Coordinate",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Y_Coordinate",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Year",
"data_type": "INT32"
}, {
"col_name": "Updated_On",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Latitude",
"data_type": "DOUBLE"
}, {
"col_name": "Longitude",
"data_type": "DOUBLE"
}, {
"col_name": "Location",
"data_type": "BYTE_ARRAY"


}]
}


Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
ARGH!!

Looks like a formatting issue.  Spark doesn’t like ‘pretty’ output.

So then the entire record which defines the schema has to be a single line?

Really?
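
With the Spark versions in this thread (2.0.x) that is indeed the case: each JSON record 
has to sit on a single line. For reference, later releases (2.2 and up) added a multiLine 
option that reads "pretty", multi-line JSON directly; a sketch:

// Only works on Spark 2.2 or later; the path is the one used in this thread.
val df = spark.read
  .option("multiLine", true)
  .json("~/datasets/Chicago_Crimes.json")
df.printSchema()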

On Nov 2, 2016, at 1:50 PM, Michael Segel 
<msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>> wrote:

This may be a silly mistake on my part…

Doing an example using Chicago’s Crime data.. (There’s a lot of it going 
around. ;-)

The goal is to read a file containing a JSON record that describes the crime 
data.csv for ingestion into a data frame, then I want to output to a Parquet 
file.
(Pretty simple right?)

I ran this both in Zeppelin and in the Spark-Shell (2.01)

// Setup of environment
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic 
example").config("spark.some.config.option", "some-value").getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

// Load the JSON from file:
val df = spark.read.json(“~/datasets/Chicago_Crimes.json")
df.show()


The output
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
++
| _corrupt_record|
++
| {|
| "metadata": {|
| "source": "CSV_...|
| "table": "Chica...|
| "compression": ...|
| },|
| "columns": [{|
| "col_name": "Id",|
| "data_type": "I...|
| }, {|
| "col_name": "Ca...|
| "data_type": "B...|
| }, {|

I checked the JSON file against a JSONLint tool (two actually)
My JSON record is valid w no errors. (see below)

So what’s happening?  What am I missing?
The goal is to create an ingestion schema for each source. From this I can 
build the schema for the Parquet file or other data target.

Thx

-Mike

My JSON record:
{
"metadata": {
"source": "CSV_FILE",
"table": "Chicago_Crime",
"compression": "SNAPPY"
},
"columns": [{
"col_name": "Id",
"data_type": "INT64"
}, {
"col_name": "Case_No.",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Date",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Block",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "IUCR",
"data_type": "INT32"
}, {
"col_name": "Primary_Type",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Description",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Location_Description",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Arrest",
"data_type": "BOOLEAN"
}, {
"col_name": "Domestic",
"data_type": "BOOLEAN"
}, {
"col_name": "Beat",
"data_type": "BYTE_ARRAYI"
}, {
"col_name": "District",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Ward",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Community",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "FBI_Code",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "X_Coordinate",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Y_Coordinate",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Year",
"data_type": "INT32"
}, {
"col_name": "Updated_On",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Latitude",
"data_type": "DOUBLE"
}, {
"col_name": "Longitude",
"data_type": "DOUBLE"
}, {
"col_name": "Location",
"data_type": "BYTE_ARRAY"


}]
}



Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel

On Nov 2, 2016, at 2:22 PM, Daniel Siegmann 
> wrote:

Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines 
as a record separator by default. While it is possible to use a different 
string as a record separator, what would you use in the case of JSON?

If you do some Googling I suspect you'll find some possible solutions. 
Personally, I would just use a separate JSON library (e.g. json4s) to parse 
this metadata into an object, rather than trying to read it in through Spark.


Yeah, that’s the basic idea.

This JSON is metadata to help drive the process not row records… although the 
column descriptors are row records so in the short term I could cheat and just 
store those in a file.

:-(
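
For what it's worth, a minimal sketch of the json4s route Daniel suggests, assuming json4s is on the classpath; the case classes simply mirror the metadata record shown earlier in this thread:

import org.json4s._
import org.json4s.jackson.JsonMethods._

// Case classes that mirror the ingestion-spec JSON (names taken from that record)
case class ColumnDef(col_name: String, data_type: String)
case class TableMeta(source: String, table: String, compression: String)
case class IngestionSpec(metadata: TableMeta, columns: List[ColumnDef])

implicit val formats: Formats = DefaultFormats

// Hypothetical local path: read the whole document, then extract it into the case classes
val raw = scala.io.Source.fromFile("/path/to/Chicago_Crimes.json").mkString
val spec = parse(raw).extract[IngestionSpec]
spec.columns.foreach(c => println(s"${c.col_name} -> ${c.data_type}"))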

--
Daniel Siegmann
Senior Software Engineer
SecurityScorecard Inc.
214 W 29th Street, 5th Floor
New York, NY 10001



Re: Quirk in how Spark DF handles JSON input records?

2016-11-03 Thread Michael Segel
Hi,

I understand.

With XML, if you know the tag you want to group by, you can use a multi-line 
input format and just advance in to the split until you find that tag.
Much more difficult in JSON.




On Nov 3, 2016, at 2:41 AM, Mendelson, Assaf 
<assaf.mendel...@rsa.com<mailto:assaf.mendel...@rsa.com>> wrote:

I agree this can be a little annoying. The reason this is done this way is to 
enable cases where the json file is huge. To allow splitting it, a separator is 
needed and newline is the separator used (as is done in all text files in 
Hadoop and spark).
I always wondered why support has not been implemented for cases where each 
file is small (e.g. has one object), but the implementation now assumes each line 
holds a legal JSON object.

What I do to overcome this is use RDDs (using pyspark):

# get an RDD of the text content. The map is used because wholeTextFiles
# returns a tuple of (filename, file content)
jsonRDD = sc.wholeTextFiles(filename).map(lambda x: x[1])

# remove whitespace. This can actually be too much, since it also strips whitespace
# inside string values, so you may want to remove just the end-of-line characters (e.g. \r, \n)
import re
js = jsonRDD.map(lambda x: re.sub(r"\s+", "", x, flags=re.UNICODE))

# convert the RDD to a DataFrame. If you have your own schema, this is where you should add it.
df = spark.read.json(js)

Assaf.
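
For the Scala shell, a roughly equivalent sketch of the same workaround (the path is hypothetical and the newline stripping is deliberately crude):

// Read each whole file as a single record; drop the filename from the (path, content) pair
val jsonRDD = spark.sparkContext
  .wholeTextFiles("/path/to/metadata/*.json")
  .map { case (_, content) => content.replaceAll("[\r\n]", "") }

// In Spark 2.0/2.1 the JSON reader accepts an RDD[String] directly
val df = spark.read.json(jsonRDD)
df.printSchema()

Later releases (Spark 2.2 and up) also added a multiLine option to the JSON reader, which removes the need for this workaround entirely.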

From: Michael Segel [mailto:msegel_had...@hotmail.com]
Sent: Wednesday, November 02, 2016 9:39 PM
To: Daniel Siegmann
Cc: user @spark
Subject: Re: Quirk in how Spark DF handles JSON input records?


On Nov 2, 2016, at 2:22 PM, Daniel Siegmann 
<dsiegm...@securityscorecard.io<mailto:dsiegm...@securityscorecard.io>> wrote:

Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines 
as a record separator by default. While it is possible to use a different 
string as a record separator, what would you use in the case of JSON?
If you do some Googling I suspect you'll find some possible solutions. 
Personally, I would just use a separate JSON library (e.g. json4s) to parse 
this metadata into an object, rather than trying to read it in through Spark.


Yeah, that’s the basic idea.

This JSON is metadata to help drive the process not row records… although the 
column descriptors are row records so in the short term I could cheat and just 
store those in a file.

:-(


--
Daniel Siegmann
Senior Software Engineer
SecurityScorecard Inc.
214 W 29th Street, 5th Floor
New York, NY 10001



Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface in to the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in a ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This is 
also goes for the REST Endpoints. 3rd party services will hit ours to update 
our data with no need to read from our data. And, when we want to update their 
data, we will hit theirs to update their data using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfra...@gmail.com> wrote:

please keep also in mind that Tableau Server has the capabilities to store data 
in-memory and refresh only when needed the in-memory data. This means you can 
import it from any source and let your users work only on the in-memory data in 
Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided 
already a good alternative. However, you should check if it contains a recent 
version of Hbase and Phoenix. That being said, I just wonder what is the 
dataflow, data model and the analysis you plan to do. Maybe there are 
completely different solutions possible. Especially these single inserts, 
upserts etc. should be avoided as much as possible in the Big Data (analysis) 
world with any technology, because they do not perform well.

Hive with Llap will provide an in-memory cache for interactive analytics. You 
can put full tables in-memory with Hive using Ignite HDFS in-memory solution. 
All this does only make sense if you do not use MR as an engine, the right 
input format (ORC, parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 
as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will 
either try Phoenix JDBC Server for HBase or push to move faster to Kudu with 
Impala. We will use Impala as the JDBC in-between until the Kudu team completes 
Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Sure. But essentially you are looking at batch data for analytics for your 
tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC 
connection to Tableau already.

I would go for Hive especially the new release will have an in-memory offering 
as well for frequently accessed data :)


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
Mich,

First and 

Indexing w spark joins?

2016-10-17 Thread Michael Segel
Hi,

Apologies if I’ve asked this question before but I didn’t see it in the list 
and I’m certain that my last surviving brain cell has gone on strike over my 
attempt to reduce my caffeine intake…

Posting this to both user and dev because I think the question / topic jumps in 
to both camps.


Again since I’m a relative newbie on spark… I may be missing something so 
apologies up front…


With respect to Spark SQL,  in pre 2.0.x,  there were only hash joins?  In post 
2.0.x you have hash, semi-hash, and sorted list merge.

For the sake of simplicity… lets forget about cross product joins…

Has anyone looked at how we could use inverted tables to improve query 
performance?

The issue is that when you have a data sewer (lake) , what happens when your 
use case query is orthogonal to how your data is stored? This means full table 
scans.
By using secondary indexes, we can reduce this albeit at a cost of increasing 
your storage footprint by the size of the index.

Are there any JIRAs open that discuss this?

Indexes to assist in terms of ‘predicate push downs’ (using the index when a 
field in a where clause is indexed) rather than performing a full table scan.
Indexes to assist in the actual join if the join column is on an indexed column?

In the first case, you would use an inverted table to produce a sort-ordered set of row keys 
that you would then use in the join process (the same as if you produced the subset 
based on the filter).
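
To make the idea concrete, a hedged sketch with plain DataFrames, where the table and column names are hypothetical: the "index" is just a narrow (value, key) table that is filtered first and then joined back by key. This only illustrates the shape of the query; without storage-level support Spark still has to scan the fact table to perform the join.

import org.apache.spark.sql.functions.avg
import spark.implicits._

// Narrow secondary index: (make, row_key). Filter it first.
val matchingKeys = spark.table("make_idx")
  .filter($"make" === "Volvo")
  .select("row_key")

// Carry only the surviving keys into the wide claims table and aggregate
val volvoClaims = spark.table("claims").join(matchingKeys, "row_key")
volvoClaims.agg(avg($"repair_cost")).show()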

To put this in perspective… here’s a dummy use case…

CCCis (CCC) is the middle man in the insurance industry. They have a piece of 
software that sits in the repair shop (e.g Joe’s Auto Body) and works with 
multiple insurance carriers.
The primary key in their data is going to be Insurance Company | Claim ID.  
This makes it very easy to find a specific claim for further processing.

Now lets say I want to do some analysis on determining the average cost of 
repairing a front end collision of a Volvo S80?
Or
Break down the number and types of accidents by car manufacturer , model and 
color.  (Then see if there is any correlation between car color and # and type 
of accidents)


As you can see, all of these queries are orthogonal to my storage.  So I need 
to create secondary indexes to help sift thru the data efficiently.

Does this make sense?

Please Note: I did some work for CCC back in the late 90’s. Any resemblance to 
their big data efforts is purely coincidence  and you can replace CCC with 
Allstate, Progressive, StateFarm or some other auto insurance company …

Thx

-Mike




Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Michael Segel
Mitch,

Short answer… no, it doesn’t scale.

Longer answer…

You are using an UUID as the row key?  Why?  (My guess is that you want to 
avoid hot spotting)

So you’re going to have to pull in all of the data… meaning a full table scan… 
and then perform a sort order transformation, dropping the UUID in the process.

You would be better off not using HBase and instead storing the data in Parquet files 
in a directory partitioned on date. Or, if you stay with HBase, the rowkey would be 
max_ts - TS so that your data is in LIFO order.
Note: I’ve used the term epoch to describe the max value of a long (8 bytes of 
‘FF’ ) for the max_ts. This isn’t a good use of the term epoch, but if anyone 
has a better term, please let me know.
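
A minimal sketch of that reversed-timestamp key, here using Long.MaxValue as the "epoch" value and assuming HBase's Bytes utility is on the classpath:

import org.apache.hadoop.hbase.util.Bytes

// Newest rows sort first because the key is the distance from Long.MaxValue
def lifoRowKey(tsMillis: Long): Array[Byte] = Bytes.toBytes(Long.MaxValue - tsMillis)

val key = lifoRowKey(System.currentTimeMillis())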



Having said that… if you want to use HBase, you could do the same thing.  If 
you want to avoid hot spotting, you could load the day’s transactions using a 
bulk loader so that you don’t have to worry about splits.

But that’s just my $0.02 cents worth.

HTH

-Mike

PS. If you wanted to capture the transactions… you could do the following 
schema:

1) Rowkey = max_ts - TS
2) Rows contain the following:
CUSIP (Transaction ID)
Party 1 (Seller)
Party 2 (Buyer)
Symbol
Qty
Price

This is a trade ticket.



On Oct 16, 2016, at 1:37 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi,

I have trade data stored in Hbase table. Data arrives in csv format to HDFS and 
then loaded into Hbase via periodic load with 
org.apache.hadoop.hbase.mapreduce.ImportTsv.

The Hbase table has one Column family "trade_info" and three columns: ticker, 
timecreated, price.

The RowKey is UUID. So each row has UUID, ticker, timecreated and price in the 
csv file

Each row in Hbase is a key, value map. In my case, I have one Column Family and 
three columns. Without going into semantics I see Hbase as a column oriented 
database where column data stay together.

So I thought of this way of accessing the data.

I define an RDD for each column in the column family as below. In this case 
column trade_info:ticker

//create rdd
// conf is assumed to be an HBaseConfiguration with TableInputFormat.INPUT_TABLE set to the table name
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

val rdd1 = hBaseRDD.map(tuple => tuple._2).map(result => (result.getRow,
  result.getColumn("price_info".getBytes(), "ticker".getBytes()))).map(row => {
  (
    row._1.map(_.toChar).mkString,
    row._2.asScala.reduceLeft {
      (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
    }.getValue.map(_.toChar).mkString
  )
})
case class columns(key: String, ticker: String)
val dfticker = rdd1.toDF.map(p => columns(p(0).toString, p(1).toString))

Note that the end result is a DataFrame with the RowKey -> key and column -> 
ticker

I use the same approach to create two other DataFrames, namely dftimecreated 
and dfprice for the two other columns.

Note that if I don't need a column, then I do not create a DF for it. So a DF 
with each column I use. I am not sure how this compares if I read the full row 
through other methods if any.

Anyway all I need to do after creating a DataFrame for each column is to join 
themthrough RowKey to slice and dice data. Like below.

Get me the latest prices ordered by timecreated and ticker (ticker is stock)

val rs = 
dfticker.join(dftimecreated,"key").join(dfprice,"key").orderBy('timecreated 
desc, 'price desc).select('timecreated, 'ticker, 
'price.cast("Float").as("Latest price"))
rs.show(10)

+-------------------+------+------------+
|        timecreated|ticker|Latest price|
+-------------------+------+------------+
|2016-10-16T18:44:57|   S16|   97.631966|
|2016-10-16T18:44:57|   S13|    92.11406|
|2016-10-16T18:44:57|   S19|    85.93021|
|2016-10-16T18:44:57|   S09|   85.714645|
|2016-10-16T18:44:57|   S15|    82.38932|
|2016-10-16T18:44:57|   S17|    80.77747|
|2016-10-16T18:44:57|   S06|    79.81854|
|2016-10-16T18:44:57|   S18|    74.10128|
|2016-10-16T18:44:57|   S07|    66.13622|
|2016-10-16T18:44:57|   S20|    60.35727|
+-------------------+------+------------+
only showing top 10 rows

Is this a workable solution?


Thanks


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.





Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
You forgot to mention that if you roll your own… you can toss your own level of 
security on top of it.

For most, that’s not important.
For those working with PII type of information… kinda important, especially 
when the rules can get convoluted.


On Oct 17, 2016, at 12:14 PM, vincent gromakowski 
<vincent.gromakow...@gmail.com<mailto:vincent.gromakow...@gmail.com>> wrote:

I would suggest to code your own Spark thriftserver which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic 
because it's a spark job and then start a thrift server on temporary table. For 
example you can query a micro batch rdd from a kafka stream, or pre load some 
tables and implement a rolling cache to periodically update the spark in memory 
tables with persistent store...
It's not part of the public API and I don't know yet what are the issues doing 
this but I think Spark community should look at this path: making the 
thriftserver be instantiable in any spark job.

2016-10-17 18:17 GMT+02:00 Michael Segel 
<msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface in to the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
<bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in a ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This is 
also goes for the REST Endpoints. 3rd party services will hit ours to update 
our data with no need to read from our data. And, when we want to update their 
data, we will hit theirs to update their data using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:

please keep also in mind that Tableau Server has the capabilities to store data 
in-memory and refresh only when needed the in-memory data. This means you can 
import it from any source and let your users work only on the in-memory data in 
Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided 
already a good alternative. However, you should check if it contains a recent 
version of Hbase and Phoenix. That being said, I just wonder what is the 
dataflow, data model and the analysis you plan to do. Maybe there are 
completely different solutions possible. Especially these single inserts, 
upserts etc. should be avoided as much as possible in the Big Data (analysis) 
world with any technology, because they do not perform well.

Hive with Llap will provide an in-memory cache for interactive analytics. You 
can put full tables in-memory with Hive using Ignite HDFS in-memory solution. 
All this does only make sense if you do not use MR as an engine, the right 
input format (ORC, parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim 
<bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 
as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will 

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
Skip Phoenix

On Oct 17, 2016, at 2:20 PM, Thakrar, Jayesh 
<jthak...@conversantmedia.com<mailto:jthak...@conversantmedia.com>> wrote:

Ben,

Also look at Phoenix (Apache project) which provides a better (one of the best) 
SQL/JDBC layer on top of HBase.
http://phoenix.apache.org/

Cheers,
Jayesh


From: vincent gromakowski 
<vincent.gromakow...@gmail.com<mailto:vincent.gromakow...@gmail.com>>
Date: Monday, October 17, 2016 at 1:53 PM
To: Benjamin Kim <bbuil...@gmail.com<mailto:bbuil...@gmail.com>>
Cc: Michael Segel 
<msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>>, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>>, Mich Talebzadeh 
<mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>>, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>, 
"user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: Spark SQL Thriftserver with HBase

Instead of (or in addition to) saving results somewhere, you just start a 
thriftserver that exposes the Spark tables of the SQLContext (or SparkSession 
now). That means you can implement any logic (and maybe use structured 
streaming) to expose your data. Today, using the thriftserver means reading data 
from the persistent store on every query, so if the data modeling doesn't fit the 
query it can take quite a long time. What you generally do in a common Spark job is 
load the data and cache the Spark table in an in-memory columnar table, which is 
quite efficient for any kind of query; the counterpart is that the cache isn't 
updated, so you have to implement a reload mechanism, and this solution isn't 
available using the thriftserver.
What I propose is to mix the two worlds: periodically/delta load data into the Spark 
table cache and expose it through the thriftserver. But you have to implement 
the loading logic; it can be very simple or very complex depending on your 
needs.
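
A minimal sketch of that pattern, assuming the spark-hive-thriftserver module is on the classpath; the source path and view name are hypothetical:

import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Load and cache whatever the JDBC/ODBC users should see, then register it as a view
val trades = spark.read.parquet("/data/trades")
trades.cache()
trades.createOrReplaceTempView("trades")

// Expose this session's views over the Hive thrift protocol from inside the job
HiveThriftServer2.startWithContext(spark.sqlContext)

Anything cached or registered in that session is then visible to JDBC clients; the reload logic around the cached view is whatever the use case needs.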


2016-10-17 19:48 GMT+02:00 Benjamin Kim 
<bbuil...@gmail.com<mailto:bbuil...@gmail.com>>:
Is this technique similar to what Kinesis is offering or what Structured 
Streaming is going to have eventually?

Just curious.

Cheers,
Ben


On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
<vincent.gromakow...@gmail.com<mailto:vincent.gromakow...@gmail.com>> wrote:

I would suggest to code your own Spark thriftserver which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic 
because it's a spark job and then start a thrift server on temporary table. For 
example you can query a micro batch rdd from a kafka stream, or pre load some 
tables and implement a rolling cache to periodically update the spark in memory 
tables with persistent store...
It's not part of the public API and I don't know yet what are the issues doing 
this but I think Spark community should look at this path: making the 
thriftserver be instantiable in any spark job.

2016-10-17 18:17 GMT+02:00 Michael Segel 
<msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface in to the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
<bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in a ID linking table. Even better, many 
of t

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
ple you can query a micro batch rdd from a kafka stream, or pre load some 
tables and implement a rolling cache to periodically update the spark in memory 
tables with persistent store...
It's not part of the public API and I don't know yet what are the issues doing 
this but I think Spark community should look at this path: making the 
thriftserver be instantiable in any spark job.

2016-10-17 18:17 GMT+02:00 Michael Segel 
<msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface in to the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
<bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in a ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This is 
also goes for the REST Endpoints. 3rd party services will hit ours to update 
our data with no need to read from our data. And, when we want to update their 
data, we will hit theirs to update their data using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:

please keep also in mind that Tableau Server has the capabilities to store data 
in-memory and refresh only when needed the in-memory data. This means you can 
import it from any source and let your users work only on the in-memory data in 
Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided 
already a good alternative. However, you should check if it contains a recent 
version of Hbase and Phoenix. That being said, I just wonder what is the 
dataflow, data model and the analysis you plan to do. Maybe there are 
completely different solutions possible. Especially these single inserts, 
upserts etc. should be avoided as much as possible in the Big Data (analysis) 
world with any technology, because they do not perform well.

Hive with Llap will provide an in-memory cache for interactive analytics. You 
can put full tables in-memory with Hive using Ignite HDFS in-memory solution. 
All this does only make sense if you do not use MR as an engine, the right 
input format (ORC, parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim 
<bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 
as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will 
either try Phoenix JDBC Server for HBase or push to move faster to Kudu with 
Impala. We will use Impala as the JDBC in-between until the Kudu team completes 
Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh 
<mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>> wrote:

Sure. But essentially you are looking at batch data for analytics for your 
tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC 
connection to Tableau already.

I would go for Hive especially the new release will have an in-memory offering 
as well for frequently accessed data :)


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
You really don’t want to do OLTP on a distributed NoSQL engine.
Remember Big Data isn’t relational; it’s more of a hierarchy model or record 
model. Think IMS or Pick (Dick Pick’s revelation, U2, Universe, etc …)


On Oct 17, 2016, at 3:45 PM, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:

It has some implication because it imposes the SQL model on Hbase. Internally 
it translates the SQL queries into custom Hbase processors. Keep also in mind 
for what Hbase need a proper key design and how Phoenix designs those keys to 
get the best performance out of it. I think for oltp it is a workable model and 
I think they plan to offer Phoenix as a default interface as part of Hbase 
anyway.
For OLAP it depends.


On 17 Oct 2016, at 22:34, ayan guha 
<guha.a...@gmail.com<mailto:guha.a...@gmail.com>> wrote:


Hi

Any reason not to recommend Phoneix? I haven't used it myself so curious about 
pro's and cons about the use of it.

On 18 Oct 2016 03:17, "Michael Segel" 
<msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>> wrote:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface in to the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
<bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in a ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This is 
also goes for the REST Endpoints. 3rd party services will hit ours to update 
our data with no need to read from our data. And, when we want to update their 
data, we will hit theirs to update their data using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:

please keep also in mind that Tableau Server has the capabilities to store data 
in-memory and refresh only when needed the in-memory data. This means you can 
import it from any source and let your users work only on the in-memory data in 
Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided 
already a good alternative. However, you should check if it contains a recent 
version of Hbase and Phoenix. That being said, I just wonder what is the 
dataflow, data model and the analysis you plan to do. Maybe there are 
completely different solutions possible. Especially these single inserts, 
upserts etc. should be avoided as much as possible in the Big Data (analysis) 
world with any technology, because they do not perform well.

Hive with Llap will provide an in-memory cache for interactive analytics. You 
can put full tables in-memory with Hive using Ignite HDFS in-memory solution. 
All this does only make sense if you do not use MR as an engine, the right 
input format (ORC, parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim 
<bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 
as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will 
either try Phoenix JDBC Server for HBase or push to move faster to Kudu with 
Impala. We will use Impala as the JDBC in-between u

Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Michael Segel
Hi,

Just came across this while looking at the docs on how to use Spark’s Kmeans 
clustering.

Note: This appears to be true in both 2.1 and 2.2 documentation.

The overview page:
https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means

Here the example contains the following line:

val clusters = KMeans.train(parsedData, numClusters, numIterations)

I was trying to get more information on the train() method.
So I checked out the KMeans Scala API:
https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans

The issue is that I couldn’t find the train method…

So I thought I was slowly losing my mind.

I checked out the entire API page… could not find any API docs which describe 
the method train().

I ended up looking at the source code and found the method in the scala source 
code.
(You can see the code here: 
https://github.com/apache/spark/blob/v2.1.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
 )

So the method(s) exist, but not covered in the Scala API doc.
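
For reference, the call in context; this is essentially the overview-page snippet, and the data path is the sample file that ships with the Spark distribution. train() is defined on the KMeans companion object rather than the class, which may be part of why it is easy to miss in the scaladoc:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// One whitespace-separated feature vector per line
val parsedData = sc.textFile("data/mllib/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// train() lives on the companion object and returns a KMeansModel
val clusters = KMeans.train(parsedData, 2, 20)
println(s"Within Set Sum of Squared Errors = ${clusters.computeCost(parsedData)}")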

How do you raise this as a ‘bug’ ?

Thx

-Mike



Reading Hive RCFiles?

2018-01-18 Thread Michael Segel
Hi, 

I’m trying to find out if there’s a simple way for Spark to be able to read an 
RCFile. 

I know I can create a table in Hive, then drop the files in to that directory 
and use a sql context to read the file from Hive, however I wanted to read the 
file directly. 

Not a lot of details to go on… even the Apache site’s links are broken. 
See :
https://cwiki.apache.org/confluence/display/Hive/RCFile

Then try to follow the Javadoc link. 


Any suggestions? 

Thx

-Mike



Re: Reading Hive RCFiles?

2018-01-18 Thread Michael Segel
No idea on how that last line of garbage got in the message. 


> On Jan 18, 2018, at 9:32 AM, Michael Segel <msegel_had...@hotmail.com> wrote:
> 
> Hi, 
> 
> I’m trying to find out if there’s a simple way for Spark to be able to read 
> an RCFile. 
> 
> I know I can create a table in Hive, then drop the files in to that directory 
> and use a sql context to read the file from Hive, however I wanted to read 
> the file directly. 
> 
> Not a lot of details to go on… even the Apache site’s links are broken. 
> See :
> https://cwiki.apache.org/confluence/display/Hive/RCFile
> 
> Then try to follow the Javadoc link. 
> 
> 
> Any suggestions? 
> 
> Thx
> 
> -Mike
> 
> 


Re: Reading Hive RCFiles?

2018-01-29 Thread Michael Segel
Just to follow up…

I was able to create an RDD from the file, however,  diving in to the RDD is a 
bit weird, and I’m working thru it.  My test file seems to be one block … 3K 
rows. So when I tried to get the first column of the first row, I ended up 
getting all of the rows for the first column which were comma delimited.   The 
other issue is then converting numeric fields back from their byte representation. I 
have the schema so I can do that. (This is also an issue with RCFileCat, sorry if I 
messed that name up: things work great only if you’re using strings.)

I guess this could be a start of a project (time permitting) to enhance the 
ability to read older file formats as easy as it is to read Parquet and ORC 
files.

Will have to follow up in Dev.

Thanks everyone for the pointers.


On Jan 20, 2018, at 5:55 PM, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:

Forgot to add the mailinglist

On 18. Jan 2018, at 18:55, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:

Welll you can use:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopRDD-org.apache.hadoop.mapred.JobConf-java.lang.Class-java.lang.Class-java.lang.Class-int-

with the following inputformat:
https://hive.apache.org/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/RCFileInputFormat.html

(note: the version of the Javadoc does not matter; this has been possible for a 
long time).

Writing is similarly with PairRDD and RCFileOutputFormat
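
Putting those pointers together, a minimal sketch; the path is hypothetical and hive-exec (with its serde classes) is assumed to be on the classpath:

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.hive.ql.io.RCFileInputFormat
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable

val jobConf = new JobConf(sc.hadoopConfiguration)
FileInputFormat.setInputPaths(jobConf, "/path/to/rcfiles")

val rcRDD = sc.hadoopRDD(
  jobConf,
  classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
  classOf[LongWritable],
  classOf[BytesRefArrayWritable])

// Each value is one row: a BytesRefArrayWritable holding one BytesRefWritable per column.
// Decode on the executors rather than collecting the Writables themselves.
val firstColumn = rcRDD.map { case (_, row) =>
  val c = row.get(0)
  new String(c.getData, c.getStart, c.getLength, "UTF-8")
}
firstColumn.take(5).foreach(println)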

On Thu, Jan 18, 2018 at 5:02 PM, Michael Segel 
<msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>> wrote:
No idea on how that last line of garbage got in the message.


> On Jan 18, 2018, at 9:32 AM, Michael Segel 
> <msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>> wrote:
>
> Hi,
>
> I’m trying to find out if there’s a simple way for Spark to be able to read 
> an RCFile.
>
> I know I can create a table in Hive, then drop the files in to that directory 
> and use a sql context to read the file from Hive, however I wanted to read 
> the file directly.
>
> Not a lot of details to go on… even the Apache site’s links are broken.
> See :
> https://cwiki.apache.org/confluence/display/Hive/RCFile
>
> Then try to follow the Javadoc link.
>
>
> Any suggestions?
>
> Thx
>
> -Mike
>
>




Re: schema change for structured spark streaming using jsonl files

2018-04-25 Thread Michael Segel
Hi,

This is going to sound complicated.

Taken as an individual JSON document, because it’s a self-contained schema doc, 
it’s structured. However, there isn’t a persisting schema that has to be 
consistent across multiple documents, so you can consider it semi-structured.

If you’re parsing the JSON document and storing different attributes in 
separate columns… you will have a major issue because its possible for a JSON 
document to contain a new element that isn’t in your Parquet schema.

If you are going from JSON to parquet… you will probably be better off storing 
a serialized version of the JSON doc and then storing highlighted attributes in 
separate columns.
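
A minimal sketch of that shape with structured streaming, where the paths and the promoted attribute name are assumptions; the raw document rides along as a string column, so attributes that appear later simply stay inside it until you choose to promote them:

import org.apache.spark.sql.functions.get_json_object
import spark.implicits._

// One JSON document per line; the text source exposes it as a column named "value"
val raw = spark.readStream.text("/data/jsonl")

val promoted = raw
  .withColumnRenamed("value", "raw_json")
  .withColumn("field1", get_json_object($"raw_json", "$.field1"))

val query = promoted.writeStream
  .format("parquet")
  .option("path", "/data/parquet")
  .option("checkpointLocation", "/data/checkpoints")
  .start()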

HTH

-Mike


> On Apr 23, 2018, at 1:46 PM, Lian Jiang  wrote:
> 
> Hi,
> 
> I am using structured spark streaming which reads jsonl files and writes into 
> parquet files. I am wondering what's the process if jsonl files schema change.
> 
> Suppose jsonl files are generated in \jsonl folder and the old schema is { 
> "field1": String}. My proposal is:
> 
> 1. write the jsonl files with new schema (e.g. {"field1":String, 
> "field2":Int}) into another folder \jsonl2
> 2. let spark job complete handling all data in \jsonl, then stop the spark 
> streaming job.
> 3. use a spark script to convert the parquet files from old schema to new 
> schema (e.g. add a new column with some default value for "field2").
> 4. upgrade and start the spark streaming job for handling the new schema 
> jsonl files and parquet files.
> 
> Is this process correct (best)? Thanks for any clue.



Re: Merging Parquet Files

2020-09-03 Thread Michael Segel
Hi, 

I think you’re asking the right question, however you’re making an assumption 
that he’s on the cloud and he never talked about the size of the file. 

It could be that he’s got a lot of small-ish data sets.  1GB is kinda small in 
relative terms.  

Again YMMV. 

Personally if you’re going to use Spark for data engineering,  Scala first, 
Java second, then Python unless you’re a Python developer which means go w 
Python. 

I agree that wanting to have a single file needs to be explained. 
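
If the real goal is a bounded number of reasonably sized files rather than literally one, a rough sketch of the usual compaction pass; the paths and the 1 GB target are assumptions, and the sizing is approximate because Parquet is compressed:

import org.apache.hadoop.fs.{FileSystem, Path}

val inputPath = "/data/small-files"
val targetBytes = 1024L * 1024 * 1024  // aim for roughly 1 GB per output file

// Estimate how many output files the input volume justifies
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val inputBytes = fs.getContentSummary(new Path(inputPath)).getLength
val numFiles = math.max(1, (inputBytes / targetBytes).toInt)

spark.read.parquet(inputPath)
  .repartition(numFiles)
  .write.mode("overwrite")
  .parquet("/data/merged")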


> On Aug 31, 2020, at 10:52 AM, Jörn Franke  wrote:
> 
> Why only one file?
> I would go more for files of specific size, eg data is split in 1gb files. 
> The reason is also that if you need to transfer it (eg to other clouds etc) - 
> having a large file of several terabytes is bad.
> 
> It depends on your use case but you might look also at partitions etc.
> 
>> On 31.08.2020 at 16:17, Tzahi File wrote:
>> 
>> 
>> Hi, 
>> 
>> I would like to develop a process that merges parquet files. 
>> My first intention was to develop it with PySpark using coalesce(1) -  to 
>> create only 1 file. 
>> This process is going to run on a huge amount of files.
>> I wanted your advice on what is the best way to implement it (PySpark isn't 
>> a must).  
>> 
>> 
>> Thanks,
>> Tzahi