Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Michael Segel
Just out of curiosity, what would happen if you put your 10K values into a temp 
table and then did a join against it? 
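
A minimal PySpark sketch of that idea (table names and sizes are illustrative, 
not from the thread):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1000000)                                           # table being filtered
values = spark.createDataFrame([(i,) for i in range(10000)], ['id'])  # the ~10K values

# Joining (and broadcasting the small side) lets Catalyst build one hash table
# instead of evaluating a 10K-element IN predicate per row.
result = df.join(broadcast(values), on='id', how='inner')
print(result.count())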

> On Apr 5, 2017, at 4:30 PM, Maciej Bryński  wrote:
> 
> Hi,
> I'm trying to run queries with many values in the IN operator.
> 
> The result is that with more than 10K values the IN operator gets noticeably slower.
> 
> For example this code is running about 20 seconds.
> 
> df = spark.range(0, 100000, 1, 1)
> df.where('id in ({})'.format(','.join(map(str, range(100000))))).count()
> 
> Any ideas how to improve this ?
> Is it a bug ?
> -- 
> Maciek Bryński
> 





Re: Handling questions in the mailing lists

2016-11-08 Thread Michael Segel
Guys… please take what I say with a grain of salt…

The issue is that the input is a stream of messages that are addressed in a LIFO 
manner, which means messages may be ignored. The stream of data (user@spark, for 
example) is semi-structured: it contains a lot of messages, some of which are 
noise or repeats, and it is not really organized by content.


So why not try to solve this as a Big Data problem… You're streaming data into 
the 'lake', and upon ingestion you need to scan, index, and tag each message so 
that it is easier to find.

Now you can create user tools to search the messages (e.g. Spark SQL, ML, etc.). 
So you can find a target set of messages and see how many times they have been 
viewed or answered, and even query who answered them (e.g. Dean Wampler answered 
30 Spark/Scala questions this past month, or Owen was answering questions focused 
on Spark security). What features came up the most in the questions… etc.
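
Purely as an illustration (the schema and path below are made up), once the 
archive is ingested those questions become simple Spark SQL aggregations:

# Hypothetical: messages already parsed into JSON records with
# sender / answered_by / tags fields.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
msgs = spark.read.json('/lake/user-at-spark/')    # made-up path

# e.g. who answered the most questions this past month
(msgs.where(F.col('answered_by').isNotNull())
     .groupBy('answered_by')
     .count()
     .orderBy(F.desc('count'))
     .show(10))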


I guess the point I’m trying to make is that you should consider rolling your 
own tool set, or looking beyond just SO.

Some have taken to Gitter to set up online communities where discussions and 
questions can be answered… but looking at tools like Gitter (GitHub), Atlassian, 
and SO… it's a disjoint toolset.

Why not choose one, or decide to roll your own and move on with it?  (Either 
under Apache, or outside on your own.)


I apologize for my mini rant.

-Mike

On Nov 7, 2016, at 4:24 PM, Maciej Szymkiewicz wrote:


Just a couple of random thoughts regarding Stack Overflow...

  *   If we are thinking about shifting focus towards SO, all attempts at 
micromanaging should be discarded right at the beginning, especially things 
like meta tags, which are discouraged and "burninated" 
(https://meta.stackoverflow.com/tags/burninate-request/info), or thread 
bumping. Depending on the context, these are unmanageable, go against community 
guidelines, or are simply obsolete.
  *   Lack of expertise is unlikely to be an issue. Even now there are a number 
of advanced Spark users on SO. Of course, the more the merrier.

Things that can be easily improved:

  *   Identifying, improving and promoting canonical questions and answers. This 
means closing duplicates, suggesting edits to improve existing answers, and 
providing alternative solutions. It can also be used to identify gaps in the 
documentation.
  *   Providing a set of clear posting guidelines to reduce the effort required 
to identify the problem (think about http://stackoverflow.com/q/5963269, a.k.a. 
"How to make a great R reproducible example?").
  *   Helping users decide if a question is a good fit for SO (see below). API 
questions are a great fit; debugging problems like "my cluster is slow" are not.
  *   Actively cleaning up (closing, deleting) off-topic and low quality 
questions. The less junk there is to sieve through, the better the chance of 
good questions being answered.
  *   Repurposing and actively moderating SO docs 
(https://stackoverflow.com/documentation/apache-spark/topics). Right now most 
of the stuff that goes there is useless, duplicated, plagiarized, or borderline 
spam.
  *   Encouraging the community to monitor featured 
(https://stackoverflow.com/questions/tagged/apache-spark?sort=featured) and 
active & upvoted & unanswered 
(https://stackoverflow.com/unanswered/tagged/apache-spark) questions.
  *   Implementing some procedure to identify questions which are likely to be 
bugs or material for feature requests. Personally I am quite often tempted to 
simply send a link to the dev list, but I don't think that is really acceptable.
  *   Animating a Spark-related chat room. I tried this a couple of times but to 
no avail. Without a certain critical mass of users it just won't work.


On 11/07/2016 07:32 AM, Reynold Xin wrote:
This is an excellent point. If we do go ahead and feature SO as a way for users 
to ask questions more prominently, as someone who knows SO very well, would you 
be willing to help write a short guideline (ideally the shorter the better, 
which makes it hard) to direct what goes to user@ and what goes to SO?

Sure, I'll be happy to help if I can.



On Sun, Nov 6, 2016 at 9:54 PM, Maciej Szymkiewicz wrote:

Damn, I always thought that the mailing list is only for nice and welcoming 
people and there is nothing to do for me here >:)

To be serious though, there are many questions on the users list which would 
fit just fine on SO, but that is not true in general. There are dozens of 
questions which are too broad, opinion based, ask for external resources, and so 
on. If you want to direct users to SO, you have to help them decide if it is 
the right channel. Otherwise it will just create a really bad experience for 
both those seeking help and active answerers. The former will be downvoted and 
bashed, the latter will have to deal with handling all the junk and the number 
of

Indexing w spark joins?

2016-10-17 Thread Michael Segel
Hi,

Apologies if I’ve asked this question before but I didn’t see it in the list 
and I’m certain that my last surviving brain cell has gone on strike over my 
attempt to reduce my caffeine intake…

Posting this to both user and dev because I think the question / topic jumps 
into both camps.


Again since I’m a relative newbie on spark… I may be missing something so 
apologies up front…


With respect to Spark SQL, in pre-2.0.x there were only hash joins? In 
post-2.0.x you have hash, semi-hash, and sort-merge joins.

For the sake of simplicity… lets forget about cross product joins…

Has anyone looked at how we could use inverted tables to improve query 
performance?

The issue is that when you have a data sewer (lake), what happens when your 
use-case query is orthogonal to how your data is stored? This means full table 
scans.
By using secondary indexes we can reduce this, albeit at the cost of increasing 
your storage footprint by the size of the index.

Are there any JIRAs open that discuss this?

Indexes to assist with 'predicate pushdown' (using the index when a field in a 
where clause is indexed) rather than performing a full table scan.
Indexes to assist in the actual join if the join column is an indexed column?

In the first case, using an inverted table to produce a sort-ordered set of row 
keys that you would then use in the join process (the same as if you produced 
the subset based on the filter).
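
A rough PySpark sketch of that first case (all table and column names are 
invented): filter the inverted/index table on the predicate, then join the 
resulting row keys back to the base table rather than scanning it.

# Hypothetical secondary-index lookup; assumes an active SparkSession `spark`.
# index table: (indexed_value, row_key)    base table: (row_key, ...)
from pyspark.sql import functions as F

index_df = spark.table('claims_by_model_idx')   # made-up name
claims = spark.table('claims')                  # made-up name

keys = index_df.where(F.col('indexed_value') == 'Volvo S80').select('row_key')

# The small key set drives the join instead of a full scan of `claims`.
matching = claims.join(F.broadcast(keys), on='row_key', how='inner')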

To put this in perspective… here’s a dummy use case…

CCCis (CCC) is the middleman in the insurance industry. They have a piece of 
software that sits in the repair shop (e.g. Joe's Auto Body) and works with 
multiple insurance carriers.
The primary key in their data is going to be Insurance Company | Claim ID. 
This makes it very easy to find a specific claim for further processing.

Now let's say I want to do some analysis on determining the average cost of 
repairing a front-end collision on a Volvo S80?
Or
Break down the number and types of accidents by car manufacturer, model and 
color. (Then see if there is any correlation between car color and the number 
and type of accidents.)


As you can see, all of these queries are orthogonal to my storage. So I need 
to create secondary indexes to help sift through the data efficiently.

Does this make sense?

Please note: I did some work for CCC back in the late 90's. Any resemblance to 
their big data efforts is purely coincidental, and you can replace CCC with 
Allstate, Progressive, State Farm or some other auto insurance company…

Thx

-Mike




Re: Spark SQL JSON Column Support

2016-09-28 Thread Michael Segel
Silly question:
When you talk about a 'user-specified schema', do you mean for the user to 
supply an additional schema, or that you're using the schema that's described 
by the JSON string?
(or both? [either/or])

Thx

On Sep 28, 2016, at 12:52 PM, Michael Armbrust wrote:

Spark SQL has great support for reading text files that contain JSON data. 
However, in many cases the JSON data is just one column among others. This is 
particularly true when reading from sources such as Kafka. This 
PR adds a new function, from_json, 
that converts a string column into a nested StructType with a user-specified 
schema, using the same internal logic as the JSON data source.
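
For context, a small hedged example of what this looks like from PySpark once 
the function is available (column and field names here are invented):

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical: `df` has a string column `json_str` (e.g. a Kafka value).
schema = StructType([
    StructField('user_id', LongType()),
    StructField('event', StringType()),
])

parsed = df.withColumn('payload', from_json(col('json_str'), schema))
parsed.select('payload.user_id', 'payload.event').show()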

Would love to hear any comments / suggestions.

Michael



Re: Spark Thrift Server Concurrency

2016-06-23 Thread Michael Segel
Hi, 
There are a lot of moving parts and a lot of unknowns from your description, 
besides the version stuff. 

How many executors, how many cores? How much memory? 
Are you persisting (memory and disk) or just caching (memory)? 

During the execution… same tables… are you seeing a lot of shuffling of data 
for some queries and not others? 

It sounds like an interesting problem… 

> On Jun 23, 2016, at 5:21 AM, Prabhu Joseph  wrote:
> 
> Hi All,
> 
> On submitting 20 copies of the same SQL query in parallel to the Spark Thrift 
> Server, the query execution time for some queries is less than a second while 
> others take more than 2 seconds. The Spark Thrift Server logs show all 20 
> queries were submitted at the same time, 16/06/23 12:12:01, but the result 
> schemas appear at different times.
> 
> 16/06/23 12:12:01 INFO SparkExecuteStatementOperation: Running query 'select 
> distinct val2 from philips1 where key>=1000 and key<=1500
> 
> 16/06/23 12:12:02 INFO SparkExecuteStatementOperation: Result Schema: 
> ArrayBuffer(val2#2110)
> 16/06/23 12:12:03 INFO SparkExecuteStatementOperation: Result Schema: 
> ArrayBuffer(val2#2182)
> 16/06/23 12:12:04 INFO SparkExecuteStatementOperation: Result Schema: 
> ArrayBuffer(val2#2344)
> 16/06/23 12:12:05 INFO SparkExecuteStatementOperation: Result Schema: 
> ArrayBuffer(val2#2362)
> 
> There are sufficient executors running on YARN. The concurrency is affected 
> by the single driver. How can we improve the concurrency, and what are the 
> best practices?
> 
> Thanks,
> Prabhu Joseph
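
One commonly suggested mitigation for this symptom (hedged, since the thread 
does not confirm the scheduler settings in use) is to run the Thrift Server's 
shared SparkContext with the FAIR scheduler, so short queries are not serialized 
behind longer ones, and optionally pin sessions to pools:

sbin/start-thriftserver.sh \
  --master yarn \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/path/to/fairscheduler.xml

# then, per JDBC session (pool name is illustrative):
SET spark.sql.thriftserver.scheduler.pool=short_queries;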



Re: Secondary Indexing?

2016-05-30 Thread Michael Segel
I have to clarify something… 
In Spark SQL, we can query against both immutable existing RDDs and 
Hive/HBase/MapR-DB, which are mutable. 
So we have to keep this in mind while we are talking about secondary indexing. 
(It's not just RDDs.)

I think the only advantage to being immutable is that once you generate and 
index the RDD, it's not going to change, so the 'correctness' of RI is implicit. 
Here, the issue becomes how long the RDD will live. There is a cost to 
generating the index, which has to be weighed against its usefulness and the 
longevity of the underlying RDD. Since the RDD is typically associated with a 
single Spark context, building indexes may be cost prohibitive. 

At the same time… if you are dealing with a large enough set of data… you will 
have I/O, both network and physical. This is true of both Spark and in-memory 
RDBMSs. It is due to the footprint of the data along with the need to persist 
the data. 

But I digress. 

So in one scenario, we're building our RDDs from a system that has indexing 
available. Is it safe to assume that Spark SQL will take advantage of the 
indexing in the underlying system? (Imagine sourcing data from an Oracle or DB2 
database in order to build RDDs.) If so, then we don't have to worry about 
indexing. 
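
For that first scenario, a hedged PySpark sketch (all connection details are 
placeholders): simple filters on a JDBC-backed DataFrame are pushed down into 
the SQL sent to the source, so the source database can use its own indexes.

# Placeholders throughout; not a specific setup from this thread.
claims = (spark.read.format('jdbc')
          .option('url', 'jdbc:oracle:thin:@//dbhost:1521/svc')
          .option('driver', 'oracle.jdbc.OracleDriver')
          .option('dbtable', 'CLAIMS')
          .option('user', 'app_user')
          .option('password', 'changeme')
          .load())

# This predicate is pushed into the WHERE clause executed by the database,
# which can then use its index on CLAIM_ID.
subset = claims.where("CLAIM_ID = 'ABC-12345'")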

In another scenario, we're joining an RDD against a table in an RDBMS. Is it 
safe to assume that Spark will select data from the database into an RDD prior 
to attempting to do the join? Here, the RDBMS table will use its index when you 
execute the query? (Again, it's an assumption…) Then you have two data sets 
that need to be joined, which leads to the third scenario…

Joining two Spark RDDs. 
Going from memory, it's a hash join. Here the RDD is used to create a hash 
table, which would imply an index on the hash key. So for joins, you wouldn't 
need a secondary index? 
They wouldn't provide any value due to the hash table being created. (And you 
would probably apply the filter while inserting rows into the hash table before 
the join.) 
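
A quick way to see which strategy the planner actually picks (sketch only; the 
10MB figure is just the default spark.sql.autoBroadcastJoinThreshold):

small = spark.range(0, 1000).withColumnRenamed('id', 'key')
big = spark.range(0, 10000000).withColumnRenamed('id', 'key')

# explain() shows BroadcastHashJoin when the small side is estimated to fit
# under the broadcast threshold, and SortMergeJoin otherwise.
big.join(small, 'key').explain()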

Did I just answer my own question? 



> On May 30, 2016, at 10:58 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Just a thought
> 
> Well, in Spark RDDs are immutable, which is an advantage compared to a 
> conventional IMDB like Oracle TimesTen, meaning concurrency is not an issue 
> for certain indexes.
> 
> The overriding optimisation (as there is no physical IO) has to be reducing 
> memory footprint and CPU demands, and using indexes may help for full key 
> lookups. If I recall correctly, in-memory databases support hash indexes and 
> T-tree indexes, which are pretty common in these situations. But there is an 
> overhead in creating indexes on RDDs and, I presume, in parallelizing those 
> indexes.
> 
> With regard to getting data into an RDD from, say, an underlying table in 
> Hive into a temp table: depending on the size of that temp table, one can 
> debate an index on that temp table.
> 
> The question is: what use case do you have in mind?
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn: 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
>  
> 
> On 30 May 2016 at 17:08, Michael Segel <msegel_had...@hotmail.com> wrote:
> I’m not sure where to post this since it’s a bit of a philosophical question 
> in terms of design and vision for Spark.
> 
> If we look at SparkSQL and performance… where does Secondary indexing fit in?
> 
> The reason this is a bit awkward is that if you view Spark as querying RDDs, 
> which are temporary, indexing doesn’t make sense until you consider your use 
> case and how long ‘temporary’ is.
> Then if you consider your RDD result set could be based on querying tables… 
> and you could end up with an inverted table as an index… then indexing could 
> make sense.
> 
> Does it make sense to discuss this in user or dev email lists? Has anyone 
> given this any thought in the past?
> 
> Thx
> 
> -Mike
> 
> 
> 
> 



Secondary Indexing?

2016-05-30 Thread Michael Segel
I’m not sure where to post this since it’s a bit of a philosophical question in 
terms of design and vision for Spark. 

If we look at SparkSQL and performance… where does Secondary indexing fit in? 

The reason this is a bit awkward is that if you view Spark as querying RDDs, 
which are temporary, indexing doesn’t make sense until you consider your use 
case and how long ‘temporary’ is.
Then if you consider your RDD result set could be based on querying tables… and 
you could end up with an inverted table as an index… then indexing could make 
sense. 

Does it make sense to discuss this in user or dev email lists? Has anyone given 
this any thought in the past? 

Thx

-Mike





Indexing of RDDs and DF in 2.0?

2016-05-17 Thread Michael Segel
Hi, 

I saw a replay of a talk about what’s coming in Spark 2.0 and the performance 
improvements… 

I am curious about indexing of data sets. 
In HBase/MapR-DB you can create ordered sets of indexes through an inverted 
table. 
Here, you can take the intersection of the indexes to find the result set of 
rows.  
(Or intersect/null if you have left outer joins…) 
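
A toy sketch of that intersection idea expressed over index DataFrames rather 
than HBase itself (all table and column names are invented):

# Hypothetical: each index table maps an attribute value to the row keys
# that contain it; intersecting key sets avoids scanning the base table.
model_keys = spark.table('idx_by_model').where("model = 'S80'").select('row_key')
color_keys = spark.table('idx_by_color').where("color = 'red'").select('row_key')

candidate_keys = model_keys.intersect(color_keys)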

AFAIK, there was a project on an IndexedRDD, but I’m not sure how far that has 
gone? 

I realize that some of the improvements are based on using hashed joins, which 
would make indexing a bit harder… or am I missing something? 

Thx






Re: Any documentation on Spark's security model beyond YARN?

2016-04-01 Thread Michael Segel
Guys, 

Getting a bit off topic.  

Saying security and HBase in the same sentence is a bit of a joke until HBase 
rejiggers its coprocessors, although Andrew’s fix could be enough to keep CSOs 
and their minions happy.

The larger picture is that security has to stop being a ‘second thought’. Once 
you start getting into restricted and highly restricted data, you will have 
issues, and anything you can do to stop leakage or the potential of leakage 
would be great. 

Getting back to Spark specifically, you have components like the Thrift 
service, which can persist RDDs, and I don’t see any restrictions on access. 

Does this mean integration with Ranger or Sentry? Does it mean rolling a 
separate solution? 

And if you’re going to look at Thrift, do you want to look at other potential 
areas as well? 

Please note: this may all be for nothing. It may be that just having the 
discussion and coming to a conclusion as to the potential risks and how to 
mitigate them is enough. 

Thx

-Mike

> On Mar 31, 2016, at 6:32 AM, Steve Loughran <ste...@hortonworks.com> wrote:
> 
>> 
>> On 30 Mar 2016, at 21:02, Sean Busbey <bus...@cloudera.com> wrote:
>> 
>> On Wed, Mar 30, 2016 at 4:33 AM, Steve Loughran <ste...@hortonworks.com> 
>> wrote:
>>> 
>>>> On 29 Mar 2016, at 22:19, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> So yeah, I know that Spark jobs running on a Hadoop cluster will inherit 
>>>> its security from the underlying YARN job.
>>>> However… that’s not really saying much when you think about some use cases.
>>>> 
>>>> Like using the thrift service …
>>>> 
>>>> I’m wondering what else is new and what people have been thinking about 
>>>> how to enhance spark’s security.
>>>> 
>>> 
>>> Been thinking a bit.
>>> 
>>> One thing to look at is renewal of hbase and hive tokens on long-lived 
>>> services, alongside hdfs
>>> 
>>> 
>> 
>> I've been looking at this as well. The current work-around I'm using
>> is to use keytab logins on the executors, which is less than
>> desirable.
> 
> 
> OK, let's work together on this ... the current Spark renewal code assumes 
> it's only for HDFS (indeed, that the filesystem is HDFS and therefore the # of 
> tokens > 0); there's no fundamental reason why the code in 
> YarnSparkHadoopUtils can't run in the AM too.
> 
>> 
>> Since the HBase project maintains Spark integration points, it'd be
>> great if there were just a hook for services to provide "here's how to
>> renew" to a common renewal service.
>> 
> 
> 1. Wittenauer is doing some work on a tool for doing this; I'm pushing for it 
> to be a fairly generic API. Even if Spark has to use reflection to get at it, 
> at least it would be consistent across services. See 
> https://issues.apache.org/jira/browse/HADOOP-12563 
> <https://issues.apache.org/jira/browse/HADOOP-12563>
> 
> 2. The topic of HTTPS-based acquisition/use of HDFS tokens has arisen 
> elsewhere; needed for long-haul job submission when you don't have a keytab 
> to hand. This could be useful as it'd avoid actually needing hbase-*.jar on 
> the classpath at submit time.
> 
> 
>> 
>> 
>> -- 
>> busbey
>> 
>> 
>> 
> 
> 
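
For reference, the keytab mechanism this discussion builds on is spark-submit's 
--principal/--keytab support for long-running YARN applications (principal and 
paths below are placeholders); as noted above, at the time it renewed HDFS 
delegation tokens only, which is why Hive/HBase token renewal needed separate 
treatment:

spark-submit \
  --master yarn --deploy-mode cluster \
  --principal etl_user@EXAMPLE.COM \
  --keytab /etc/security/keytabs/etl_user.keytab \
  --class com.example.LongRunningJob app.jar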


Any documentation on Spark's security model beyond YARN?

2016-03-29 Thread Michael Segel
Hi, 

So yeah, I know that Spark jobs running on a Hadoop cluster will inherit their 
security from the underlying YARN job. 
However… that’s not really saying much when you think about some use cases. 

Like using the thrift service … 

I’m wondering what else is new and what people have been thinking about how to 
enhance spark’s security. 

Thx

-Mike





Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
Hi, 

I’m looking at the online docs for building spark 1.4.1 … 

http://spark.apache.org/docs/latest/building-spark.html 

I was interested in building Spark for Scala 2.11 (the latest Scala) and also 
with Hive and JDBC support. 

The docs say:
“
To produce a Spark package compiled with Scala 2.11, use the -Dscala-2.11 
property:
dev/change-version-to-2.11.sh
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
“ 
So… 
Is there a reason I shouldn’t build against hadoop-2.6? 

If I want to add the Thrift and Hive support, is it possible? 
Looking at the Scala build, it looks like Hive support is being built? 
(Looking at the stdout messages…)
Should the docs be updated? Am I missing something? 
(Dean W. can confirm, I am completely brain dead. ;-) 
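
For what it’s worth, the Hive and Thrift/JDBC server pieces are enabled with 
their own profiles, so a build along those lines would look roughly like the 
following (untested sketch; note that the 1.4 docs caution that the JDBC 
component was not yet supported in Scala 2.11 builds):

mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 \
    -Phive -Phive-thriftserver -DskipTests clean package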

Thx

-Mike
PS. Yes, I can probably download a prebuilt image, but I’m a glutton for 
punishment. ;-)