Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Matei Zaharia
I'm pretty sure inner joins on Spark SQL already build only one of the sides. 
Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer 
joins do both, and it seems like we could optimize it for those that are not 
full.
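
For illustration, here is a minimal sketch (plain Scala, not Spark's actual
operator code) of why a non-full outer join only needs a hash table on the
build side; the streamed side is scanned once and unmatched rows fall out
naturally:

// Sketch only: a left outer join that builds a hash table on the right
// (build) side and streams the left side once.
def leftOuterJoin[K, L, R](
    left: Iterator[(K, L)],
    right: Iterator[(K, R)]): Iterator[(K, (L, Option[R]))] = {
  // Build phase: a single hash table, on the right side only.
  val buildTable: Map[K, Seq[R]] =
    right.toSeq.groupBy(_._1).mapValues(_.map(_._2)).toMap
  // Probe phase: stream the left side; a miss yields (l, None) directly, so
  // no rescan of the right side and no second hash table is needed.
  left.flatMap { case (k, l) =>
    buildTable.get(k) match {
      case Some(rs) => rs.iterator.map(r => (k, (l, Some(r))))
      case None     => Iterator((k, (l, None)))
    }
  }
}

A full outer join is the harder case, since unmatched rows from the build side
also have to be emitted, which is why it keeps state for both sides.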

Matei


On Oct 7, 2014, at 11:04 PM, Haopu Wang  wrote:

> Liquan, yes, for a full outer join, hash tables on both sides are more
> efficient.
>  
> For the left/right outer join, it looks like one hash table should be enough.
>  
> From: Liquan Pei [mailto:liquan...@gmail.com] 
> Sent: September 30, 2014 18:34
> To: Haopu Wang
> Cc: dev@spark.apache.org; user
> Subject: Re: Spark SQL question: why build hashtable for both sides in 
> HashOuterJoin?
>  
> Hi Haopu,
>  
> How about full outer join? One hash table may not be efficient for this case. 
>  
> Liquan
>  
> On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang  wrote:
> Hi, Liquan, thanks for the response.
>  
> In your example, I think the hash table should be built on the "right" side, 
> so Spark can iterate through the left side and find matches in the right side 
> from the hash table efficiently. Please comment and suggest, thanks again!
>  
> From: Liquan Pei [mailto:liquan...@gmail.com] 
> Sent: September 30, 2014 12:31
> To: Haopu Wang
> Cc: dev@spark.apache.org; user
> Subject: Re: Spark SQL question: why build hashtable for both sides in 
> HashOuterJoin?
>  
> Hi Haopu,
>  
> My understanding is that the hash tables on both the left and right sides are
> used to include null values in the result efficiently. If a hash table is
> only built on one side, say the left side, and we perform a left outer join,
> then for each row on the left side a scan over the right side is needed to
> make sure that there are no matching tuples for that row.
>  
> Hope this helps!
> Liquan
>  
> On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang  wrote:
> I took a look at HashOuterJoin and it builds a hash table for both
> sides.
> 
> This consumes quite a lot of memory when the partition is big. And it
> doesn't reduce the iteration over the streamed relation, right?
> 
> Thanks!
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 
> 
>  
> -- 
> Liquan Pei 
> Department of Physics 
> University of Massachusetts Amherst
> 
> 
>  
> -- 
> Liquan Pei 
> Department of Physics 
> University of Massachusetts Amherst



Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Liquan Pei
I am working on a PR to leverage the HashJoin trait code to optimize the
left/right outer join. It has already been tested locally, and I will send out
the PR soon after some cleanup.

Thanks,
Liquan

On Wed, Oct 8, 2014 at 12:09 AM, Matei Zaharia 
wrote:

> I'm pretty sure inner joins on Spark SQL already build only one of the
> sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators.
> Only outer joins do both, and it seems like we could optimize it for those
> that are not full.
>
> Matei
>
>
>
> On Oct 7, 2014, at 11:04 PM, Haopu Wang  wrote:
>
> Liquan, yes, for a full outer join, hash tables on both sides are more
> efficient.
>
> For the left/right outer join, it looks like one hash table should be
> enough.
>
> --
> *From:* Liquan Pei [mailto:liquan...@gmail.com ]
> *Sent:* September 30, 2014 18:34
> *To:* Haopu Wang
> *Cc:* dev@spark.apache.org; user
> *Subject:* Re: Spark SQL question: why build hashtable for both sides in
> HashOuterJoin?
>
> Hi Haopu,
>
> How about full outer join? One hash table may not be efficient for this
> case.
>
> Liquan
>
> On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang  wrote:
> Hi, Liquan, thanks for the response.
>
> In your example, I think the hash table should be built on the "right"
> side, so Spark can iterate through the left side and find matches in the
> right side from the hash table efficiently. Please comment and suggest,
> thanks again!
>
> --
> *From:* Liquan Pei [mailto:liquan...@gmail.com]
> *Sent:* September 30, 2014 12:31
> *To:* Haopu Wang
> *Cc:* dev@spark.apache.org; user
> *Subject:* Re: Spark SQL question: why build hashtable for both sides in
> HashOuterJoin?
>
> Hi Haopu,
>
> My understanding is that the hash tables on both the left and right sides are
> used to include null values in the result efficiently. If a hash table is
> only built on one side, say the left side, and we perform a left outer join,
> then for each row on the left side a scan over the right side is needed to
> make sure that there are no matching tuples for that row.
>
> Hope this helps!
> Liquan
>
> On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang  wrote:
>
> I took a look at HashOuterJoin and it builds a hash table for both
> sides.
>
> This consumes quite a lot of memory when the partition is big. And it
> doesn't reduce the iteration over the streamed relation, right?
>
> Thanks!
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst
>
>
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst
>
>
>


-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst


Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Jianshi Huang
OK, currently there's cost-based optimization; however, Parquet statistics are
not implemented...

What's a good way to join a big fact table with several tiny
dimension tables in Spark SQL (1.1)?

I wish we could allow a user hint for the join.
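
In the meantime, the only knob in 1.1 seems to be the threshold discussed
below. A rough sketch (the table names and the 100 MB figure here are made up,
not from this thread):

// Sketch only: raise the auto-broadcast threshold so the tiny dimension
// tables fall under it. Names and the size figure are illustrative.
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
hc.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

// The planner still needs size statistics for the dimension tables (e.g.
// Hive tables with computed statistics) to decide they are broadcastable.
val joined = hc.sql(
  "SELECT f.*, d.name FROM fact f JOIN dim d ON f.dim_id = d.id")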

Jianshi

On Wed, Oct 8, 2014 at 2:18 PM, Jianshi Huang 
wrote:

> Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not merged
> into master?
>
> I cannot find spark.sql.hints.broadcastTables in latest master, but it's
> in the following patch.
>
>
> https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5
>
>
> Jianshi
>
>
> On Mon, Sep 29, 2014 at 1:24 AM, Jianshi Huang 
> wrote:
>
>> Yes, looks like it can only be controlled by the
>> parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird
>> to me.
>>
>> How am I supposed to know the exact size of a table in bytes? Letting me
>> specify the preferred join algorithm would be better, I think.
>>
>> Jianshi
>>
>> On Sun, Sep 28, 2014 at 11:57 PM, Ted Yu  wrote:
>>
>>> Have you looked at SPARK-1800 ?
>>>
>>> e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
>>> Cheers
>>>
>>> On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang 
>>> wrote:
>>>
 I cannot find it in the documentation. And I have a dozen dimension
 tables to (left) join...


 Cheers,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/

>>>
>>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Standardized Distance Functions in MLlib

2014-10-08 Thread Yu Ishikawa
Hi all, 

In my limited understanding of MLlib, it would be a good idea to support the
various distance functions in some machine learning algorithms. For example,
we can currently only use the Euclidean distance metric in KMeans. I am also
working on contributing hierarchical clustering to MLlib
(https://issues.apache.org/jira/browse/SPARK-2429), and I would like to support
the various distance functions in it.

Should we support standardized distance functions in MLlib or not?
As you know, Spark depends on Breeze, so I think we have two approaches for
using distance functions in MLlib. One is implementing the distance
functions in MLlib itself. The other is wrapping the functions from Breeze. I am
a bit worried about using Breeze directly in Spark; for example, we can't
control the release of Breeze.
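
To make the wrapping option concrete, here is a rough sketch of what a thin
wrapper could look like (the trait and object names are made up, not an actual
MLlib API; it assumes the distance functions contributed to Breeze 0.10):

// Sketch only; names are illustrative, not an existing MLlib interface.
import breeze.linalg.DenseVector
import breeze.linalg.functions.euclideanDistance

trait DistanceMeasure extends Serializable {
  def apply(v1: DenseVector[Double], v2: DenseVector[Double]): Double
}

object EuclideanDistanceMeasure extends DistanceMeasure {
  // Delegate to Breeze instead of re-implementing the math in MLlib.
  def apply(v1: DenseVector[Double], v2: DenseVector[Double]): Double =
    euclideanDistance(v1, v2)
}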

I sent a PR before, but it has stalled. I'd like to get your thoughts on it,
community.
https://github.com/apache/spark/pull/1964#issuecomment-54953348

Best,



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Standardized-Distance-Functions-in-MLlib-tp8697.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Unneeded branches/tags

2014-10-08 Thread Nicholas Chammas
So:

   - tags: can delete
   - branches: stuck with ‘em

Correct?

Nick
​

On Wed, Oct 8, 2014 at 1:52 AM, Patrick Wendell  wrote:

> Actually - weirdly - we can delete old tags and it works with the
> mirroring. Nick if you put together a list of un-needed tags I can
> delete them.
>
> On Tue, Oct 7, 2014 at 6:27 PM, Reynold Xin  wrote:
> > Those branches are no longer active. However, I don't think we can delete
> > branches from github due to the way ASF mirroring works. I might be wrong
> > there.
> >
> >
> >
> > On Tue, Oct 7, 2014 at 6:25 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com
> >> wrote:
> >
> >> Just curious: Are there branches and/or tags on the repo that we don't
> need
> >> anymore?
> >>
> >> What are the scala-2.9 and streaming branches for, for example? And do
> we
> >> still need branches for older versions of Spark that we are not
> backporting
> >> stuff to, like branch-0.5?
> >>
> >> Nick
> >>
> >>
>


Re: Spark on Mesos 0.20

2014-10-08 Thread RJ Nowling
Yep!  That's the example I was talking about.

Is an error message printed when it hangs? I get :

14/09/30 13:23:14 ERROR BlockManagerMasterActor: Got two different
block manager registrations on 20140930-131734-1723727882-5050-1895-1



On Tue, Oct 7, 2014 at 8:36 PM, Fairiz Azizi  wrote:

> Sure, could you point me to the example?
>
> The only thing I could find was
>
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/LogQuery.scala
>
> So do you mean running it like:
>MASTER="mesos://xxx*:5050*" ./run-example LogQuery
>
> I tried that and I can see the job run and the tasks complete on the slave
> nodes, but the client process seems to hang forever, it's probably a
> different problem. BTW, only a dozen or so tasks kick off.
>
> I actually haven't done much with Scala and Spark (it's been all python).
>
> Fi
>
>
>
> Fairiz "Fi" Azizi
>
> On Tue, Oct 7, 2014 at 6:29 AM, RJ Nowling  wrote:
>
>> I was able to reproduce it on a small 4 node cluster (1 mesos master and
>> 3 mesos slaves) with relatively low-end specs.  As I said, I just ran the
>> log query examples with the fine-grained mesos mode.
>>
>> Spark 1.1.0 and mesos 0.20.1.
>>
>> Fairiz, could you try running the logquery example included with Spark
>> and see what you get?
>>
>> Thanks!
>>
>> On Mon, Oct 6, 2014 at 8:07 PM, Fairiz Azizi  wrote:
>>
>>> That's what's great about Spark, the community is so active! :)
>>>
>>> I compiled Mesos 0.20.1 from the source tarball.
>>>
>>> Using the Mapr3 Spark 1.1.0 distribution from the Spark downloads page
>>>  (spark-1.1.0-bin-mapr3.tgz).
>>>
>>> I see no problems for the workloads we are trying.
>>>
>>> However, the cluster is small (less than 100 cores across 3 nodes).
>>>
>>> The workloads read in just a few gigabytes from HDFS, via an IPython
>>> notebook Spark shell.
>>>
>>> thanks,
>>> Fi
>>>
>>>
>>>
>>> Fairiz "Fi" Azizi
>>>
>>> On Mon, Oct 6, 2014 at 9:20 AM, Timothy Chen  wrote:
>>>
 Ok I created SPARK-3817 to track this, will try to repro it as well.

 Tim

 On Mon, Oct 6, 2014 at 6:08 AM, RJ Nowling  wrote:
 > I've recently run into this issue as well. I get it from running Spark
 > examples such as log query.  Maybe that'll help reproduce the issue.
 >
 >
 > On Monday, October 6, 2014, Gurvinder Singh <
 gurvinder.si...@uninett.no>
 > wrote:
 >>
 >> The issue does not occur if the task at hand has small number of map
 >> tasks. I have a task which has 978 map tasks and I see this error as
 >>
 >> 14/10/06 09:34:40 ERROR BlockManagerMasterActor: Got two different block
 >> manager registrations on 20140711-081617-711206558-5050-2543-5
 >>
 >> Here is the log from the mesos-slave where this container was running.
 >>
 >> http://pastebin.com/Q1Cuzm6Q
 >>
 >> If you look at the code where the error is produced by Spark, you will
 >> see that it simply exits, saying in the comments "this should never
 >> happen, lets just quit" :-)
 >>
 >> - Gurvinder
 >> On 10/06/2014 09:30 AM, Timothy Chen wrote:
 >> > (Hit enter too soon...)
 >> >
 >> > What is your setup and steps to repro this?
 >> >
 >> > Tim
 >> >
 >> > On Mon, Oct 6, 2014 at 12:30 AM, Timothy Chen 
 wrote:
 >> >> Hi Gurvinder,
 >> >>
 >> >> I tried fine grain mode before and didn't get into that problem.
 >> >>
 >> >>
 >> >> On Sun, Oct 5, 2014 at 11:44 PM, Gurvinder Singh
 >> >>  wrote:
 >> >>> On 10/06/2014 08:19 AM, Fairiz Azizi wrote:
 >>  The Spark online docs indicate that Spark is compatible with Mesos
 >>  0.18.1
 >> 
 >>  I've gotten it to work just fine on 0.18.1 and 0.18.2
 >> 
 >>  Has anyone tried Spark on a newer version of Mesos, i.e. Mesos
 >>  v0.20.0?
 >> 
 >>  -Fi
 >> 
 >> >>> Yeah, we are using Spark 1.1.0 with Mesos 0.20.1. It runs fine in
 >> >>> coarse-grained mode; in fine-grained mode there is an issue with
 >> >>> block manager name conflicts. I have been waiting for it to be fixed
 >> >>> but it is still there.
 >> >>>
 >> >>> -Gurvinder
 >> >>>
 >> >>>
 -
 >> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 >> >>> For additional commands, e-mail: dev-h...@spark.apache.org
 >> >>>
 >>
 >>
 >> -
 >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 >> For additional commands, e-mail: dev-h...@spark.apache.org
 >>
 >
 >
 > --
 > em rnowl...@gmail.com
 > c 954.496.2314

>>>
>>>
>>
>>
>> --
>> em rnowl...@gmail.com
>> c 954.496.2314
>>
>
>


-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Extending Scala style checks

2014-10-08 Thread Nicholas Chammas
I've created SPARK-3849: Automate remaining Scala style rules
.

Please create sub-tasks on this issue for rules that we have not automated
and let's work through them as possible.

I went ahead and created the first sub-task, SPARK-3850: Scala style:
Disallow trailing spaces .

Nick

On Tue, Oct 7, 2014 at 4:45 PM, Nicholas Chammas  wrote:

> For starters, do we have a list of all the Scala style rules that are
> currently not enforced automatically but are likely well-suited for
> automation?
>
> Let's put such a list together in a JIRA issue and work through
> implementing them.
>
> Nick
>
> On Thu, Oct 2, 2014 at 12:06 AM, Cheng Lian  wrote:
>
>> Since we can easily catch the list of all changed files in a PR, I think
>> we can start with adding the no trailing space check for newly changed
>> files only?
>>
>>
>> On 10/2/14 9:24 AM, Nicholas Chammas wrote:
>>
>>> Yeah, I remember that hell when I added PEP 8 to the build checks and
>>> fixed
>>> all the outstanding Python style issues. I had to keep rebasing and
>>> resolving merge conflicts until the PR was merged.
>>>
>>> It's a rough process, but thankfully it's also a one-time process. I
>>> might
>>> be able to help with that in the next week or two if no-one else wants to
>>> pick it up.
>>>
>>> Nick
>>>
>>> On Wed, Oct 1, 2014 at 9:20 PM, Michael Armbrust >> >
>>> wrote:
>>>
>>>  The hard part here is updating the existing code base... which is going
 to
 create merge conflicts with like all of the open PRs...

 On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

  Ah, since there appears to be a built-in rule for end-of-line
> whitespace,
> Michael and Cheng, y'all should be able to add this in pretty easily.
>
> Nick
>
> On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell 
> wrote:
>
>  Hey Nick,
>>
>> We can always take built-in rules. Back when we added this Prashant
>> Sharma actually did some great work that lets us write our own style
>> rules in cases where rules don't exist.
>>
>> You can see some existing rules here:
>>
>>
>>  https://github.com/apache/spark/tree/master/project/
> spark-style/src/main/scala/org/apache/spark/scalastyle
>
>> Prashant has over time contributed a lot of our custom rules upstream
>> to scalastyle, so now there are only a couple there.
>>
>> - Patrick
>>
>> On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu  wrote:
>>
>>> Please take a look at WhitespaceEndOfLineChecker under:
>>> http://www.scalastyle.org/rules-0.1.0.html
>>>
>>> Cheers
>>>
>>> On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas <
>>>
>> nicholas.cham...@gmail.com
>>
>>> wrote:
 As discussed here , it

>>> would be
>>
>>> good to extend our Scala style checks to programmatically enforce as

>>> many
>>
>>> of our style rules as possible.

 Does anyone know if it's relatively straightforward to enforce

>>> additional
>>
>>> rules like the "no trailing spaces" rule mentioned in the linked PR?

 Nick



>>
>


will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread James Yu
Didn't see anyone ask the question before, but I was wondering if anyone
knows if Spark/SparkSQL will support the ORCFile format soon? ORCFile is
getting more and more popular in the Hive world.

Thanks,
James


Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread Evan Chan
James,

Michael at the meetup last night said there was some development
activity around ORCFiles.

I'm curious though, what are the pros and cons of ORCFiles vs Parquet?

On Wed, Oct 8, 2014 at 10:03 AM, James Yu  wrote:
> Didn't see anyone ask the question before, but I was wondering if anyone
> knows if Spark/SparkSQL will support the ORCFile format soon? ORCFile is
> getting more and more popular in the Hive world.
>
> Thanks,
> James

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread Mark Hamstra
https://github.com/apache/spark/pull/2576



On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan  wrote:

> James,
>
> Michael at the meetup last night said there was some development
> activity around ORCFiles.
>
> I'm curious though, what are the pros and cons of ORCFiles vs Parquet?
>
> On Wed, Oct 8, 2014 at 10:03 AM, James Yu  wrote:
> > Didn't see anyone ask the question before, but I was wondering if
> anyone
> > knows if Spark/SparkSQL will support the ORCFile format soon? ORCFile is
> > getting more and more popular in the Hive world.
> >
> > Thanks,
> > James
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Standardized Distance Functions in MLlib

2014-10-08 Thread Xiangrui Meng
Hi Yu,

We upgraded Breeze to 0.10 yesterday, so we can easily call the distance
functions you contributed to Breeze. We don't want to maintain
another copy of the implementation in MLlib, to keep the maintenance
cost low. Both Spark and Breeze are open-source projects. We should
try our best to avoid duplicate effort and forking, even though we
don't control the release of Breeze.

As we discussed in the PR, if we want users to call them directly,
they should live in breeze. If we want users to specify them in
clustering algorithms, we should hide the implementation from users.
So simple wrappers over the breeze implementation should be
sufficient. We are reviewing

https://github.com/apache/spark/pull/2634

and trying to see how we can embed distance measures there. In the
k-means implementation, we don't use a (Vector, Vector) => Double function.
Instead, we cache the norms and use the inner product to derive the
distance, which is faster and takes advantage of sparsity. It would be
really nice if you could help review it and discuss how to embed
distance measures there. Thanks!
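
For reference, a small sketch of the norm-caching trick (illustrative only; it
uses a naive dense dot product where the real k-means code uses BLAS and
exploits sparsity):

// ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * (x dot y), so with the norms cached
// only the dot product has to be computed per pair of vectors.
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def squaredDistanceWithNorms(x: Vector, xNorm: Double,
                             y: Vector, yNorm: Double): Double = {
  // Naive dense dot product as a stand-in for the internal BLAS call.
  val dot = x.toArray.zip(y.toArray).map { case (a, b) => a * b }.sum
  xNorm * xNorm + yNorm * yNorm - 2.0 * dot
}

val x = Vectors.dense(1.0, 2.0, 3.0)
val y = Vectors.dense(4.0, 5.0, 6.0)
val norm = (v: Vector) => math.sqrt(v.toArray.map(a => a * a).sum)
println(squaredDistanceWithNorms(x, norm(x), y, norm(y)))  // 27.0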

Best,
Xiangrui

On Wed, Oct 8, 2014 at 4:19 AM, Yu Ishikawa
 wrote:
> Hi all,
>
> In my limited understanding of MLlib, it would be a good idea to support the
> various distance functions in some machine learning algorithms. For example,
> we can currently only use the Euclidean distance metric in KMeans. I am also
> working on contributing hierarchical clustering to MLlib
> (https://issues.apache.org/jira/browse/SPARK-2429), and I would like to support
> the various distance functions in it.
>
> Should we support standardized distance functions in MLlib or not?
> As you know, Spark depends on Breeze, so I think we have two approaches for
> using distance functions in MLlib. One is implementing the distance
> functions in MLlib itself. The other is wrapping the functions from Breeze. I am
> a bit worried about using Breeze directly in Spark; for example, we can't
> control the release of Breeze.
>
> I sent a PR before, but it has stalled. I'd like to get your thoughts on it,
> community.
> https://github.com/apache/spark/pull/1964#issuecomment-54953348
>
> Best,
>
>
>
> -
> -- Yu Ishikawa
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Standardized-Distance-Functions-in-MLlib-tp8697.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Michael Armbrust
Thanks for the input.  We purposefully made sure that the config option did
not make it into a release, as it is not something that we are willing to
support long term.  That said, we'll try to make this easier in the future,
either through hints or better support for statistics.

In this particular case you can get what you want by registering the tables
as external tables and setting a flag.  Here's a helper function to do
what you need.

/**
 * Sugar for creating a Hive external table from a parquet path.
 */
def createParquetTable(name: String, file: String): Unit = {
  import org.apache.spark.sql.hive.HiveMetastoreTypes

  val rdd = parquetFile(file)
  val schema = rdd.schema.fields.map(f =>
    s"${f.name} ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
  val ddl = s"""
|CREATE EXTERNAL TABLE $name (
|  $schema
|)
|ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
|STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
|OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
|LOCATION '$file'""".stripMargin
  sql(ddl)
  setConf("spark.sql.hive.convertMetastoreParquet", "true")
}

You'll also need to run this to populate the statistics:

ANALYZE TABLE tableName COMPUTE STATISTICS noscan;
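
A hypothetical end-to-end usage of the helper (the path and table names here
are made up):

// Hypothetical usage of the helper above.
createParquetTable("dim_products", "/data/warehouse/dim_products.parquet")
sql("ANALYZE TABLE dim_products COMPUTE STATISTICS noscan")

// With statistics populated and the table size under the auto-broadcast
// threshold, the planner can broadcast the small table to join the fact table.
sql("SELECT f.*, d.name FROM fact_table f JOIN dim_products d ON f.product_id = d.id")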


On Wed, Oct 8, 2014 at 1:44 AM, Jianshi Huang 
wrote:

> OK, currently there's cost-based optimization; however, Parquet statistics
> are not implemented...
>
> What's a good way to join a big fact table with several tiny
> dimension tables in Spark SQL (1.1)?
>
> I wish we could allow a user hint for the join.
>
> Jianshi
>
> On Wed, Oct 8, 2014 at 2:18 PM, Jianshi Huang 
> wrote:
>
>> Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not
>> merged into master?
>>
>> I cannot find spark.sql.hints.broadcastTables in latest master, but it's
>> in the following patch.
>>
>>
>> https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5
>>
>>
>> Jianshi
>>
>>
>> On Mon, Sep 29, 2014 at 1:24 AM, Jianshi Huang 
>> wrote:
>>
>>> Yes, looks like it can only be controlled by the
>>> parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird
>>> to me.
>>>
>>> How am I supposed to know the exact size of a table in bytes? Letting me
>>> specify the preferred join algorithm would be better, I think.
>>>
>>> Jianshi
>>>
>>> On Sun, Sep 28, 2014 at 11:57 PM, Ted Yu  wrote:
>>>
 Have you looked at SPARK-1800 ?

 e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
 Cheers

 On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang >>> > wrote:

> I cannot find it in the documentation. And I have a dozen dimension
> tables to (left) join...
>
>
> Cheers,
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>


>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>


Re: Parquet schema migrations

2014-10-08 Thread Michael Armbrust
>
> The kind of change we've made that probably makes the most sense to support
> is adding a nullable column. I think that also implies supporting
> "removing" a nullable column, as long as you don't end up with columns of
> the same name but different types.
>

Filed here: https://issues.apache.org/jira/browse/SPARK-3851


> I'm not sure semantically that it makes sense to do schema merging as part
> of union all, and it definitely doesn't make sense to do it by default.  I
> wouldn't want two accidentally compatible schemas to get merged without
> warning.  It's also a little odd since, unlike in a normal SQL database, union
> all can happen before there are any projections or filters... e.g. what
> order do columns come back in if someone does select *?
>

I was proposing that you manually convert each different format into one unified
format (by adding literal nulls and such for missing columns) and then
union these converted datasets.  It would be weird to have union all try
to do this automatically.
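
A small sketch of that manual approach (table and column names are invented):
each version is projected to one target schema, with missing columns filled by
typed NULL literals, and the projections are then unioned.

// Sketch only; assumes a HiveContext whose sql method is in scope.
val unified = sql("""
  SELECT id, price, CAST(NULL AS STRING) AS category FROM events_v1
  UNION ALL
  SELECT id, price, category FROM events_v2
""")
unified.registerTempTable("events_unified")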


spark-ec2 can't initialize spark-standalone module

2014-10-08 Thread Nicholas Chammas
This line

in setup.sh initializes several modules, which are defined here

.

# Install / Init module
for module in $MODULES; do
  echo "Initializing $module"
  if [[ -e $module/init.sh ]]; then
source $module/init.sh
  fi
  cd /root/spark-ec2  # guard against init.sh changing the cwd
done

One of these modules is spark-standalone. However, it does not have an
init.sh file

.

Should it have one? It’s the only module without an init.sh.

Nick
​


Re: Parquet schema migrations

2014-10-08 Thread Cody Koeninger
On Wed, Oct 8, 2014 at 3:19 PM, Michael Armbrust 
wrote:

>
> I was proposing you manually convert each different format into one
> unified format  (by adding literal nulls and such for missing columns) and
> then union these converted datasets.  It would be weird to have union all
> try and do this automatically.
>


Sure, I was just musing on what an API for doing the merging without manual
user input should look like / do.  I'll comment on the ticket; thanks for
making it.


Fwd: Accumulator question

2014-10-08 Thread Nathan Kronenfeld
I notice that accumulators register themselves with a private Accumulators
object.

I don't notice any way to unregister them when one is done.

Am I missing something? If not, is there any plan for how to free up that
memory?

I've a case where we're gathering data from repeated queries using some
relatively sizable accumulators; at the moment, we're creating one per
query, and running out of memory after far too few queries.

I've tried methods that don't involve accumulators; they involve a shuffle
instead, and take 10x as long.
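
In case it helps, here is roughly the shape of the pattern (the `queries` and
`runQuery` names are made-up placeholders, not our real code):

// One sizable accumulator per query; nothing unregisters it afterwards, so
// per the observation above the driver-side Accumulators object keeps it.
import org.apache.spark.AccumulatorParam

object StringSetParam extends AccumulatorParam[Set[String]] {
  def zero(initial: Set[String]): Set[String] = Set.empty
  def addInPlace(a: Set[String], b: Set[String]): Set[String] = a ++ b
}

queries.foreach { q =>
  val seen = sc.accumulator(Set.empty[String])(StringSetParam)  // one per query
  runQuery(q, seen)  // tasks add to `seen`; we read it once on the driver
}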

Thanks,
  -Nathan




-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


new jenkins update + tentative release date

2014-10-08 Thread shane knapp
greetings!

i've got some updates regarding our new jenkins infrastructure, as well as
the initial date and plan for rolling things out:

*** current testing/build break whack-a-mole:
a lot of out of date artifacts are cached in the current jenkins, which has
caused a few builds during my testing to break due to dependency resolution
failure[1][2].

bumping these versions can cause your builds to fail, due to public api
changes and the like.  consider yourself warned that some projects might
require some debugging...  :)

tomorrow, i will be at databricks working w/@joshrosen to make sure that
the spark builds have any bugs hammered out.

***  deployment plan:
unless something completely horrible happens, THE NEW JENKINS WILL GO LIVE
ON MONDAY (october 13th).

all jenkins infrastructure will be DOWN for the entirety of the day
(starting at ~8am).  this means no builds, period.  i'm hoping that the
downtime will be much shorter than this, but we'll have to see how
everything goes.

all test/build history WILL BE PRESERVED.  i will be rsyncing the jenkins
jobs/ directory over, complete w/history as part of the deployment.

once i'm feeling good about the state of things, i'll point the original
url to the new instances and send out an all clear.

if you are a student at UC berkeley, you can log in to jenkins using your
LDAP login, and (by default) view but not change plans.  if you do not have
a UC berkeley LDAP login, you can still view plans anonymously.

IF YOU ARE A PLAN ADMIN, THEN PLEASE REACH OUT, ASAP, PRIVATELY AND I WILL
SET UP ADMIN ACCESS TO YOUR BUILDS.

***  post deployment plan:
fix all of the things that break!

i will be keeping a VERY close eye on the builds, checking for breaks, and
helping out where i can.  if the situation is dire, i can always roll back
to the old jenkins infra...  but i hope we never get to that point!  :)

i'm hoping that things will go smoothly, but please be patient as i'm
certain we'll hit a few bumps in the road.

please let me know if you guys have any comments/questions/concerns...  :)

shane

1 - https://github.com/bigdatagenomics/bdg-services/pull/18
2 - https://github.com/bigdatagenomics/avocado/pull/111


Re: Standardized Distance Functions in MLlib

2014-10-08 Thread Yu Ishikawa
Hi Xiangrui, 

Thank you very much for replying and letting me know that you upgraded
breeze to 0.10 yesterday.
Sorry that I didn't know that.

> We don't want to maintain
> another copy of the implementation in MLlib, to keep the maintenance
> cost low. Both Spark and Breeze are open-source projects. We should
> try our best to avoid duplicate effort and forking, even though we
> don't control the release of Breeze.

I got it. I agree with keeping linear algebra in MLlib lightweight.

> It would be really nice if you could help review it and discuss how to embed
> distance measures there.

All right. I will check it.

thanks,
Yu Ishikawa



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Standardized-Distance-Functions-in-MLlib-tp8697p8711.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread James Yu
Thanks Mark! I will keep an eye on it.

@Evan, I saw people use both formats, so I really want to have Spark support
ORCFile.


On Wed, Oct 8, 2014 at 11:12 AM, Mark Hamstra 
wrote:

> https://github.com/apache/spark/pull/2576
>
>
>
> On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan 
> wrote:
>
>> James,
>>
>> Michael at the meetup last night said there was some development
>> activity around ORCFiles.
>>
>> I'm curious though, what are the pros and cons of ORCFiles vs Parquet?
>>
>> On Wed, Oct 8, 2014 at 10:03 AM, James Yu  wrote:
>> > Didn't see anyone ask the question before, but I was wondering if
>> anyone
>> > knows if Spark/SparkSQL will support the ORCFile format soon? ORCFile is
>> > getting more and more popular in the Hive world.
>> >
>> > Thanks,
>> > James
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: spark-ec2 can't initialize spark-standalone module

2014-10-08 Thread Shivaram Venkataraman
There is a check to see if the init.sh file exists (`if [[ -e
$module/init.sh ]]; then`), so it just won't get called. Regarding
spark-standalone not having an init.sh: that is because we don't have any
initialization work to do for it (it's not necessary for all modules
to have an init.sh), as the spark module downloads and installs Spark.

Thanks
Shivaram

On Wed, Oct 8, 2014 at 2:50 PM, Nicholas Chammas
 wrote:
> This line
> 
> in setup.sh initializes several modules, which are defined here
> 
> .
>
> # Install / Init module
> for module in $MODULES; do
>   echo "Initializing $module"
>   if [[ -e $module/init.sh ]]; then
> source $module/init.sh
>   fi
>   cd /root/spark-ec2  # guard against init.sh changing the cwd
> done
>
> One of these modules is spark-standalone. However, it does not have an
> init.sh file
> 
> .
>
> Should it have one? It’s the only module without an init.sh.
>
> Nick
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread Cheng Lian
The foreign data source API PR also matters here:
https://www.github.com/apache/spark/pull/2475


Foreign data sources like ORC can be added more easily and systematically
after this PR is merged.


On 10/9/14 8:22 AM, James Yu wrote:

Thanks Mark! I will keep an eye on it.

@Evan, I saw people use both formats, so I really want to have Spark support
ORCFile.


On Wed, Oct 8, 2014 at 11:12 AM, Mark Hamstra 
wrote:


https://github.com/apache/spark/pull/2576



On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan 
wrote:


James,

Michael at the meetup last night said there was some development
activity around ORCFiles.

I'm curious though, what are the pros and cons of ORCFiles vs Parquet?

On Wed, Oct 8, 2014 at 10:03 AM, James Yu  wrote:

Didn't see anyone ask the question before, but I was wondering if anyone
knows if Spark/SparkSQL will support the ORCFile format soon? ORCFile is
getting more and more popular in the Hive world.

Thanks,
James

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Extending Scala style checks

2014-10-08 Thread Reynold Xin
Thanks. I added one.


On Wed, Oct 8, 2014 at 8:49 AM, Nicholas Chammas  wrote:

> I've created SPARK-3849: Automate remaining Scala style rules
> .
>
> Please create sub-tasks on this issue for rules that we have not automated
> and let's work through them as possible.
>
> I went ahead and created the first sub-task, SPARK-3850: Scala style:
> Disallow trailing spaces  >.
>
> Nick
>
> On Tue, Oct 7, 2014 at 4:45 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com
> > wrote:
>
> > For starters, do we have a list of all the Scala style rules that are
> > currently not enforced automatically but are likely well-suited for
> > automation?
> >
> > Let's put such a list together in a JIRA issue and work through
> > implementing them.
> >
> > Nick
> >
> > On Thu, Oct 2, 2014 at 12:06 AM, Cheng Lian 
> wrote:
> >
> >> Since we can easily catch the list of all changed files in a PR, I think
> >> we can start with adding the no trailing space check for newly changed
> >> files only?
> >>
> >>
> >> On 10/2/14 9:24 AM, Nicholas Chammas wrote:
> >>
> >>> Yeah, I remember that hell when I added PEP 8 to the build checks and
> >>> fixed
> >>> all the outstanding Python style issues. I had to keep rebasing and
> >>> resolving merge conflicts until the PR was merged.
> >>>
> >>> It's a rough process, but thankfully it's also a one-time process. I
> >>> might
> >>> be able to help with that in the next week or two if no-one else wants
> to
> >>> pick it up.
> >>>
> >>> Nick
> >>>
> >>> On Wed, Oct 1, 2014 at 9:20 PM, Michael Armbrust <
> mich...@databricks.com
> >>> >
> >>> wrote:
> >>>
> >>>  The hard part here is updating the existing code base... which is
> going
>  to
>  create merge conflicts with like all of the open PRs...
> 
>  On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas <
>  nicholas.cham...@gmail.com> wrote:
> 
>   Ah, since there appears to be a built-in rule for end-of-line
> > whitespace,
> > Michael and Cheng, y'all should be able to add this in pretty easily.
> >
> > Nick
> >
> > On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell 
> > wrote:
> >
> >  Hey Nick,
> >>
> >> We can always take built-in rules. Back when we added this Prashant
> >> Sharma actually did some great work that lets us write our own style
> >> rules in cases where rules don't exist.
> >>
> >> You can see some existing rules here:
> >>
> >>
> >>  https://github.com/apache/spark/tree/master/project/
> > spark-style/src/main/scala/org/apache/spark/scalastyle
> >
> >> Prashant has over time contributed a lot of our custom rules
> upstream
> >> to scalastyle, so now there are only a couple there.
> >>
> >> - Patrick
> >>
> >> On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu  wrote:
> >>
> >>> Please take a look at WhitespaceEndOfLineChecker under:
> >>> http://www.scalastyle.org/rules-0.1.0.html
> >>>
> >>> Cheers
> >>>
> >>> On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas <
> >>>
> >> nicholas.cham...@gmail.com
> >>
> >>> wrote:
>  As discussed here , it
> 
> >>> would be
> >>
> >>> good to extend our Scala style checks to programmatically enforce
> as
> 
> >>> many
> >>
> >>> of our style rules as possible.
> 
>  Does anyone know if it's relatively straightforward to enforce
> 
> >>> additional
> >>
> >>> rules like the "no trailing spaces" rule mentioned in the linked
> PR?
> 
>  Nick
> 
> 
> 
> >>
> >
>