Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread kiran lonikar
So it does not benefit from Project Tungsten, right?


On Mon, Nov 16, 2015 at 12:07 PM, Reynold Xin  wrote:

> It's a completely different path.
>
>
> On Sun, Nov 15, 2015 at 10:37 PM, kiran lonikar  wrote:
>
>> I would like to know whether Hive on Spark uses or shares the execution code
>> with Spark SQL or DataFrames.
>>
>> More specifically, does Hive on Spark benefit from the changes made to
>> Spark SQL for project Tungsten? Or is it a completely different execution path
>> where it creates its own plan and executes on RDDs?
>>
>> -Kiran
>>
>>
>
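A minimal sketch of the distinction, assuming Spark 1.5-era APIs (the class
and table names below are illustrative, not taken from the thread): a query
issued through Spark SQL's HiveContext is planned by Catalyst and, where
supported, executed on the Tungsten path, even when the tables live in the
Hive metastore; Hive on Spark is enabled inside Hive itself, uses Hive's own
planner, and submits the resulting work to Spark as ordinary RDD jobs.

    // Spark SQL path: planned by Catalyst, benefits from Tungsten where supported.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveTablesViaSparkSQL {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("hive-via-spark-sql").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val hiveContext = new HiveContext(sc)
        // The plan for this query is built and optimized by Spark SQL,
        // even though the table metadata comes from the Hive metastore.
        hiveContext.sql("SELECT key, count(*) FROM src GROUP BY key").show()
      }
    }
    // Hive on Spark path: set hive.execution.engine=spark inside Hive's own
    // shell; Hive builds the plan and uses Spark only as the execution backend.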


Hive on Spark Vs Spark SQL

2015-11-15 Thread kiran lonikar
I would like to know whether Hive on Spark uses or shares the execution code
with Spark SQL or DataFrames.

More specifically, does Hive on Spark benefit from the changes made to
Spark SQL for project Tungsten? Or is it a completely different execution path
where it creates its own plan and executes on RDDs?

-Kiran


Re: Code generation for GPU

2015-09-12 Thread kiran lonikar
Thanks for pointing to the YARN JIRA. For now, it is good material for my talk,
since it shows that the Hadoop and big data community is already aware of
GPUs and is making an effort to exploit them.

Good luck with your talk. That fear is lurking in my mind too :)
On 10-Sep-2015 2:08 pm, "Steve Loughran"  wrote:

>
> > On 9 Sep 2015, at 20:18, lonikar  wrote:
> >
> > I have seen a perf improvement of 5-10 times on expression evaluation
> even
> > on "ordinary" laptop GPUs. Thus, it will be a good demo along with some
> > concrete proposals for vectorization. As you said, I will have to hook
> up to
> > a column structure and perform computation and let the existing spark
> > computation also proceed and compare the performance.
> >
>
> you might also be interested to know that there's now a YARN JIRA on
> making GPU another resource you can ask for
> https://issues.apache.org/jira/browse/YARN-4122
>
> if implemented, it'd let you submit work into the cluster asking for GPUs,
> and get allocated containers on servers with the GPU capacity you need.
> This'd allow you to share GPUs with other code (including your own
> containers)
>
> > I will focus on the slides early (7th Oct is deadline), and then continue
> > the work for another 3 weeks till the summit. It still gives me enough
> time
> > to do considerable work. Hope your fear does not come true.
>
> good luck. And the fear is about my talk at apachecon on the Hadoop stack
> & Kerberos
> >
>
>


Re: Code generation for GPU

2015-09-12 Thread kiran lonikar
Thanks. Yes, that's exactly what I would like to do: copy large amounts of
data to GPU RAM, perform the computation, and get bulk rows back as the
map/filter or reduce result. It is true that non-trivial operations benefit
more. Streaming data to GPU RAM and interleaving computation with data
transfer also works, but it complicates the design, and doing it in spark
would be even more complicated.

Thanks for bringing up sorting. It's a good idea since, as you pointed out,
it is already well isolated. I had been looking at the terasort effort, and
it is something I always wanted to take up, but I somehow thought expressions
would be easier to deal with in the short term. I would love to work on
sorting after this, especially because Unsafe deals with primitive types and
is well suited to the GPU computation model. It would be exciting to better
the terasort record too.
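A minimal sketch of the shape of that sorting problem, with illustrative names
rather than Spark's actual interfaces: UnsafeInMemorySorter keeps a flat array
of (record pointer, 64-bit key prefix) entries and sorts them by prefix, so
only that primitive array would need to be shipped to the GPU, where a radix
sort over the prefixes would stand in for the single-threaded Timsort pass.

    // Illustrative sketch only; names are made up for this thread.
    final case class PointerAndPrefix(recordPointer: Long, keyPrefix: Long)

    // CPU baseline a GPU radix sort would compete against: sort the entries by
    // prefix, leaving the records themselves in place.
    def cpuBaselineSort(entries: Array[PointerAndPrefix]): Array[PointerAndPrefix] =
      entries.sortBy(_.keyPrefix)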

Kiran
On 10-Sep-2015 1:12 pm, "Paul Wais"  wrote:

> In order to get a major speedup from applying *single-pass*
> map/filter/reduce
> operations on an array in GPU memory, wouldn't you need to stream the
> columnar data directly into GPU memory somehow?  You might find in your
> experiments that GPU memory allocation is a bottleneck.  See e.g. John
> Canny's paper here (Section 1.1 paragraph 2):
> http://www.cs.berkeley.edu/~jfc/papers/13/BIDMach.pdf
> If the per-item operation is very non-trivial, though, a dramatic GPU speedup
> may be more likely.
>
> Something related (and perhaps easier to contribute to Spark) might be a
> GPU-accelerated sorter for sorting Unsafe records.  Especially since that
> stuff is already broken out somewhat well-- e.g. `UnsafeInMemorySorter`.
> Spark appears to use (single-threaded) Timsort for sorting Unsafe records,
> so I imagine a multi-thread/multi-core GPU solution could handily beat
> that.
>
>
>
>


Spark 1.5.0: setting up debug env

2015-09-11 Thread lonikar
I have set up a spark debug env on Windows and Mac, and thought it's worth
sharing given some of the issues I encountered; the instructions given here
did not work for *eclipse* (possibly outdated now). The first step "sbt/sbt"
or "build/sbt" hangs while downloading sbt with the message "Getting
org.scala-sbt sbt 0.13.7 ...". I tried the alternative "build/mvn
eclipse:eclipse", but that too failed as the generated .classpath files
contained classpathentry elements only for the java files.

1. Build spark using maven on the command line. This downloads all the
necessary jars from the maven repos and speeds up the eclipse build. Maven
3.3.3 is required; Spark ships with it. Just use build/mvn and ensure that
there is no "mvn" command in PATH (build/mvn -Pyarn -Phadoop-2.6
-Dhadoop.version=2.6.0 -DskipTests clean package).
2. Download latest scala-ide (4.1.1 as of now) from http://scala-ide.org
3. Check if the eclipse scala maven plugin is installed. If not, install it:
Help --> Install New Software -->
http://alchim31.free.fr/m2e-scala/update-site/ which is sourced from
https://github.com/sonatype/m2eclipse-scala.
4. If using scala 2.10, add a 2.10.4 installation. If you build spark using
the steps described above (build/mvn -Pyarn -Phadoop-2.6
-Dhadoop.version=2.6.0 -DskipTests clean package), it gets installed in
build/scala-2.10.4. In Eclipse Preferences -> Scala -> Installations -> Add,
specify the build/scala-2.10.4/lib directory under the spark source root.
5. In Eclipse -> Project, disable Build Automatically. This avoids building
projects until all projects are imported and some settings are changed.
Otherwise, eclipse spends hours building projects in a half-baked state.
6. In Eclipse -> Preferences -> Java -> Compiler -> Errors/Warnings -->
Deprecated and Restricted API, change the setting from Error to Warning.
This takes care of the Unsafe classes used by project tungsten.
7. Import maven projects: In eclipse, File --> Import --> Maven --> Existing
Maven Projects (*not General --> Existing projects in workspace*).
8. After the projects are completely imported, select all projects except
java8-tests_2.10, spark-assembly_2.10 and spark-parent_2.10, right click and
choose Scala -> Set the Scala Installation, and choose 2.10.4. This does not
work for some projects; for those, right click on each project, open
Properties -> Scala Compiler, check Use Project Settings, select the scala
2.10.4 installation and click OK.
9. Some projects will give the error "Plugin execution not covered by
lifecycle configuration" when building. The fix is to wrap the <plugins>
element of the affected pom.xml in a <pluginManagement> ... </pluginManagement>
element. The projects which need this change are spark-streaming-flume-sink_2.10
(external/flume-sink/pom.xml), spark-repl_2.10 (repl/pom.xml),
spark-sql_2.10 (sql/pom.xml), spark-hive_2.10 (sql/hive/pom.xml),
spark-hive-thriftserver_2.10 (sql/hive-thriftserver/pom.xml),
spark-unsafe_2.10 (unsafe/pom.xml).
10. Right click on project spark-streaming-flume-sink_2.10, Properties ->
Java Build Path -> Source -> Add Folder. Navigate to target -> scala-2.10 ->
src_managed -> main -> compiled_avro. Check the checkbox and click OK.
11. Now enable Project -> Build Automatically. Sit back and relax. If the
build fails for some projects (SBT crashes sometimes), just select those and
run Project -> Clean -> Clean selected projects.
12. After the build completes (hopefully without any errors), run/debug an
example from spark-examples_2.10. You should be able to put breakpoints in
spark code and debug. You may have to change the source of the example to add
*/.setMaster("local")/* on the */val sparkConf/* line. After this minor
change, it will work. Also, the first time you debug, it will ask you to
specify the source path. Just select Add -> Java Project -> select all spark
projects. Let the first debugging session complete, as it will not show any
spark code; you may disable breakpoints in this session to let it run through.
Subsequent sessions allow you to step through the spark code. Enjoy!
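For the one-line change in step 12, a sketch of what the modified example
looks like (the example and app name here are illustrative, not a specific
file in the spark tree):

    // Appending .setMaster("local") lets the example run directly from
    // eclipse's Run/Debug instead of requiring spark-submit.
    val sparkConf = new SparkConf().setAppName("SomeSparkExample").setMaster("local")
    val sc = new SparkContext(sparkConf)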

You may not have to go through all this if you use scala 2.11 or IntelliJ. But
if you are like me, using eclipse and spark's current scala 2.10.4, you will
find this useful and avoid a lot of googling.

The one issue I still have is debugging/setting breakpoints in the generated
java code for expressions. This code is generated as a string in
spark-catalyst_2.10, in org.apache.spark.sql.catalyst.expressions and
org.apache.spark.sql.catalyst.expressions.codegen. If anyone has figured out
how to do it, please update this thread.




Re: Spark 1.5: How to trigger expression execution through UnsafeRow/TungstenProject

2015-09-11 Thread lonikar
Thanks, that worked.






Re: Spark 1.5.x: Java files in src/main/scala and vice versa

2015-09-11 Thread lonikar
It does not cause any problem when building using maven. But when doing
eclipse:eclipse, the generated .classpath files contained classpathentry
elements only for the java source folders. This caused all the .scala sources
to be ignored and caused all kinds of eclipse build errors. The errors were
resolved only when I added the prebuilt jars to the java build path, but that
also prevented me from debugging spark code.

I understand eclipse:eclipse is not the recommended way of creating eclipse
projects, but that's how I noticed this issue.

As Sean said, it's a matter of code organization, and it's confusing to find
java files in src/main/scala. In my env, I moved the files and did not notice
any issues. Unless there is a specific purpose, it would be better if the
code were reorganized.






Spark 1.5.x: Java files in src/main/scala and vice versa

2015-09-10 Thread lonikar
I found these files:
spark-1.5.0/sql/catalyst/*src/main/scala*/org/apache/spark/sql/types/*SQLUserDefinedType.java*

spark-1.5.0/core/src/main/java/org/apache/spark/api/java/function/package.scala
and several java files in spark-1.5.0/core/src/main/scala/.

Is this intentional or inadvertent?






Spark 1.5: How to trigger expression execution through UnsafeRow/TungstenProject

2015-09-09 Thread lonikar
The tungsten, codegen etc. options are enabled by default, but I am not able
to get execution to go through UnsafeRow/TungstenProject. It still executes
using InternalRow/Project.

I see this in SparkStrategies.scala: "If unsafe mode is enabled and we
support these data types in Unsafe, use the tungsten project. Otherwise use
the normal project."

Can someone give example code that triggers this? I tried some of the
primitive types but it did not work.
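A minimal sketch of the kind of check involved, assuming a 1.5 build with the
tungsten/codegen options left at their defaults (the column names and data are
made up for illustration): project a simple arithmetic expression over
primitive-only columns and look at explain() to see whether TungstenProject or
the normal Project was planned.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("tungsten-check").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Primitive-only schema, projected through a simple arithmetic expression.
    val df = sc.parallelize(1 to 1000).map(i => (i, i.toLong * 2)).toDF("a", "b")
    df.select($"a" + 1, $"b" * 3).explain()  // inspect the physical plan for TungstenProject vs. Project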






Re: Code generation for GPU

2015-09-09 Thread lonikar
I am already looking at the DataFrame APIs and the implementation. In fact,
the columnar representation
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
is what gave me the idea for my talk proposal. It is ideally suited for
computation on a GPU. But from what Reynold said, it appears that the columnar
structure is not exploited for computation such as expression evaluation; it
is used only for space-efficient in-memory storage, not for computations. Even
the TungstenProject invokes the operations row by row. The UnsafeRow is
optimized in the sense that it is only a logical row, as opposed to the
InternalRow, which has physical copies of the values. But the computation is
still on a per-row basis rather than on batches of rows stored in a columnar
structure.
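A minimal sketch of the contrast being described, with made-up function names
rather than Spark's actual interfaces: row-at-a-time evaluation calls the
expression once per row, while a column-batch form makes one call over an
entire column, and that whole loop is the unit of work a GPU kernel could
replace.

    // Row-at-a-time: invoked once per row, as the current projection does.
    def evalRow(x: Double): Double = x * 2.0 + 1.0

    // Column-batch: one call per column batch; this loop is what a GPU kernel
    // operating directly on the columnar buffer would take over.
    def evalColumn(xs: Array[Double]): Array[Double] = xs.map(x => x * 2.0 + 1.0)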

Thanks for the concrete suggestions on the presentation. I do have the core
idea or theme of my talk in mind, but I will now present it along the lines
you suggest. I wasn't really thinking of a demo, but now I will do that. I
was actually hoping to contribute to the spark code and show results on those
changes rather than on offline changes. I will still try to do that by hooking
into the columnar structure, but it may not be in a shape that can go into the
spark code. That's what I meant by severely limiting the scope of my talk.

I have seen a perf improvement of 5-10 times on expression evaluation even
on "ordinary" laptop GPUs, so it should make a good demo along with some
concrete proposals for vectorization. As you said, I will have to hook up to
a column structure, perform the computation, let the existing spark
computation also proceed, and compare the performance.

I will focus on the slides early (7th Oct is the deadline), and then continue
the work for another 3 weeks till the summit. That still gives me enough time
to do considerable work. I hope your fear does not come true.









Re: Code generation for GPU

2015-09-07 Thread lonikar
Hi Reynold,

Thanks for responding. I was watching for a reply on the spark user group and
my own email id since I had not posted this on spark dev, and just saw your
reply.

1. I figured out that the various code generation classes have either an
*apply* or an *eval* method, depending on whether they compute something or
use the expression as a filter. The code that executes this generated code is
in sql.execution.basicOperators.scala.

2. If the vectorization is difficult or a major effort, I am not sure how I
am going to implement even a glimpse of the changes I would like to. I think
I will have to be satisfied with only a partial effort. Batching rows defeats
the purpose, as I have found that it consumes a considerable amount of CPU
cycles, and producing one row at a time also takes away the performance
benefit. What's really required is to access a large partition and produce
the result partition in one shot.

In that case I think I will have to severely limit the scope of my talk, or
re-orient it to propose the changes instead of presenting the results of
execution on a GPU. Please suggest, since you seem to have selected the talk.

3. I agree, it's pretty fast-paced development. I have started working on the
1.5.1 snapshot.

4. How do I tune the batch size (number of rows in the ByteBuffer)? Is it
through the property spark.sql.inMemoryColumnarStorage.batchSize?

-Kiran
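On point 4: spark.sql.inMemoryColumnarStorage.batchSize is the documented
setting for the number of rows per batch in the in-memory columnar cache
(default 10000), and it can be set through SQLConf, e.g. as a sketch assuming
an existing sqlContext:

    // Larger batches trade memory for fewer per-batch overheads.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "20000")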


