Re: [ compress in-memory column storage used in sparksql cache table ]

2015-09-02 Thread Cheng Lian
Yeah, two of the reasons why the built-in in-memory columnar storage 
doesn't achieve a compression ratio comparable to Parquet's are:


1. The in-memory columnar representation doesn't handle nested types, so 
array/map/struct values are not compressed.
2. Parquet may apply more than one compression method to a single column, 
for example dictionary encoding plus RLE.
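
For anyone who wants to compare the two on their own data, here is a minimal
sketch (Spark 1.5-era APIs; the table name, output path and toy data below
are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setAppName("columnar-compression").setMaster("local[4]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(1 to 1000000, 8)
  .map(i => (i, s"value_$i"))
  .toDF("id", "value")
df.registerTempTable("xyz")

// In-memory columnar cache: per-column compression (RLE, dictionary, delta, ...)
// is controlled by this flag, which is true by default.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.cacheTable("xyz")
sqlContext.sql("SELECT COUNT(*) FROM xyz").show()  // materializes the cache

// Parquet can stack encodings (e.g. dictionary + RLE) plus a general-purpose
// codec on top, and also compresses nested columns, hence the better ratio.
df.write.parquet("/tmp/xyz_parquet")

The size of the cached table shows up on the web UI's Storage tab, which makes
the comparison against the Parquet output size straightforward.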


Cheng

On 9/2/15 3:58 PM, Nitin Goyal wrote:

I think Spark SQL's in-memory columnar cache already does compression. Check
out the classes under the following path:

https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/compression

Although the compression ratio is not as good as Parquet's.

Thanks
-Nitin











Re: [ compress in-memory column storage used in sparksql cache table ]

2015-09-02 Thread Nitin Goyal
I think Spark SQL's in-memory columnar cache already does compression. Check
out the classes under the following path:

https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/compression

Although the compression ratio is not as good as Parquet's.

Thanks
-Nitin







Re: taking an n number of rows from an RDD starting from an index

2015-09-02 Thread Hemant Bhanawat
I think rdd.toLocalIterator is what you want. But it will keep one
partition's data in-memory.
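
A rough sketch of that approach (the RDD, start index and batch size below
are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("local-iterator-example").setMaster("local[4]"))
val rdd = sc.parallelize(1 to 100000, 10)

val startIndex = 1001  // index of the first row to take (0-based)
val batchSize  = 1000  // number of rows to take

// toLocalIterator pulls partitions to the driver one at a time and preserves
// the RDD's order, so drop/take yields rows startIndex .. startIndex + batchSize - 1.
val batch = rdd.toLocalIterator.drop(startIndex).take(batchSize).toArray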

On Wed, Sep 2, 2015 at 10:05 AM, Niranda Perera 
wrote:

> Hi all,
>
> I have a large data set which would not fit into memory, so I want to
> take n rows from the RDD starting at a particular index. For example,
> take 1000 rows starting from index 1001.
>
> I see that there is a take(num: Int): Array[T] method on the RDD, but it
> only returns the first n rows.
>
> The simplest use case of this requirement is, say, I write a custom
> relation provider with a custom relation extending InsertableRelation.
>
> Say I submit this query:
> "insert into table abc select * from xyz sort by x asc"
>
> In my custom relation, I have implemented the def insert(data: DataFrame,
> overwrite: Boolean): Unit method. Here, since the data is large, I cannot
> call methods such as DataFrame.collect(). Instead, I could use
> DataFrame.foreachPartition(...). As you can see, the resulting DF from
> "select * from xyz sort by x asc" is sorted, and if I run foreachPartition
> on that DF and implement the insert method, this sorted order would be
> lost, since the insert operation would be done in parallel on each
> partition.
>
> To handle this, my initial idea was to take rows from the RDD in batches
> and do the insert operation, and for that I was looking for a method to
> take n rows starting from a given index.
>
> Is there any better way to handle this with RDDs?
>
> Your assistance in this regard is highly appreciated.
>
> cheers
>
> --
> Niranda
> @n1r44 
> https://pythagoreanscript.wordpress.com/
>


Re: taking an n number of rows from an RDD starting from an index

2015-09-02 Thread Juan Rodríguez Hortalá
Hi,

Maybe you could use zipWithIndex and filter to skip the first elements. For
example starting from

scala> sc.parallelize(100 to 120, 4).zipWithIndex.collect
res12: Array[(Int, Long)] = Array((100,0), (101,1), (102,2), (103,3),
(104,4), (105,5), (106,6), (107,7), (108,8), (109,9), (110,10), (111,11),
(112,12), (113,13), (114,14), (115,15), (116,16), (117,17), (118,18),
(119,19), (120,20))

we can get the first 3 elements starting from the 4th (counting from 0) as

scala> sc.parallelize(100 to 120, 4).zipWithIndex.filter(_._2 >=4).take(3)
res14: Array[(Int, Long)] = Array((104,4), (105,5), (106,6))
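
If this comes up often, the same pattern can be wrapped in a small helper (a
sketch, not an existing Spark API):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Take `count` rows starting at `startIndex` (0-based), preserving the RDD's order.
def sliceRdd[T: ClassTag](rdd: RDD[T], startIndex: Long, count: Int): Array[T] =
  rdd.zipWithIndex()
     .filter { case (_, idx) => idx >= startIndex }  // skip rows before the index
     .map(_._1)                                      // drop the index again
     .take(count)                                    // first `count` remaining rows

// e.g. sliceRdd(sc.parallelize(100 to 120, 4), startIndex = 4, count = 3)
//      returns Array(104, 105, 106)

Note that zipWithIndex itself triggers a Spark job when the RDD has more than
one partition, in order to compute the per-partition counts.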

Hope that helps


2015-09-02 8:52 GMT+02:00 Hemant Bhanawat :

> I think rdd.toLocalIterator is what you want. But it will keep one
> partition's data in-memory.
>
> On Wed, Sep 2, 2015 at 10:05 AM, Niranda Perera 
> wrote:
>
>> Hi all,
>>
>> I have a large data set which would not fit into memory, so I want to
>> take n rows from the RDD starting at a particular index. For example,
>> take 1000 rows starting from index 1001.
>>
>> I see that there is a take(num: Int): Array[T] method on the RDD, but it
>> only returns the first n rows.
>>
>> The simplest use case of this requirement is, say, I write a custom
>> relation provider with a custom relation extending InsertableRelation.
>>
>> Say I submit this query:
>> "insert into table abc select * from xyz sort by x asc"
>>
>> In my custom relation, I have implemented the def insert(data: DataFrame,
>> overwrite: Boolean): Unit method. Here, since the data is large, I cannot
>> call methods such as DataFrame.collect(). Instead, I could use
>> DataFrame.foreachPartition(...). As you can see, the resulting DF from
>> "select * from xyz sort by x asc" is sorted, and if I run foreachPartition
>> on that DF and implement the insert method, this sorted order would be
>> lost, since the insert operation would be done in parallel on each
>> partition.
>>
>> To handle this, my initial idea was to take rows from the RDD in batches
>> and do the insert operation, and for that I was looking for a method to
>> take n rows starting from a given index.
>>
>> Is there any better way to handle this with RDDs?
>>
>> Your assistance in this regard is highly appreciated.
>>
>> cheers
>>
>> --
>> Niranda
>> @n1r44 
>> https://pythagoreanscript.wordpress.com/
>>
>
>


Re: OOM in spark driver

2015-09-02 Thread Mike Hynes
Just a thought; this has worked for me before on a standalone client
with a similar OOM error in a driver thread. Try setting:
export SPARK_DAEMON_MEMORY=4G  # or whatever size you can afford on your machine
in your environment/spark-env.sh before running spark-submit.
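
As a quick sanity check (just a sketch, nothing Spark-specific), you can
print the driver JVM's actual maximum heap from driver code to confirm that
whichever memory setting you change really took effect:

// Run this in driver code (not inside an RDD closure); it reports the
// maximum heap the driver JVM was started with.
val maxHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
println(s"Driver max heap: ${maxHeapMb} MB")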
Mike

On 9/2/15, ankit tyagi  wrote:
> Hi All,
>
> I am using spark-sql 1.3.1 with Hadoop 2.4.0. I am running a SQL query
> against Parquet files and wanted to save the result to S3, but it looks
> like the problem from https://issues.apache.org/jira/browse/SPARK-2984
> still occurs while saving data to S3.
>
> Hence, I am now saving the result on HDFS and, with the help of a
> JavaSparkListener, copying the file from HDFS to S3 with Hadoop's FileUtil
> in the onApplicationEnd method. But my job is failing with an OOM in the
> Spark driver.
>
> 15/09/02 04:17:57 INFO cluster.YarnClusterSchedulerBackend: Asking each
> executor to shut down
> 15/09/02 04:17:59 INFO
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorActor:
> OutputCommitCoordinator stopped!
> Exception in thread "Reporter"
> Exception: java.lang.OutOfMemoryError thrown from the
> UncaughtExceptionHandler in thread "Reporter"
> Exception in thread "SparkListenerBus"
> Exception: java.lang.OutOfMemoryError thrown from the
> UncaughtExceptionHandler in thread "SparkListenerBus"
> Exception in thread "Driver"
> Exception: java.lang.OutOfMemoryError thrown from the
> UncaughtExceptionHandler in thread "Driver"
>
>
> The strange part is that the result does get saved on HDFS, but the job
> fails while copying the file. The size of the file is under 1 MB.
>
> Any help or leads would be appreciated.
>


-- 
Thanks,
Mike




Re: taking an n number of rows from an RDD starting from an index

2015-09-02 Thread Niranda Perera
Hi all,

thank you for your response.

After taking a look at the implementation of rdd.collect(), I thought of
using the SparkContext.runJob(...) method:

import org.apache.spark.sql.Row

val rdd = dataFrame.rdd
for (i <- 0 until rdd.partitions.length) {
  dataFrame.sqlContext.sparkContext.runJob(rdd,
    (iter: Iterator[Row]) => { /* some function, e.g. insert this partition's rows */ },
    Seq(i), allowLocal = false)  // one partition at a time, in original order
}

This iterates through the partitions of the DataFrame one at a time.

I would like to know whether this is an accepted way of iterating through
DataFrame partitions while preserving the order of rows encapsulated by the
DataFrame.

cheers


On Wed, Sep 2, 2015 at 12:33 PM, Juan Rodríguez Hortalá <
juan.rodriguez.hort...@gmail.com> wrote:

> Hi,
>
> Maybe you could use zipWithIndex and filter to skip the first elements.
> For example starting from
>
> scala> sc.parallelize(100 to 120, 4).zipWithIndex.collect
> res12: Array[(Int, Long)] = Array((100,0), (101,1), (102,2), (103,3),
> (104,4), (105,5), (106,6), (107,7), (108,8), (109,9), (110,10), (111,11),
> (112,12), (113,13), (114,14), (115,15), (116,16), (117,17), (118,18),
> (119,19), (120,20))
>
> we can get the first 3 elements starting from the 4th (counting from 0) as
>
> scala> sc.parallelize(100 to 120, 4).zipWithIndex.filter(_._2 >=4).take(3)
> res14: Array[(Int, Long)] = Array((104,4), (105,5), (106,6))
>
> Hope that helps
>
>
> 2015-09-02 8:52 GMT+02:00 Hemant Bhanawat :
>
>> I think rdd.toLocalIterator is what you want. But it will keep one
>> partition's data in-memory.
>>
>> On Wed, Sep 2, 2015 at 10:05 AM, Niranda Perera wrote:
>>
>>> Hi all,
>>>
>>> I have a large data set which would not fit into memory, so I want to
>>> take n rows from the RDD starting at a particular index. For example,
>>> take 1000 rows starting from index 1001.
>>>
>>> I see that there is a take(num: Int): Array[T] method on the RDD, but it
>>> only returns the first n rows.
>>>
>>> The simplest use case of this requirement is, say, I write a custom
>>> relation provider with a custom relation extending InsertableRelation.
>>>
>>> Say I submit this query:
>>> "insert into table abc select * from xyz sort by x asc"
>>>
>>> In my custom relation, I have implemented the def insert(data: DataFrame,
>>> overwrite: Boolean): Unit method. Here, since the data is large, I cannot
>>> call methods such as DataFrame.collect(). Instead, I could use
>>> DataFrame.foreachPartition(...). As you can see, the resulting DF from
>>> "select * from xyz sort by x asc" is sorted, and if I run foreachPartition
>>> on that DF and implement the insert method, this sorted order would be
>>> lost, since the insert operation would be done in parallel on each
>>> partition.
>>>
>>> To handle this, my initial idea was to take rows from the RDD in batches
>>> and do the insert operation, and for that I was looking for a method to
>>> take n rows starting from a given index.
>>>
>>> Is there any better way to handle this with RDDs?
>>>
>>> Your assistance in this regard is highly appreciated.
>>>
>>> cheers
>>>
>>> --
>>> Niranda
>>> @n1r44 
>>> https://pythagoreanscript.wordpress.com/
>>>
>>
>>
>


-- 
Niranda
@n1r44 
https://pythagoreanscript.wordpress.com/


[HELP] Spark 1.4.1 tasks take ridiculously long time to complete

2015-09-02 Thread lankaz
Hi, this is an image of the screenshot; some tasks take seconds to execute,
some take hours.

 







Spark SQL sort by and collect by in multiple partitions

2015-09-02 Thread Niranda Perera
Hi all,

I have been using SORT BY and ORDER BY in Spark SQL and I observed the
following.

When using SORT BY and collecting the results, the results are sorted
partition by partition. For example, if we have 1, 2, ..., 12 in 4 partitions
and I want to sort in descending order,
partition 0 (p0) would have 12, 8, 4
p1 = 11, 7, 3
p2 = 10, 6, 2
p3 = 9, 5, 1

so collect() would return 12, 8, 4, 11, 7, 3, 10, 6, 2, 9, 5, 1

BUT when I use ORDER BY and collect the results,
p0 = 12, 11, 10
p1 = 9, 8, 7
...
so collect() would return 12, 11, ..., 1, which is the desired result.

Is this the intended behavior of SORT BY and ORDER BY, or is there something
I'm missing?
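
For reference, a minimal sketch of what I am describing (Spark 1.5-era APIs;
the local master, table name and toy data are made up, and it assumes a build
with Hive support so that the SORT BY syntax is accepted):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(
  new SparkConf().setAppName("sort-by-vs-order-by").setMaster("local[4]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

sc.parallelize(1 to 12, 4).toDF("x").registerTempTable("xyz")

// SORT BY sorts within each partition only, so collect() returns
// per-partition sorted runs rather than one globally sorted sequence.
sqlContext.sql("SELECT x FROM xyz SORT BY x DESC").collect().foreach(println)

// ORDER BY imposes a total ordering across all partitions.
sqlContext.sql("SELECT x FROM xyz ORDER BY x DESC").collect().foreach(println)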

cheers

-- 
Niranda
@n1r44 
https://pythagoreanscript.wordpress.com/


Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-02 Thread Sean Owen
- As usual the license and signatures are OK
- No blockers, check
- 9 "Critical" bugs for 1.5.0 are listed below just for everyone's
reference (48 total issues still targeted for 1.5.0)
- Under Java 7 + Ubuntu 15, I only had one consistent test failure,
but obviously it's not failing in Jenkins
- I saw more test failures with Java 8, but they seemed like flaky
tests, so I am pretending I didn't see them


Test failure

DirectKafkaStreamSuite:
- offset recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 197
times over 10.012973046 seconds. Last failure message:
strings.forall({
((elem: Any) => DirectKafkaStreamSuite.collectedData.contains(elem))
  }) was false. (DirectKafkaStreamSuite.scala:249)


1.5.0 Critical Bugs

SPARK-6484 Spark Core Ganglia metrics xml reporter doesn't escape
correctly Josh Rosen
SPARK-6701 Tests, YARN Flaky test: o.a.s.deploy.yarn.YarnClusterSuite
Python application
SPARK-7420 Tests Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not
clear received block data too soon" Tathagata Das
SPARK-8119 Spark Core HeartbeatReceiver should not adjust application
executor resources Andrew Or
SPARK-8414 Spark Core Ensure ContextCleaner actually triggers clean
ups Andrew Or
SPARK-8447 Shuffle Test external shuffle service with all shuffle managers
SPARK-10224 Streaming BlockGenerator may lost data in the last block
SPARK-10310 SQL [Spark SQL] All result records will be popluated into
ONE line during the script transform due to missing the correct
line/filed delimeter
SPARK-10337 SQL Views are broken


On Tue, Sep 1, 2015 at 9:41 PM, Reynold Xin  wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
> The tag to be voted on is v1.5.0-rc3:
> https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (published as 1.5.0-rc3) can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1143/
>
> The staging repository for this release (published as 1.5.0) can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1142/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
>
> 
> What justifies a -1 vote for this release?
> 
> This vote is happening towards the end of the 1.5 QA period, so -1 votes
> should only occur for significant regressions from 1.4. Bugs already present
> in 1.4, minor regressions, or bugs related to new features will not block
> this release.
>
>
> ===
> What should happen to JIRA tickets still targeting 1.5.0?
> ===
> 1. It is OK for documentation patches to target 1.5.0 and still go into
> branch-1.5, since documentation will be packaged separately from the
> release.
> 2. New features for non-alpha-modules should target 1.6+.
> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
> version.
>
>
> ==
> Major changes to help you focus your testing
> ==
>
> As of today, Spark 1.5 contains more than 1000 commits from 220+
> contributors. I've curated a list of important changes for 1.5. For the
> complete list, please refer to Apache JIRA changelog.
>
> RDD/DataFrame/SQL APIs
>
> - New UDAF interface
> - DataFrame hints for broadcast join
> - expr function for turning a SQL expression into DataFrame column
> - Improved support for NaN values
> - StructType now supports ordering
> - TimestampType precision is reduced to 1us
> - 100 new built-in expressions, including date/time, string, math
> - memory and local disk only checkpointing
>
> DataFrame/SQL Backend Execution
>
> - Code generation on by default
> - Improved join, aggregation, shuffle, sorting with cache friendly
> algorithms and external 

Harmonic centrality in GraphX

2015-09-02 Thread Pavel Gladkov
Hi,

What do you think about this algorithm:
https://github.com/webgeist/spark-centrality/blob/master/src/main/scala/cc/p2k/spark/graphx/lib/HarmonicCentrality.scala

This is an implementation of the Harmonic Centrality algorithm
http://infoscience.epfl.ch/record/200525/files/%5BEN%5DASNA09.pdf.
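
For context, the harmonic centrality of a vertex v is typically defined as the
sum of reciprocal shortest-path distances over all other vertices u:

    H(v) = \sum_{u \neq v} \frac{1}{d(u, v)}

where 1/d(u, v) is taken to be 0 when there is no path between u and v.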

Should it be in GraphX lib?

-- 
Pavel Gladkov