Re: Spark 1.6.1

2016-02-22 Thread Reynold Xin
Yes, we don't want to clutter maven central.


The staging repo is included in the release candidate voting thread.

See the following for an example:

http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC1-td15424.html
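For a build that wants to resolve a staged release candidate, the missing piece is a <repositories> entry in the pom. A minimal sketch, with placeholder id/URL (each RC gets its own numbered staging repository; the real URL is posted in that RC's voting thread):

  <!-- sketch only: swap the URL for the staging repo named in the RC voting thread -->
  <repositories>
    <repository>
      <id>spark-rc-staging</id>
      <name>Apache staging repo for the Spark RC (placeholder)</name>
      <url>https://repository.apache.org/content/repositories/orgapachespark-XXXX/</url>
    </repository>
  </repositories>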

On Mon, Feb 22, 2016 at 11:37 PM, Romi Kuntsman  wrote:

> Sounds fair. Is it to avoid cluttering maven central with too many
> intermediate versions?
>
> What do I need to add in my pom.xml  section to make it work?
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Tue, Feb 23, 2016 at 9:34 AM, Reynold Xin  wrote:
>
>> We usually publish to a staging maven repo hosted by the ASF (not maven
>> central).
>>
>>
>>
>> On Mon, Feb 22, 2016 at 11:32 PM, Romi Kuntsman  wrote:
>>
>>> Is it possible to make RC versions available via Maven? (many projects
>>> do that)
>>> That will make integration much easier, so many more people can test the
>>> version before the final release.
>>> Thanks!
>>>
>>> *Romi Kuntsman*, *Big Data Engineer*
>>> http://www.totango.com
>>>
>>> On Tue, Feb 23, 2016 at 8:07 AM, Luciano Resende 
>>> wrote:
>>>


 On Mon, Feb 22, 2016 at 9:08 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

> An update: people.apache.org has been shut down so the release
> scripts are broken. Will try again after we fix them.
>
>
 If you skip uploading to people.a.o, it should still be available in
 nexus for review.

 The other option is to add the RC into
 https://dist.apache.org/repos/dist/dev/



 --
 Luciano Resende
 http://people.apache.org/~lresende
 http://twitter.com/lresende1975
 http://lresende.blogspot.com/

>>>
>>>
>>
>


Re: Spark 1.6.1

2016-02-22 Thread Romi Kuntsman
Sounds fair. Is it to avoid cluttering maven central with too many
intermediate versions?

What do I need to add in my pom.xml  section to make it work?

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Tue, Feb 23, 2016 at 9:34 AM, Reynold Xin  wrote:

> We usually publish to a staging maven repo hosted by the ASF (not maven
> central).
>
>
>
> On Mon, Feb 22, 2016 at 11:32 PM, Romi Kuntsman  wrote:
>
>> Is it possible to make RC versions available via Maven? (many projects do
>> that)
>> That will make integration much easier, so many more people can test the
>> version before the final release.
>> Thanks!
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>> On Tue, Feb 23, 2016 at 8:07 AM, Luciano Resende 
>> wrote:
>>
>>>
>>>
>>> On Mon, Feb 22, 2016 at 9:08 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 An update: people.apache.org has been shut down so the release scripts
 are broken. Will try again after we fix them.


>>> If you skip uploading to people.a.o, it should still be available in
>>> nexus for review.
>>>
>>> The other option is to add the RC into
>>> https://dist.apache.org/repos/dist/dev/
>>>
>>>
>>>
>>> --
>>> Luciano Resende
>>> http://people.apache.org/~lresende
>>> http://twitter.com/lresende1975
>>> http://lresende.blogspot.com/
>>>
>>
>>
>


Re: Spark 1.6.1

2016-02-22 Thread Reynold Xin
We usually publish to a staging maven repo hosted by the ASF (not maven
central).



On Mon, Feb 22, 2016 at 11:32 PM, Romi Kuntsman  wrote:

> Is it possible to make RC versions available via Maven? (many projects do
> that)
> That will make integration much easier, so many more people can test the
> version before the final release.
> Thanks!
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Tue, Feb 23, 2016 at 8:07 AM, Luciano Resende 
> wrote:
>
>>
>>
>> On Mon, Feb 22, 2016 at 9:08 PM, Michael Armbrust > > wrote:
>>
>>> An update: people.apache.org has been shut down so the release scripts
>>> are broken. Will try again after we fix them.
>>>
>>>
>> If you skip uploading to people.a.o, it should still be available in
>> nexus for review.
>>
>> The other option is to add the RC into
>> https://dist.apache.org/repos/dist/dev/
>>
>>
>>
>> --
>> Luciano Resende
>> http://people.apache.org/~lresende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
>>
>
>


Re: Spark 1.6.1

2016-02-22 Thread Romi Kuntsman
Is it possible to make RC versions available via Maven? (many projects do
that)
That will make integration much easier, so many more people can test the
version before the final release.
Thanks!

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Tue, Feb 23, 2016 at 8:07 AM, Luciano Resende 
wrote:

>
>
> On Mon, Feb 22, 2016 at 9:08 PM, Michael Armbrust 
> wrote:
>
>> An update: people.apache.org has been shut down so the release scripts
>> are broken. Will try again after we fix them.
>>
>>
> If you skip uploading to people.a.o, it should still be available in nexus
> for review.
>
> The other option is to add the RC into
> https://dist.apache.org/repos/dist/dev/
>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: Spark 1.6.1

2016-02-22 Thread Luciano Resende
On Mon, Feb 22, 2016 at 9:08 PM, Michael Armbrust 
wrote:

> An update: people.apache.org has been shut down so the release scripts
> are broken. Will try again after we fix them.
>
>
If you skip uploading to people.a.o, it should still be available in nexus
for review.

The other option is to add the RC into
https://dist.apache.org/repos/dist/dev/



-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Spark 1.6.1

2016-02-22 Thread Michael Armbrust
An update: people.apache.org has been shut down so the release scripts are
broken. Will try again after we fix them.

On Mon, Feb 22, 2016 at 6:28 PM, Michael Armbrust 
wrote:

> I've kicked off the build.  Please be extra careful about merging into
> branch-1.6 until after the release.
>
> On Mon, Feb 22, 2016 at 10:24 AM, Michael Armbrust  > wrote:
>
>> I will cut the RC today.  Sorry for the delay!
>>
>> On Mon, Feb 22, 2016 at 5:19 AM, Patrick Woody 
>> wrote:
>>
>>> Hey Michael,
>>>
>>> Any update on a first cut of the RC?
>>>
>>> Thanks!
>>> -Pat
>>>
>>> On Mon, Feb 15, 2016 at 6:50 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 I'm not going to be able to do anything until after the Spark Summit,
 but I will kick off RC1 after that (end of week).  Get your patches in
 before then!

 On Sat, Feb 13, 2016 at 4:57 PM, Jong Wook Kim 
 wrote:

> Is 1.6.1 going to be ready this week? I see that the two last
> unresolved issues targeting 1.6.1 are fixed
>  now
> .
>
> On 3 February 2016 at 08:16, Daniel Darabos <
> daniel.dara...@lynxanalytics.com> wrote:
>
>>
>> On Tue, Feb 2, 2016 at 7:10 PM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> What about the memory leak bug?
 https://issues.apache.org/jira/browse/SPARK-11293
 Even after the memory rewrite in 1.6.0, it still happens in some
 cases.
 Will it be fixed for 1.6.1?

>>>
>>> I think we have enough issues queued up that I would not hold the
>>> release for that, but if there is a patch we should try and review it.  
>>> We
>>> can always do 1.6.2 when more issues have been resolved.  Is this an 
>>> actual
>>> issue that is affecting a production workload or are we concerned about 
>>> an
>>> edge case?
>>>
>>
>> The way we (Lynx Analytics) use RDDs, this affects almost everything
>> we do in production. Thankfully it does not cause any issues, it just 
>> logs
>> a lot of errors. I think the adverse effect may be that the memory 
>> manager
>> does not have a fully correct picture. But as long as the leak fits in 
>> the
>> "other" (unmanaged) memory fraction this will not cause issues. We don't
>> see this as an urgent issue. Thanks!
>>
>
>

>>>
>>
>


Re: Opening a JIRA for QuantileDiscretizer bug

2016-02-22 Thread Ted Yu
When you click on Create, you're brought to the 'Create Issue' dialog, where you
choose Project Spark.
The Component should be MLlib.

Please see also:
http://search-hadoop.com/m/q3RTtmsshe1W6cH22/spark+pull+template=pull+request+template


On Mon, Feb 22, 2016 at 6:45 PM, Pierson, Oliver C  wrote:

> Hello,
>
>   I've discovered a bug in the QuantileDiscretizer estimator.
> Specifically, for large DataFrames QuantileDiscretizer will only create one
> split (i.e. two bins).
>
>
> The error happens in lines 113 and 114 of QuantileDiscretizer.scala:
>
>
> val requiredSamples = math.max(numBins * numBins, 1)
>
> val fraction = math.min(requiredSamples / dataset.count(), 1.0)
>
>
> After the first line, requiredSamples is an Int.  Therefore, if
> requiredSamples < dataset.count() the integer division truncates to zero and
> fraction is always 0.0.
>
>
> The problem can be simply fixed by replacing the first line with:
>
>
>   val requiredSamples = math.max(numBins * numBins, 1.0)
>
>
> I've implemented this change in my fork and all tests passed (except for
> docker integration, but I think that's another issue).  I'm happy to submit
> a PR if it will ease someone else's workload.  However, I'm unsure of how
> to create a JIRA.  I've created an account on the issue tracker (
> issues.apache.org) but when I try to create an issue it asks me to choose
> a "Service Desk".  Which one should I be choosing?
>
>
> Thanks much,
>
> Oliver Pierson
>
>
>
>


Opening a JIRA for QuantileDiscretizer bug

2016-02-22 Thread Pierson, Oliver C
Hello,

  I've discovered a bug in the QuantileDiscretizer estimator.  Specifically, 
for large DataFrames QuantileDiscretizer will only create one split (i.e. two 
bins).


The error happens in lines 113 and 114 of QuantileDiscretizer.scala:


val requiredSamples = math.max(numBins * numBins, 1)

val fraction = math.min(requiredSamples / dataset.count(), 1.0)


After the first line, requiredSamples is an Int.  Therefore, if requiredSamples
< dataset.count() the integer division truncates to zero and fraction is always 0.0.


The problem can be simply fixed by replacing the first line with:


  val requiredSamples = math.max(numBins * numBins, 1.0)
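The arithmetic can be reproduced outside Spark. A minimal standalone Scala sketch (the numbers are made up for illustration; this is not the actual QuantileDiscretizer code):

  // Sketch only: hypothetical values, not the real QuantileDiscretizer inputs.
  val numBins = 100                                              // hypothetical bin count
  val datasetCount = 1000000L                                    // hypothetical dataset.count()

  // Current code: requiredSamples is an Int, so the division below is integer/long
  // division and truncates to 0 whenever requiredSamples < datasetCount.
  val requiredSamplesInt = math.max(numBins * numBins, 1)        // Int = 10000
  println(math.min(requiredSamplesInt / datasetCount, 1.0))      // 0.0 -> only one split

  // Proposed fix: make requiredSamples a Double so the division is floating point.
  val requiredSamplesDouble = math.max(numBins * numBins, 1.0)   // Double = 10000.0
  println(math.min(requiredSamplesDouble / datasetCount, 1.0))   // 0.01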


I've implemented this change in my fork and all tests passed (except for docker 
integration, but I think that's another issue).  I'm happy to submit a PR if it 
will ease someone else's workload.  However, I'm unsure of how to create a 
JIRA.  I've created an account on the issue tracker (issues.apache.org) but 
when I try to create an issue it asks me to choose a "Service Desk".  Which one 
should I be choosing?


Thanks much,

Oliver Pierson




Re: Spark 1.6.1

2016-02-22 Thread Michael Armbrust
I've kicked off the build.  Please be extra careful about merging into
branch-1.6 until after the release.

On Mon, Feb 22, 2016 at 10:24 AM, Michael Armbrust 
wrote:

> I will cut the RC today.  Sorry for the delay!
>
> On Mon, Feb 22, 2016 at 5:19 AM, Patrick Woody 
> wrote:
>
>> Hey Michael,
>>
>> Any update on a first cut of the RC?
>>
>> Thanks!
>> -Pat
>>
>> On Mon, Feb 15, 2016 at 6:50 PM, Michael Armbrust > > wrote:
>>
>>> I'm not going to be able to do anything until after the Spark Summit,
>>> but I will kick off RC1 after that (end of week).  Get your patches in
>>> before then!
>>>
>>> On Sat, Feb 13, 2016 at 4:57 PM, Jong Wook Kim 
>>> wrote:
>>>
 Is 1.6.1 going to be ready this week? I see that the two last
 unresolved issues targeting 1.6.1 are fixed
  now
 .

 On 3 February 2016 at 08:16, Daniel Darabos <
 daniel.dara...@lynxanalytics.com> wrote:

>
> On Tue, Feb 2, 2016 at 7:10 PM, Michael Armbrust <
> mich...@databricks.com> wrote:
>
>> What about the memory leak bug?
>>> https://issues.apache.org/jira/browse/SPARK-11293
>>> Even after the memory rewrite in 1.6.0, it still happens in some
>>> cases.
>>> Will it be fixed for 1.6.1?
>>>
>>
>> I think we have enough issues queued up that I would not hold the
>> release for that, but if there is a patch we should try and review it.  
>> We
>> can always do 1.6.2 when more issues have been resolved.  Is this an 
>> actual
>> issue that is affecting a production workload or are we concerned about 
>> an
>> edge case?
>>
>
> The way we (Lynx Analytics) use RDDs, this affects almost everything
> we do in production. Thankfully it does not cause any issues, it just logs
> a lot of errors. I think the adverse effect may be that the memory manager
> does not have a fully correct picture. But as long as the leak fits in the
> "other" (unmanaged) memory fraction this will not cause issues. We don't
> see this as an urgent issue. Thanks!
>


>>>
>>
>


Re: Spark not able to fetch events from Amazon Kinesis

2016-02-22 Thread Yash Sharma
Answering my own question -

I have got some success with the Spark Kinesis integration, and the key was
which overload of unionStreams.foreachRDD is used.

There are 2 overloads of foreachRDD available:
- unionStreams.foreachRDD { rdd => ... }  (takes a function RDD[T] => Unit)
- unionStreams.foreachRDD((rdd: RDD[Array[Byte]], time: Time) => ...)  (takes a function (RDD[T], Time) => Unit)

For some reason the first one is not able to get me the results but
changing to the second one fetches me the results as expected. Yet to
explore the reason.

Adding a code snippet below for reference.

Hope it helps someone :)

Thanks everyone for help.


> val kinesisStreams = (0 until numStreams).map {
>   count =>
> val stream = KinesisUtils.createStream(
>   ssc,
>   consumerName,
>   streamName,
>   endpointUrl,
>   regionName,
>   InitialPositionInStream.TRIM_HORIZON,
>   kinesisCheckpointInterval,
>   StorageLevel.MEMORY_AND_DISK_2
> )
>
> stream
> }
> val unionStreams = ssc.union(kinesisStreams)
>
> println(s"")
> println(s"Num of streams: ${numStreams}")
> println(s"")
>
> /*unionStreams.foreachRDD{ // Doesn't Work !!
>   rdd =>
> println(rdd.count)
> println("rdd isempty:" + rdd.isEmpty)
> }*/

unionStreams.foreachRDD ((rdd: RDD[Array[Byte]], time: Time) => { // Works,
> Yeah !!
>   println(rdd.count)
>   println("rdd isempty:" + rdd.isEmpty)
>   }
> )
>
> ssc.start()
> ssc.awaitTermination()
>
> 


On Sun, Jan 31, 2016 at 12:11 PM, Yash Sharma  wrote:
>
> Thanks Burak,
> By any chance were you able to work around these errors or get the setup
working ? Is there anything else that you might have tried ?
>
> Regards
>
> On Sun, Jan 31, 2016 at 4:41 AM, Burak Yavuz  wrote:
>>
>> Hi Yash,
>>
>> I've run into multiple problems due to version incompatibilities, either
due to protobuf or jackson. That may be your culprit. The problem is that
all failures by the Kinesis Client Lib are silent, and therefore don't show up
in the logs. It's very hard to debug those buggers.
>>
>> Best,
>> Burak
>>
>> On Sat, Jan 30, 2016 at 5:36 AM, Yash Sharma  wrote:
>>>
>>> Thanks Ted. Rebuilding would not be possible for the setup
unfortunately, so I just wanted to check if the version mismatch is the
primary issue here. I wanted to know if anyone has hit a similar issue
and how they solved it.
>>>
>>> Thanks
>>>
>>> On Sat, Jan 30, 2016 at 10:23 PM, Ted Yu  wrote:

 w.r.t. protobuf-java version mismatch, I wonder if you can rebuild
Spark with the following change (using maven):

 http://pastebin.com/fVQAYWHM

 Cheers

 On Sat, Jan 30, 2016 at 12:49 AM, Yash Sharma 
wrote:
>
> Hi All,
> I have a quick question if anyone has experienced this here.
>
> I have been trying to get Spark to read events from Kinesis recently but
am having problems receiving the events. While Spark is able to connect
to Kinesis and is able to get metadata from Kinesis, it's not able to get
events from it. It always fetches zero elements back.
>
> There are no errors, just empty results back. Spark is able to get
metadata (e.g. the number of shards in Kinesis, etc.).
>
> I have used these [1 & 2] guides for getting it working but have not
got much luck yet. I have also tried a couple of suggestions from SO [3]. The
cluster has sufficient resources/cores available.
>
> We have seen a protobuf version conflict between Spark and
Kinesis, which could also be a cause of this behavior. Spark uses
protobuf-java version 2.5.0 and kinesis probably uses
protobuf-java-2.6.1.jar.
>
> Just wondered if anyone has come across this behavior or, has got
spark working with kinesis.
>
> Have tried with Spark 1.5.0, Spark 1.6.0.
>
> Appreciate any pointers.
>
> Best Regards,
> Yash
>
> 1.
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
> 2.
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
>
> 3.
http://stackoverflow.com/questions/26941844/apache-spark-kinesis-sample-not-working
>

>>>
>>
>


Re: Spark 1.6.1

2016-02-22 Thread Michael Armbrust
I will cut the RC today.  Sorry for the delay!

On Mon, Feb 22, 2016 at 5:19 AM, Patrick Woody 
wrote:

> Hey Michael,
>
> Any update on a first cut of the RC?
>
> Thanks!
> -Pat
>
> On Mon, Feb 15, 2016 at 6:50 PM, Michael Armbrust 
> wrote:
>
>> I'm not going to be able to do anything until after the Spark Summit, but
>> I will kick off RC1 after that (end of week).  Get your patches in before
>> then!
>>
>> On Sat, Feb 13, 2016 at 4:57 PM, Jong Wook Kim 
>> wrote:
>>
>>> Is 1.6.1 going to be ready this week? I see that the two last unresolved
>>> issues targeting 1.6.1 are fixed
>>>  now
>>> .
>>>
>>> On 3 February 2016 at 08:16, Daniel Darabos <
>>> daniel.dara...@lynxanalytics.com> wrote:
>>>

 On Tue, Feb 2, 2016 at 7:10 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

> What about the memory leak bug?
>> https://issues.apache.org/jira/browse/SPARK-11293
>> Even after the memory rewrite in 1.6.0, it still happens in some
>> cases.
>> Will it be fixed for 1.6.1?
>>
>
> I think we have enough issues queued up that I would not hold the
> release for that, but if there is a patch we should try and review it.  We
> can always do 1.6.2 when more issues have been resolved.  Is this an 
> actual
> issue that is affecting a production workload or are we concerned about an
> edge case?
>

 The way we (Lynx Analytics) use RDDs, this affects almost everything we
 do in production. Thankfully it does not cause any issues, it just logs a
 lot of errors. I think the adverse effect may be that the memory manager
 does not have a fully correct picture. But as long as the leak fits in the
 "other" (unmanaged) memory fraction this will not cause issues. We don't
 see this as an urgent issue. Thanks!

>>>
>>>
>>
>


Re: Re: a new FileFormat 5x~100x faster than parquet

2016-02-22 Thread Ted Yu
The referenced benchmark is in Chinese. Please provide an English version so
that more people can understand.

For item 7, it looks like the speed of ingest is much slower compared to using
Parquet.

Cheers

On Mon, Feb 22, 2016 at 6:12 AM, 开心延年  wrote:

> 1. ya100 is not only the inverted index, but also includes the TOP N sort
> lazy read, and also includes labels.
> 2. our test of ya100 against parquet is at this link:
> https://github.com/ycloudnet/ya100/blob/master/v1.0.8/ya100%E6%80%A7%E8%83%BD%E6%B5%8B%E8%AF%95%E6%8A%A5%E5%91%8A.docx?raw=true
>
> 3. you are right, the load into ya100 takes longer than parquet, but the query
> time is very fast.
> 4. the code is already available, but I'm sorry, ya100 and ydb are a
> commercial product now by our company (called ycloud), but I think if apache
> likes it, the source code is not a problem.
>
> our inverted index is indeed apache lucene - but quite a different usage
> than solr or es.
> we can make a test on ya100 (we support a java jar); it is really much faster
> than parquet, especially on sort, group by, filter and so on.
>
>
> provide more information? what other information do you want? I can
> provide it tomorrow.
>
> -- Original Message --
> *From:* "Gavin Yue";;
> *Sent:* Monday, Feb 22, 2016, 9:33 PM
> *To:* "开心延年";
> *Cc:* "Akhil Das"; "user"<
> u...@spark.apache.org>; "dev";
> *Subject:* Re: Re: a new FileFormat 5x~100x faster than parquet
>
> I recommend you provide more information. Using an inverted index certainly
> speeds up the query time if it hits the index, but it would take longer to
> create and insert.
>
> Is the source code not available at this moment?
>
> Thanks
> Gavin
>
> On Feb 22, 2016, at 20:27, 开心延年  wrote:
>
> if apache likes this project, of course we will provide the source code.
>
> But if apache dislikes the project, we will continue to improve the project
> by ourselves.
>
> ya100 and ydb can process a maximum of 180 billion rows of data per day for nearly
> realtime import.
>
> because of the index, we make the search return in 10 seconds over 1800 billion
> (10 days of) rows of data.
>
>
> -- Original Message --
> *From:* "Akhil Das";;
> *Sent:* Monday, Feb 22, 2016, 8:42 PM
> *To:* "开心延年";
> *Cc:* "user"; "dev";
> *Subject:* Re: a new FileFormat 5x~100x faster than parquet
>
> Would be good to see the source code and the documentation in English.
>
> Thanks
> Best Regards
>
> On Mon, Feb 22, 2016 at 4:44 PM, 开心延年  wrote:
>
>> Ya100 is a FileFormat 5x~100x faster than parquet.
>> we can get ya100 from this link
>> https://github.com/ycloudnet/ya100/tree/master/v1.0.8
>>
>>
>> 
>>
>> 1. we used the inverted index, so we skip the rows that we don't need.
>>
>>   for example  the trade log search SQL
>>
>>
>>
>> select
>>
>> (1)
>> phonenum,usernick,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_day,amtlong
>>
>>
>> from spark_txt where
>>
>> (2)tradeid=' 2014012213870282671'
>>
>> limit 10;
>>
>>
>>  this sql is composed of two parts
>>
>>  (1) part 1 returns the result, which has 9 columns
>>
>>  (2) part 2 is the filter condition: filter by tradeid
>>
>>
>>
>>   let's guess which plan is faster
>>
>>  plan A: first read all 9 columns of the result, then filter by tradeid
>>
>>  plan B: first filter by tradeid, then read the 9 columns only for the
>> matching rows.
>>
>>
>> Ya100 chooses plan B
>>
>>
>>  contrast the performance of Ya100's index with parquet
>>
>>
>>
>>
>>
>> 2. TOP N sort: the non-sort columns are not read until the last step
>>
>>   for example  we sort by the logtime
>>
>> select
>>
>> (1)
>> phonenum,usernick,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_day,amtlong
>>
>>
>> from spark_txt
>>
>> (2)order by logtime desc
>>
>> limit 10;
>>
>>
>>   this sql is composed of two parts
>>
>>  (1) part 1 returns the result, which has 9 columns
>>
>>  (2) part 2 is the column we need to sort by
>>
>>
>>
>>   let's guess which plan is faster
>>  plan A: first read all 9 columns of the result, then sort by logtime
>>
>>  plan B: first sort by logtime, then read the 9 columns only for the
>> matching rows.
>>
>>
>>Ya100 chooses plan B
>>
>>
>>  contrast the performance of Ya100's lazy read with parquet
>>
>> 3.we used label instead of the original value for grouping and sorting
>>
>> 
>>
>> 1). In the general situation, the data has a lot of repeated values, for example the
>> sex field, the age field.
>> 2). If we store the original value, that will waste a lot of storage.
>> So we make a small modification to the original value: we additionally add a new field
>> called label.
>> We make a unique sorted list of the values per field, and then give each term a unique
>> 

Re: Re: a new FileFormat 5x~100x faster than parquet

2016-02-22 Thread 开心延年
1. ya100 is not only the inverted index, but also includes the TOP N sort lazy
read, and also includes labels.
2. our test of ya100 against parquet is at this link:
https://github.com/ycloudnet/ya100/blob/master/v1.0.8/ya100%E6%80%A7%E8%83%BD%E6%B5%8B%E8%AF%95%E6%8A%A5%E5%91%8A.docx?raw=true

3. you are right, the load into ya100 takes longer than parquet, but the query time
is very fast.
4. the code is already available, but I'm sorry, ya100 and ydb are a
commercial product now by our company (called ycloud), but I think if apache
likes it, the source code is not a problem.

our inverted index is indeed apache lucene - but quite a different usage than
solr or es.
we can make a test on ya100 (we support a java jar); it is really much faster than
parquet, especially on sort, group by, filter and
so on.


provide more information? what other information do you want? I can
provide it tomorrow.


-- Original Message --
From: "Gavin Yue";;
Sent: Monday, Feb 22, 2016, 9:33 PM
To: "开心延年";
Cc: "Akhil Das"; "user";
"dev";
Subject: Re: Re: a new FileFormat 5x~100x faster than parquet




I recommend you provide more information. Using an inverted index certainly speeds
up the query time if it hits the index, but it would take longer to create and
insert.


Is the source code not available at this moment? 


Thanks 
Gavin 

On Feb 22, 2016, at 20:27, 开心延年 wrote:


if apache likes this project, of course we will provide the source code.

But if apache dislikes the project, we will continue to improve the project by
ourselves.

ya100 and ydb can process a maximum of 180 billion rows of data per day for nearly
realtime import.

because of the index, we make the search return in 10 seconds over 1800 billion (10 days of)
rows of data.




-- Original Message --
From: "Akhil Das";;
Sent: Monday, Feb 22, 2016, 8:42 PM
To: "开心延年";
Cc: "user"; "dev";
Subject: Re: a new FileFormat 5x~100x faster than parquet



Would be good to see the source code and the documentation in English.


Thanks
Best Regards



 
On Mon, Feb 22, 2016 at 4:44 PM, 开心延年 wrote:
Ya100 is a FileFormat 5x~100x faster than parquet.
we can get ya100 from this link 
https://github.com/ycloudnet/ya100/tree/master/v1.0.8




 
1. we used the inverted index, so we skip the rows that we don't need.

  for example  the trade log search SQL

 

select 
 
(1) phonenum,usernick,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_day,amtlong
  
 
from spark_txt where   
 
(2) tradeid=' 2014012213870282671'
 
limit 10;  
 





 this sql is composed of two parts

 (1) part 1 returns the result, which has 9 columns

 (2) part 2 is the filter condition: filter by tradeid



  let's guess which plan is faster

 plan A: first read all 9 columns of the result, then filter by tradeid

 plan B: first filter by tradeid, then read the 9 columns only for the matching rows.




Ya100 chooses plan B
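The difference between the two plans can be sketched with plain Scala collections (an illustration only, not ya100 or Spark code): an inverted index maps a tradeid to the positions of the matching rows, so plan B materializes the other columns only for those rows.

  // Toy illustration: column-wise storage plus an inverted index on tradeid.
  case class Columns(tradeid: Array[String], phonenum: Array[String], amtlong: Array[Long])

  // plan B: look up the matching row ids first, then read the other columns for the hits only.
  def planB(cols: Columns, index: Map[String, Seq[Int]], wanted: String): Seq[(String, Long)] =
    index.getOrElse(wanted, Seq.empty).map(i => (cols.phonenum(i), cols.amtlong(i)))

  // plan A: read every row's columns first, then filter afterwards.
  def planA(cols: Columns, wanted: String): Seq[(String, Long)] =
    cols.tradeid.indices
      .map(i => (cols.tradeid(i), cols.phonenum(i), cols.amtlong(i)))
      .collect { case (t, p, a) if t == wanted => (p, a) }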




 contrast the performance of Ya100's index with parquet







 2. TOP N sort: the non-sort columns are not read until the last step


  for example  we sort by the logtime

 

select 
 
(1) phonenum,usernick,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_day,amtlong
  
 
from spark_txt 
 
(2) order by logtime desc
 
limit 10;  
 


  this sql is composed of two parts
 (1) part 1 returns the result, which has 9 columns

 (2) part 2 is the column we need to sort by



  let's guess which plan is faster
 plan A: first read all 9 columns of the result, then sort by logtime
 plan B: first sort by logtime, then read the 9 columns only for the matching rows.




   Ya100 chooses plan B




 contrast the performance of Ya100's lazy read with parquet

3.we used label instead of the original value for grouping and sorting



1). In the general situation, the data has a lot of repeated values, for example the sex
field, the age field.
2). If we store the original value, that will waste a lot of storage.
So we make a small modification to the original value: we additionally add a new field called
label.
We make a unique sorted list of the values per field, and then give each term a unique number
from beginning to end.
3). We use the number value (we call it a label) instead of the original value. Labels are
stored with a fixed length, so the file can be read by random access.
4). The labels' order is the same as the dictionary order, so if we do some
calculation like order by or group by, we only need to read the label; we don't
need to read the original value.
5). Some fields, like the sex field, only have 2 different values, so we
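A rough sketch of the label idea in plain Scala (an illustration only, not ya100 code): each distinct value gets an integer label assigned in dictionary order, so sorting and grouping can run on the compact, fixed-width labels, and the original strings are only looked up at the end.

  // Illustration only: dictionary/label encoding where label order == dictionary order.
  val values  = Array("male", "female", "female", "male", "female")
  val dict    = values.distinct.sorted                  // Array("female", "male")
  val toLabel = dict.zipWithIndex.toMap                 // "female" -> 0, "male" -> 1
  val labels  = values.map(toLabel)                     // Array(1, 0, 0, 1, 0)

  // Sorting and grouping operate on the small labels only...
  val sortedLabels = labels.sorted                      // Array(0, 0, 0, 1, 1)
  val counts = labels.groupBy(identity).map { case (l, ls) => (l, ls.length) }

  // ...and the originals are materialized only for the final output.
  println(sortedLabels.map(i => dict(i)).mkString(", "))                     // female, female, female, male, male
  println(counts.map { case (l, n) => s"${dict(l)} -> $n" }.mkString(", "))  // e.g. male -> 2, female -> 3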

Re: Spark 1.6.1

2016-02-22 Thread Patrick Woody
Hey Michael,

Any update on a first cut of the RC?

Thanks!
-Pat

On Mon, Feb 15, 2016 at 6:50 PM, Michael Armbrust 
wrote:

> I'm not going to be able to do anything until after the Spark Summit, but
> I will kick off RC1 after that (end of week).  Get your patches in before
> then!
>
> On Sat, Feb 13, 2016 at 4:57 PM, Jong Wook Kim  wrote:
>
>> Is 1.6.1 going to be ready this week? I see that the two last unresolved
>> issues targeting 1.6.1 are fixed
>>  now
>> .
>>
>> On 3 February 2016 at 08:16, Daniel Darabos <
>> daniel.dara...@lynxanalytics.com> wrote:
>>
>>>
>>> On Tue, Feb 2, 2016 at 7:10 PM, Michael Armbrust >> > wrote:
>>>
 What about the memory leak bug?
> https://issues.apache.org/jira/browse/SPARK-11293
> Even after the memory rewrite in 1.6.0, it still happens in some cases.
> Will it be fixed for 1.6.1?
>

 I think we have enough issues queued up that I would not hold the
 release for that, but if there is a patch we should try and review it.  We
 can always do 1.6.2 when more issues have been resolved.  Is this an actual
 issue that is affecting a production workload or are we concerned about an
 edge case?

>>>
>>> The way we (Lynx Analytics) use RDDs, this affects almost everything we
>>> do in production. Thankfully it does not cause any issues, it just logs a
>>> lot of errors. I think the adverse effect may be that the memory manager
>>> does not have a fully correct picture. But as long as the leak fits in the
>>> "other" (unmanaged) memory fraction this will not cause issues. We don't
>>> see this as an urgent issue. Thanks!
>>>
>>
>>
>


Builds are failing

2016-02-22 Thread Iulian Dragoș
Just in case you missed this:
https://issues.apache.org/jira/browse/SPARK-13431

Builds are failing with 'Method code too large' in the "shading" step with
Maven.

iulian

-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Re: How do we run that PR auto-close script again?

2016-02-22 Thread Sean Owen
That's what I'm talking about, yes, but I'm looking for the actual
script. I'm sure there was a discussion about where it was and how to
run it somewhere. Really just looking to have it run again.

On Mon, Feb 22, 2016 at 10:44 AM, Akhil Das  wrote:
> This?
> http://apache-spark-developers-list.1001551.n3.nabble.com/Automated-close-of-PR-s-td15862.html
>
> Thanks
> Best Regards
>
> On Mon, Feb 22, 2016 at 2:47 PM, Sean Owen  wrote:
>>
>> I know Patrick told us at some point, but I can't find the email or
>> wiki that describes how to run the script that auto-closes PRs with
>> "do you mind closing this PR". Does anyone know? I think it's been a
>> long time since it was run.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How do we run that PR auto-close script again?

2016-02-22 Thread Akhil Das
This?
http://apache-spark-developers-list.1001551.n3.nabble.com/Automated-close-of-PR-s-td15862.html

Thanks
Best Regards

On Mon, Feb 22, 2016 at 2:47 PM, Sean Owen  wrote:

> I know Patrick told us at some point, but I can't find the email or
> wiki that describes how to run the script that auto-closes PRs with
> "do you mind closing this PR". Does anyone know? I think it's been a
> long time since it was run.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


How do we run that PR auto-close script again?

2016-02-22 Thread Sean Owen
I know Patrick told us at some point, but I can't find the email or
wiki that describes how to run the script that auto-closes PRs with
"do you mind closing this PR". Does anyone know? I think it's been a
long time since it was run.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org