Re: [Spark SQL] Making InferSchema and JacksonParser public

2017-01-18 Thread Brian Hong
Yes, that is the option I took when implementing this under Spark 1.4.  But
every time there is a major update to Spark, I have had to re-copy the needed
parts, which is very time-consuming.

The reason is that InferSchema and JacksonParser use many more Spark-internal
methods, which makes them very hard to copy and maintain.

Thanks!

On Thu, Jan 19, 2017 at 2:41 AM Reynold Xin  wrote:

> That is internal, but the amount of code is not a lot. Can you just copy
> the relevant classes over to your project?
>
> On Wed, Jan 18, 2017 at 5:52 AM Brian Hong 
> wrote:
>
> I work for a mobile game company. I'm solving a simple question: "Can we
> efficiently/cheaply query for the log of a particular user within given
> date period?"
>
> I've created a special JSON text-based file format that has these traits:
>  - Snappy compressed, saved in AWS S3
>  - Partitioned by date. ie. 2017-01-01.sz, 2017-01-02.sz, ...
>  - Sorted by a primary key (log_type) and a secondary key (user_id),
> Snappy block compressed by 5MB blocks
>  - Blocks are indexed with primary/secondary key in file 2017-01-01.json
>  - Efficient block based random access on primary key (log_type) and
> secondary key (user_id) using the index
>
> I've created a Spark SQL DataFrame relation that can query this file
> format.  Since the schema of each log type is fairly consistent, I've
> reused the `InferSchema.inferSchema` method and `JacksonParser` in the Spark
> SQL code to support structured querying.  I've also implemented filter
> push-down to optimize the file access.
>
> It is very fast when querying for a single user or querying for a single
> log type with a sampling ratio of 1 to 1 compared to parquet file
> format.  (We do use parquet for some log types when we need batch analysis.)
>
> One of the problems we face is that the methods we use above are private
> API.  So we are forced to use hacks to use these methods.  (Things like
> copying the code or using the org.apache.spark.sql package namespace)
>
> I've been following Spark SQL code since 1.4, and the JSON schema
> inferencing code and JacksonParser seem to be relatively stable recently.
> Can the core-devs make these APIs public?
>
> We are willing to open source this file format because it is very
> excellent for archiving user related logs in S3.  The key dependency of
> private APIs in Spark SQL is the main hurdle in making this a reality.
>
> Thank you for reading!
>
>
>
>


Re: GraphX-related "open" issues

2017-01-18 Thread Dongjin Lee
Hi all,

I am currently working on SPARK-15880[^1] and am also interested in
SPARK-7244[^2] and SPARK-7257[^3]. In fact, SPARK-7244 and SPARK-7257
are of some importance to the graph analysis field.
Could you make them an exception? Since I am working on graph analysis, I
would like to take them.

If needed, I can take SPARK-10335 and SPARK-8497 after them.

Thanks,
Dongjin

On Wed, Jan 18, 2017 at 2:40 AM, Sean Owen  wrote:

> WontFix or Later is fine. There's not really any practical distinction. I
> figure that if something times out and is closed, it's very unlikely to be
> looked at again. Therefore marking it as something to do 'later' seemed
> less accurate.
>
> On Tue, Jan 17, 2017 at 5:30 PM Takeshi Yamamuro 
> wrote:
>
>> Thanks for your comment!
>> I'm thinking I'll set "Won't Fix", though "Later" is also okay.
>> But I re-checked "Contributing to JIRA Maintenance" in the contribution
>> guide (http://spark.apache.org/contributing.html) and
>> I couldn't find any policy about setting "Later".
>> So, IMO it's okay to set "Won't Fix" for now, and those who'd like to make
>> PRs can feel free to (re-)open the tickets.
>>
>>
>> On Wed, Jan 18, 2017 at 1:48 AM, Dongjoon Hyun 
>> wrote:
>>
>> Hi, Takeshi.
>>
>> > So, IMO it seems okay to close tickets about "Improvement" and "New
>> Feature" for now.
>>
>> I'm just wondering about what kind of field value you want to fill in the
>> `Resolution` field for those issues.
>>
>> Maybe, 'Later'? Or, 'Won't Fix'?
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


-- 
*Dongjin Lee*

*Software developer in Line+.
So interested in massive-scale machine learning.
facebook: www.facebook.com/dongjin.lee.kr
linkedin: kr.linkedin.com/in/dongjinleekr
github: github.com/dongjinleekr
twitter: www.twitter.com/dongjinleekr*


Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Michael Armbrust
+1, we should just fix the error to explain why months aren't allowed and
suggest that you manually specify some number of days.
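
For reference, a minimal sketch of that workaround — approximating a month with
an explicit number of days. The 30-day figure and the `spark` session/implicits
are illustrative assumptions, not anything prescribed in this thread:

import org.apache.spark.sql.functions.window
import spark.implicits._  // assumes an existing SparkSession named `spark`

// Approximate "1 month" with a fixed number of days, since month-based
// intervals are rejected by the time-window validation discussed below.
val monthly = Seq("2017-01-01").toDF("date")
  .groupBy(window($"date", "30 days"))  // 30 is an arbitrary stand-in for a month
  .count()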

On Wed, Jan 18, 2017 at 9:52 AM, Maciej Szymkiewicz 
wrote:

> Thanks for the response Burak,
>
> As any sane person I try to steer away from the objects which have both
> calendar and unsafe in their fully qualified names but if there is no
> bigger picture I missed here I would go with 1 as well. And of course fix
> the error message. I understand this has been introduced with structured
> streaming in mind, but it is a useful feature in general, not only in high
> precision scale. To be honest I would love to see some generalized version
> which could be used (I mean without hacking) with arbitrary numeric
> sequence. It could address at least some scenarios in which people try to
> use window functions without PARTITION BY clause and fail miserably.
>
> Regarding ambiguity... Sticking with days doesn't really resolve the
> problem, does it? If one were to nitpick it doesn't look like this
> implementation even touches all the subtleties of DST or leap second.
>
>
>
>
> On 01/18/2017 05:52 PM, Burak Yavuz wrote:
>
> Hi Maciej,
>
> I believe it would be useful to either fix the documentation or fix the
> implementation. I'll leave it to the community to comment on. The code
> right now disallows intervals provided in months and years, because they
> are not a "consistently" fixed amount of time. A month can be 28, 29, 30,
> or 31 days. A year is 12 months for sure, but is it 360 days (sometimes
> used in finance), 365 days or 366 days?
>
> Therefore we could either:
>   1) Allow windowing when intervals are given in days and less, even
> though it could be 365 days, and fix the documentation.
>   2) Explicitly disallow it as there may be a lot of data for a given
> window, but partial aggregations should help with that.
>
> My thoughts are to go with 1. What do you think?
>
> Best,
> Burak
>
> On Wed, Jan 18, 2017 at 10:18 AM, Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
>> Hi,
>>
>> Can I ask for some clarifications regarding intended behavior of window /
>> TimeWindow?
>>
>> PySpark documentation states that "Windows in the order of months are not
>> supported". This is further confirmed by the checks in
>> TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).
>>
>> Surprisingly enough we can pass interval much larger than a month by
>> expressing interval in days or another unit of a higher precision. So this
>> fails:
>>
>> Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))
>>
>> while following is accepted:
>>
>> Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))
>>
>> with results which look sensible at first glance.
>>
>> Is it a matter of a faulty validation logic (months will be assigned only
>> if there is a match against years or months https://git.io/vMPdi) or
>> expected behavior and I simply misunderstood the intentions?
>>
>> --
>> Best,
>> Maciej
>>
>>
>
>


Re: Possible bug - Java iterator/iterable inconsistency

2017-01-18 Thread Sean Owen
Hm. Unless I am also totally missing or forgetting something, I think
you're right. The equivalent in PairRDDFunctions.scala operates on a
function from T to TraversableOnce[U], and a TraversableOnce is most like
java.util.Iterator.

You can work around it by wrapping it in a faked IteratorIterable.
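
For anyone hitting this, a rough Scala sketch of such a wrapper (the class name
and the single-traversal caveat are mine, not anything that ships with Spark):

import java.{lang => jl, util => ju}

// A "faked" Iterable that simply hands back the one iterator it was given.
// Caveat: java.lang.Iterable promises iterator() can be called repeatedly,
// but this wrapper can only be traversed once.
class IteratorIterable[T](underlying: ju.Iterator[T]) extends jl.Iterable[T] {
  override def iterator(): ju.Iterator[T] = underlying
}

// Usage idea: inside the Function passed to JavaPairRDD.flatMapValues,
// return new IteratorIterable(yourIterator) instead of a real collection.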

I think this is fixable in the API by deprecating this method and adding a
new one that takes a FlatMapFunction. We'd have to triple-check in a test
that this doesn't cause an API compatibility problem with respect to Java 8
lambdas, but if that's settled, I think this could be fixed without
breaking the API.

On Wed, Jan 18, 2017 at 8:50 PM Asher Krim  wrote:

> In Spark 2 + Java + RDD api, the use of iterables was replaced with
> iterators. I just encountered an inconsistency in `flatMapValues` that may
> be a bug:
>
> `flatMapValues` (
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala#L677)
> takes a `FlatMapFunction` (
> https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/java/org/apache/spark/api/java/function/FlatMapFunction.java
> )
>
> The problem is that `FlatMapFunction` was changed to return an iterator,
> but `rdd.flatMapValues` still expects an iterable. Am I using these
> constructs correctly? Is there a workaround other than converting the
> iterator to an iterable outside of the function?
>
> Thanks,
> --
> Asher Krim
> Senior Software Engineer
>


Possible bug - Java iterator/iterable inconsistency

2017-01-18 Thread Asher Krim
In Spark 2 + Java + RDD api, the use of iterables was replaced with
iterators. I just encountered an inconsistency in `flatMapValues` that may
be a bug:

`flatMapValues` (https://github.com/apache/spark/blob/master/core/src/
main/scala/org/apache/spark/api/java/JavaPairRDD.scala#L677) takes
a `FlatMapFunction` (
https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/java/org/apache/spark/api/java/function/FlatMapFunction.java
)

The problem is that `FlatMapFunction` was changed to return an iterator,
but `rdd.flatMapValues` still expects an iterable. Am I using these
constructs correctly? Is there a workaround other than converting the
iterator to an iterable outside of the function?

Thanks,
-- 
Asher Krim
Senior Software Engineer


Re: clientMode in RpcEnv.create in Spark on YARN vs general case (driver vs executors)?

2017-01-18 Thread Marcelo Vanzin
On Wed, Jan 18, 2017 at 1:29 AM, Jacek Laskowski  wrote:
> I'm trying to get the gist of clientMode input parameter for
> RpcEnv.create [1]. It is disabled (i.e. false) by default.

"clientMode" means whether the RpcEnv only opens external connections
(client) or also accepts incoming connections. From that you should be
able to understand why it's true or false in the cases you looked at.


-- 
Marcelo




Re: can someone review my PR?

2017-01-18 Thread Marcelo Vanzin
On Wed, Jan 18, 2017 at 6:16 AM, Steve Loughran  wrote:
> it's failing on the dependency check as the dependencies have changed.
> that's what it's meant to do. should I explicitly be changing the values so
> that the build doesn't notice the change?

Yes. There's no automated way to do that, intentionally.

-- 
Marcelo




Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Maciej Szymkiewicz
Thanks for the response Burak,

Like any sane person, I try to steer away from objects which have both
calendar and unsafe in their fully qualified names, but if there is no
bigger picture I missed here, I would go with 1 as well. And of course
fix the error message. I understand this has been introduced with
structured streaming in mind, but it is a useful feature in general,
not only at high-precision time scales. To be honest, I would love to see
some generalized version which could be used (I mean without hacking) with
an arbitrary numeric sequence. It could address at least some scenarios in
which people try to use window functions without a PARTITION BY clause and
fail miserably.

Regarding ambiguity... Sticking with days doesn't really resolve the
problem, does it? If one were to nitpick, it doesn't look like this
implementation even touches all the subtleties of DST or leap seconds.



On 01/18/2017 05:52 PM, Burak Yavuz wrote:
> Hi Maciej,
>
> I believe it would be useful to either fix the documentation or fix
> the implementation. I'll leave it to the community to comment on. The
> code right now disallows intervals provided in months and years,
> because they are not a "consistently" fixed amount of time. A month
> can be 28, 29, 30, or 31 days. A year is 12 months for sure, but is it
> 360 days (sometimes used in finance), 365 days or 366 days? 
>
> Therefore we could either:
>   1) Allow windowing when intervals are given in days and less, even
> though it could be 365 days, and fix the documentation.
>   2) Explicitly disallow it as there may be a lot of data for a given
> window, but partial aggregations should help with that.
>
> My thoughts are to go with 1. What do you think?
>
> Best,
> Burak
>
> On Wed, Jan 18, 2017 at 10:18 AM, Maciej Szymkiewicz
> > wrote:
>
> Hi,
>
> Can I ask for some clarifications regarding intended behavior of
> window / TimeWindow?
>
> PySpark documentation states that "Windows in the order of months
> are not supported". This is further confirmed by the checks in
> TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).
>
> Surprisingly enough we can pass interval much larger than a month
> by expressing interval in days or another unit of a higher
> precision. So this fails:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))
>
> while following is accepted:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))
>
> with results which look sensible at first glance.
>
> Is it a matter of a faulty validation logic (months will be
> assigned only if there is a match against years or months
> https://git.io/vMPdi) or expected behavior and I simply
> misunderstood the intentions?
>
> -- 
> Best,
> Maciej
>
>



Re: [Spark SQL] Making InferSchema and JacksonParser public

2017-01-18 Thread Reynold Xin
That is internal, but the amount of code is not a lot. Can you just copy
the relevant classes over to your project?

On Wed, Jan 18, 2017 at 5:52 AM Brian Hong 
wrote:

> I work for a mobile game company. I'm solving a simple question: "Can we
> efficiently/cheaply query for the log of a particular user within given
> date period?"
>
> I've created a special JSON text-based file format that has these traits:
>  - Snappy compressed, saved in AWS S3
>  - Partitioned by date. ie. 2017-01-01.sz, 2017-01-02.sz, ...
>  - Sorted by a primary key (log_type) and a secondary key (user_id),
> Snappy block compressed by 5MB blocks
>  - Blocks are indexed with primary/secondary key in file 2017-01-01.json
>  - Efficient block based random access on primary key (log_type) and
> secondary key (user_id) using the index
>
> I've created a Spark SQL DataFrame relation that can query this file
> format.  Since the schema of each log type is fairly consistent, I've
> reused the `InferSchema.inferSchema` method and `JacksonParser` in the Spark
> SQL code to support structured querying.  I've also implemented filter
> push-down to optimize the file access.
>
> It is very fast when querying for a single user or querying for a single
> log type with a sampling ratio of 1 to 1 compared to parquet file
> format.  (We do use parquet for some log types when we need batch analysis.)
>
> One of the problems we face is that the methods we use above are private
> API.  So we are forced to use hacks to use these methods.  (Things like
> copying the code or using the org.apache.spark.sql package namespace)
>
> I've been following Spark SQL code since 1.4, and the JSON schema
> inferencing code and JacksonParser seem to be relatively stable recently.
> Can the core-devs make these APIs public?
>
> We are willing to open source this file format because it is very
> excellent for archiving user related logs in S3.  The key dependency of
> private APIs in Spark SQL is the main hurdle in making this a reality.
>
> Thank you for reading!
>
>
>
>


Re: [Spark SQL] Making InferSchema and JacksonParser public

2017-01-18 Thread Michael Allman
Personally I'd love to see some kind of pluggability, configurability in the 
JSON schema parsing, maybe as an option in the DataFrameReader. Perhaps you can 
propose an API?

> On Jan 18, 2017, at 5:51 AM, Brian Hong  wrote:
> 
> I work for a mobile game company. I'm solving a simple question: "Can we 
> efficiently/cheaply query for the log of a particular user within given date 
> period?"
> 
> I've created a special JSON text-based file format that has these traits:
>  - Snappy compressed, saved in AWS S3
>  - Partitioned by date. ie. 2017-01-01.sz , 
> 2017-01-02.sz , ...
>  - Sorted by a primary key (log_type) and a secondary key (user_id), Snappy 
> block compressed by 5MB blocks
>  - Blocks are indexed with primary/secondary key in file 2017-01-01.json
>  - Efficient block based random access on primary key (log_type) and 
> secondary key (user_id) using the index
> 
> I've created a Spark SQL DataFrame relation that can query this file format.  
> Since the schema of each log type is fairly consistent, I've reused the 
> `InferSchema.inferSchema` method and `JacksonParser` in the Spark SQL code to
> support structured querying.  I've also implemented filter push-down to 
> optimize the file access.
> 
> It is very fast when querying for a single user or querying for a single log 
> type with a sampling ratio of 1 to 1 compared to parquet file format.  
> (We do use parquet for some log types when we need batch analysis.)
> 
> One of the problems we face is that the methods we use above are private API. 
>  So we are forced to use hacks to use these methods.  (Things like copying 
> the code or using the org.apache.spark.sql package namespace)
> 
> I've been following Spark SQL code since 1.4, and the JSON schema inferencing 
> code and JacksonParser seem to be relatively stable recently.  Can the 
> core-devs make these APIs public?
> 
> We are willing to open source this file format because it is very excellent 
> for archiving user related logs in S3.  The key dependency of private APIs in 
> Spark SQL is the main hurdle in making this a reality.
> 
> Thank you for reading!
> 



Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-18 Thread Michael Allman
Based on what you've described, I think you should be able to use Spark's 
parquet reader plus partition pruning in 2.1.
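
For completeness, a minimal sketch of what that setup looks like on Spark 2.1
(the session configuration is illustrative; rajub.dummy is the table from the
thread below):

import org.apache.spark.sql.SparkSession

// Read the Hive Parquet table through Spark's native reader with metastore
// conversion enabled, then inspect the plan for pruned partition paths.
val spark = SparkSession.builder()
  .appName("partition-pruning-check")
  .config("spark.sql.hive.convertMetastoreParquet", "true")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.sql("select count(1) from rajub.dummy where year = '2017'")
df.explain()  // with pruning, the scan should list only year=2017 directories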

> On Jan 17, 2017, at 10:44 PM, Raju Bairishetti  wrote:
> 
> Thanks for the detailed explanation. Is it completely fixed in spark-2.1.0?
> 
>   We are giving the Spark driver very high memory to avoid OOM (heap space /
> GC overhead limit) errors in the Spark app. But when we run two or three jobs
> together, they bring down the Hive metastore. We had to forcefully
> drop older partitions to avoid frequent outages of the Hive metastore.
> 
> 
> On Wed, Jan 18, 2017 at 2:09 PM, Michael Allman  > wrote:
> I think I understand. Partition pruning for the case where 
> spark.sql.hive.convertMetastoreParquet is true was not added to Spark until 
> 2.1.0. I think that in previous versions it only worked when 
> spark.sql.hive.convertMetastoreParquet is false. Unfortunately, that 
> configuration gives you data decoding errors. If it's possible for you to 
> write all of your data with Hive, then you should be able to read it without 
> decoding errors and with partition pruning turned on. Another possibility is 
> running your Spark app with a very large maximum heap configuration, like 8g 
> or even 16g. However, loading all of that partition metadata can be quite 
> slow for very large tables. I'm sorry I can't think of a better solution for 
> you.
> 
> Michael
> 
> 
> 
> 
>> On Jan 17, 2017, at 8:59 PM, Raju Bairishetti > > wrote:
>> 
>> Tested on both 1.5.2 and 1.6.1.
>> 
>> On Wed, Jan 18, 2017 at 12:52 PM, Michael Allman > > wrote:
>> What version of Spark are you running?
>> 
>>> On Jan 17, 2017, at 8:42 PM, Raju Bairishetti >> > wrote:
>>> 
>>>  describe dummy;
>>> 
>>> OK
>>> 
>>> sample  string
>>>
>>> year    string
>>>
>>> month   string
>>>
>>> # Partition Information
>>>
>>> # col_name    data_type   comment
>>>
>>> year    string
>>>
>>> month   string
>>> 
>>> 
>>> 
>>> val df = sqlContext.sql("select count(1) from rajub.dummy where 
>>> year='2017'")
>>> 
>>> df: org.apache.spark.sql.DataFrame = [_c0: bigint]
>>> 
>>> scala> df.explain
>>> 
>>> == Physical Plan ==
>>> 
>>> TungstenAggregate(key=[], 
>>> functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#3070L])
>>> 
>>> +- TungstenExchange SinglePartition, None
>>> 
>>>+- TungstenAggregate(key=[], 
>>> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#3076L])
>>> 
>>>   +- Scan ParquetRelation: rajub.dummy[] InputPaths: 
>>> maprfs:/user/rajub/dummy/sample/year=2016/month=10, 
>>> maprfs:/user/rajub/dummy/sample/year=2016/month=11, 
>>> maprfs:/user/rajub/dummy/sample/year=2016/month=9, 
>>> maprfs:/user/rajub/dummy/sample/year=2017/month=10, 
>>> maprfs:/user/rajub/dummy/sample/year=2017/month=11, 
>>> maprfs:/user/rajub/dummy/sample/year=2017/month=9
>>> 
>>> 
>>> On Wed, Jan 18, 2017 at 12:25 PM, Michael Allman >> > wrote:
>>> Can you paste the actual query plan here, please?
>>> 
 On Jan 17, 2017, at 7:38 PM, Raju Bairishetti > wrote:
 
 
 On Wed, Jan 18, 2017 at 11:13 AM, Michael Allman > wrote:
 What is the physical query plan after you set 
 spark.sql.hive.convertMetastoreParquet to true?
The physical plan contains all the partition locations
 
 Michael
 
> On Jan 17, 2017, at 6:51 PM, Raju Bairishetti  > wrote:
> 
> Thanks Michael for the response.
> 
> 
> On Wed, Jan 18, 2017 at 2:45 AM, Michael Allman  > wrote:
> Hi Raju,
> 
> I'm sorry this isn't working for you. I helped author this functionality 
> and will try my best to help.
> 
> First, I'm curious why you set spark.sql.hive.convertMetastoreParquet to 
> false? 
> I had set it as suggested in SPARK-6910 and the corresponding pull requests. It
> did not work for me without setting the spark.sql.hive.convertMetastoreParquet
> property.
> 
> Can you link specifically to the jira issue or spark pr you referred to? 
> The first thing I would try is setting 
> spark.sql.hive.convertMetastoreParquet to true. Setting that to false 
> might also explain why you're getting parquet decode errors. If you're 
> writing your table data with Spark's parquet file writer and reading with 
> Hive's parquet file reader, 

Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Burak Yavuz
Hi Maciej,

I believe it would be useful to either fix the documentation or fix the
implementation. I'll leave it to the community to comment on. The code
right now disallows intervals provided in months and years, because they
are not a "consistently" fixed amount of time. A month can be 28, 29, 30,
or 31 days. A year is 12 months for sure, but is it 360 days (sometimes
used in finance), 365 days or 366 days?

Therefore we could either:
  1) Allow windowing when intervals are given in days and less, even though
it could be 365 days, and fix the documentation.
  2) Explicitly disallow it as there may be a lot of data for a given
window, but partial aggregations should help with that.

My thoughts are to go with 1. What do you think?

Best,
Burak

On Wed, Jan 18, 2017 at 10:18 AM, Maciej Szymkiewicz  wrote:

> Hi,
>
> Can I ask for some clarifications regarding intended behavior of window /
> TimeWindow?
>
> PySpark documentation states that "Windows in the order of months are not
> supported". This is further confirmed by the checks in 
> TimeWindow.getIntervalInMicroseconds
> (https://git.io/vMP5l).
>
> Surprisingly enough we can pass interval much larger than a month by
> expressing interval in days or another unit of a higher precision. So this
> fails:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))
>
> while following is accepted:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))
>
> with results which look sensible at first glance.
>
> Is it a matter of a faulty validation logic (months will be assigned only
> if there is a match against years or months https://git.io/vMPdi) or
> expected behavior and I simply misunderstood the intentions?
>
> --
> Best,
> Maciej
>
>


ApacheCon CFP closing soon (11 February)

2017-01-18 Thread Rich Bowen
Hello, fellow Apache enthusiast. Thanks for your participation in, and
interest in, the projects of the Apache Software Foundation.

I wanted to remind you that the Call For Papers (CFP) for ApacheCon
North America, and Apache: Big Data North America, closes in less than a
month. If you've been putting it off because there was lots of time
left, it's time to dig for that inspiration and get those talk proposals in.

It's also time to discuss with your developer and user community whether
there's a track of talks that you might want to propose, so that you
have more complete coverage of your project than a talk or two.

We're looking for talks directly, and indirectly, related to projects at
the Apache Software Foundation. These can be anything from in-depth
technical discussions of the projects you work with, to talks about
community, documentation, legal issues, marketing, and so on. We're also
very interested in talks about projects and services built on top of
Apache projects, and case studies of how you use Apache projects to
solve real-world problems.

We are particularly interested in presentations from Apache projects
either in the Incubator, or recently graduated. ApacheCon is where
people come to find out what technology they'll be using this time next
year.

Important URLs are:

To submit a talk for Apache: Big Data -
http://events.linuxfoundation.org/events/apache-big-data-north-america/program/cfp
To submit a talk for ApacheCon -
http://events.linuxfoundation.org/events/apachecon-north-america/program/cfp

To register for Apache: Big Data -
http://events.linuxfoundation.org/events/apache-big-data-north-america/attend/register-
To register for ApacheCon -
http://events.linuxfoundation.org/events/apachecon-north-america/attend/register-

Early Bird registration rates end March 12th, but if you're a committer
on an Apache project, you get the low committer rate, which is less than
half of the early bird rate!

For further updates about ApacheCon, follow us on Twitter, @ApacheCon,
or drop by our IRC channel, #apachecon on the Freenode IRC network. Or
contact me - rbo...@apache.org - with any questions or concerns.

Thanks!

Rich Bowen, VP Conferences, Apache Software Foundation

-- 
(You've received this email because you're on a dev@ or users@ mailing
list of an Apache Software Foundation project. For subscription and
unsubscription information, consult the headers of this email message,
as this varies from one list to another.)




Re: GC limit exceed

2017-01-18 Thread Daniel van der Ende
Hi Marco,

What kind of scheduler are you using on your cluster? Yarn?

Also, are you running in client mode or cluster mode on the cluster?

Daniel

On Wed, Jan 18, 2017 at 3:22 PM, marco rocchi <
rocchi.1407...@studenti.uniroma1.it> wrote:

> I have Spark code that works well over a sample of the data in local mode,
> but when I run the same code on a cluster with the entire dataset I
> receive a GC limit exceeded error.
> Is it possible to submit the code here and get some hints in
> order to solve my problem?
> Thanks a lot for the attention
>
> Marco
>



-- 
Daniel


GC limit exceed

2017-01-18 Thread marco rocchi
I have Spark code that works well over a sample of the data in local mode,
but when I run the same code on a cluster with the entire dataset I
receive a GC limit exceeded error.
Is it possible to submit the code here and get some hints in order
to solve my problem?
Thanks a lot for the attention

Marco


Re: can someone review my PR?

2017-01-18 Thread Steve Loughran

On 18 Jan 2017, at 11:18, Sean Owen 
> wrote:

It still doesn't pass tests -- I'd usually not look until that point.

It's failing on the dependency check, as the dependencies have changed; that's
what it's meant to do. Should I explicitly be changing the values so that the
build doesn't notice the change?

On Wed, Jan 18, 2017 at 11:10 AM Steve Loughran 
> wrote:
I've had a PR outstanding on spark/object store integration, works for both 
maven and sbt builds

https://issues.apache.org/jira/browse/SPARK-7481
https://github.com/apache/spark/pull/12004

Can I get someone to review this as it appears to be being overlooked amongst 
all the PRs

thanks

-steve



[Spark SQL] Making InferSchema and JacksonParser public

2017-01-18 Thread Brian Hong
I work for a mobile game company. I'm solving a simple question: "Can we
efficiently/cheaply query for the log of a particular user within given
date period?"

I've created a special JSON text-based file format that has these traits:
 - Snappy compressed, saved in AWS S3
 - Partitioned by date. ie. 2017-01-01.sz, 2017-01-02.sz, ...
 - Sorted by a primary key (log_type) and a secondary key (user_id), Snappy
block compressed by 5MB blocks
 - Blocks are indexed with primary/secondary key in file 2017-01-01.json
 - Efficient block based random access on primary key (log_type) and
secondary key (user_id) using the index

I've created a Spark SQL DataFrame relation that can query this file
format.  Since the schema of each log type is fairly consistent, I've
reused the `InferSchema.inferSchema` method and `JacksonParser` in the Spark
SQL code to support structured querying.  I've also implemented filter
push-down to optimize the file access.

It is very fast when querying for a single user or querying for a single
log type with a sampling ratio of 1 to 1 compared to parquet file
format.  (We do use parquet for some log types when we need batch analysis.)

One of the problems we face is that the methods we use above are private
API.  So we are forced to use hacks to use these methods.  (Things like
copying the code or using the org.apache.spark.sql package namespace)
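
To make the second hack concrete, it looks roughly like the sketch below: a
file compiled in our own project but declared inside Spark's package namespace,
so that private[sql] members become reachable. The object name is hypothetical
and the delegated call is elided, since the private classes and signatures
change between releases — which is exactly the maintenance problem described
here:

package org.apache.spark.sql

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.StructType

object MySchemaBridge {
  // Delegate to the version-specific private schema-inference entry point
  // (the InferSchema helper referenced in this thread).
  def inferJsonSchema(json: RDD[String]): StructType = {
    ???  // e.g. InferSchema.inferSchema(...) -- exact call varies per release
  }
}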

I've been following Spark SQL code since 1.4, and the JSON schema
inferencing code and JacksonParser seem to be relatively stable recently.
Can the core-devs make these APIs public?

We are willing to open-source this file format because it is excellent
for archiving user-related logs in S3.  The dependency on private Spark SQL
APIs is the main hurdle to making this a reality.

Thank you for reading!


Re: can someone review my PR?

2017-01-18 Thread Sean Owen
It still doesn't pass tests -- I'd usually not look until that point.

On Wed, Jan 18, 2017 at 11:10 AM Steve Loughran 
wrote:

> I've had a PR outstanding on spark/object store integration, works for
> both maven and sbt builds
>
> https://issues.apache.org/jira/browse/SPARK-7481
> https://github.com/apache/spark/pull/12004
>
> Can I get someone to review this as it appears to be being overlooked
> amongst all the PRs
>
> thanks
>
> -steve
>


can someone review my PR?

2017-01-18 Thread Steve Loughran
I've had a PR outstanding on spark/object store integration, works for both 
maven and sbt builds

https://issues.apache.org/jira/browse/SPARK-7481
https://github.com/apache/spark/pull/12004

Can I get someone to review this as it appears to be being overlooked amongst 
all the PRs

thanks

-steve


clientMode in RpcEnv.create in Spark on YARN vs general case (driver vs executors)?

2017-01-18 Thread Jacek Laskowski
Hi,

I'm trying to get the gist of the clientMode input parameter for
RpcEnv.create [1]. It is disabled (i.e. false) by default.

I've managed to find out that, in the "general" case, it's enabled for
executors and disabled for the driver.

(it's also used for Spark Standalone's master and workers but it's
infra and I'm not interested in exploring it currently).

I've however noticed that in YARN the clientMode parameter means
something more, i.e. whether the Spark application runs in client or
cluster deploy mode.

In YARN my understanding of the parameter is that clientMode is
enabled explicitly when Spark on YARN's ApplicationMaster creates the
`sparkYarnAM` RPC Environment [2] (when executed for client deploy
mode [3])

This is because in client deploy mode the driver runs on some other
node and the AM acts simply as a proxy to Spark executors that run in
their own YARN containers. This is (also?) because in client deploy
mode in Spark on YARN we have separate JVM processes for the driver,
the AM and Spark executors. The distinction is

Are the last two paragraphs correct?

I'd appreciate it if you could fix and fill in the gaps where necessary.
Thanks a lot for making it so much easier for me and... the participants of my
Spark workshops ;-)

[1] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala#L42

[2] 
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L434

[3] 
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L254

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski




Re: Limit Query Performance Suggestion

2017-01-18 Thread Liang-Chi Hsieh

Hi zhenhua,

Thanks for the idea.

Actually, I think we can completely avoid shuffling the data in a limit
operation, whether it is a LocalLimit or a GlobalLimit.



wangzhenhua (G) wrote
> How about this:
> 1. we can make LocalLimit shuffle to multiple partitions, i.e. create a new
> partitioner to uniformly dispatch the data
> 
> class LimitUniformPartitioner(partitions: Int) extends Partitioner {
> 
>   def numPartitions: Int = partitions
>   
>   var num = 0
> 
>   def getPartition(key: Any): Int = {
> num = num + 1
> num % partitions
>   }
> 
>   override def equals(other: Any): Boolean = other match {
> case h: HashPartitioner =>
>   h.numPartitions == numPartitions
> case _ =>
>   false
>   }
> 
>   override def hashCode: Int = numPartitions
> }
> 
> 2. then in GlobalLimit, we only take the first
> limit_number/num_of_shufflepartitions elements in each partition.
> 
> One issue left is how to decide shuffle partition number. 
> We can have a config of the maximum number of elements for each
> GlobalLimit task to process,
> then do a factorization to get a number most close to that config.
> E.g. the config is 2000:
> if limit=10000,  10000 = 2000 * 5, we shuffle to 5 partitions
> if limit=,  =  * 9, we shuffle to 9 partitions
> if limit is a prime number, we just fall back to single partition
> 
> best regards,
> -zhenhua
> 
> 
> -邮件原件-
> 发件人: Liang-Chi Hsieh [mailto:

> viirya@

> ] 
> 发送时间: 2017年1月18日 15:48
> 收件人: 

> dev@.apache

> 主题: Re: Limit Query Performance Suggestion
> 
> 
> Hi Sujith,
> 
> I saw your updated post. Seems it makes sense to me now.
> 
> If you use a very big limit number, the shuffling before `GlobalLimit`
> would be a bottleneck for performance, of course, even though it can eventually
> shuffle enough data to the single partition.
> 
> Unlike `CollectLimit`, actually I think there is no reason `GlobalLimit`
> must shuffle all limited data from all partitions to one single machine
> with respect to query execution. In other words, I think we can avoid
> shuffling data in `GlobalLimit`.
> 
> I have an idea to improve this and may update here later if I can make it
> work.
> 
> 
> sujith71955 wrote
>> Dear Liang,
>> 
>> Thanks for your valuable feedback.
>> 
>> There was a mistake in the previous post; I corrected it. As you
>> mentioned, in `GlobalLimit` we will only take the required number of
>> rows from the input iterator, which really pulls data from local blocks
>> and remote blocks.
>> But if the limit value is very high (>= 1000), a shuffle exchange
>> happens between `GlobalLimit` and `LocalLimit` to retrieve data from all
>> partitions into one partition; since the limit value is very large, the
>> performance bottleneck still exists.
>>
>> Soon, in the next post, I will publish a test report with sample data and
>> also figure out a solution for this problem.
>> 
>> Please let me know for any clarifications or suggestions regarding 
>> this issue.
>> 
>> Regards,
>> Sujith
> 
> 
> 
> 
> 
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Limit-Query-Performance-Suggestion-tp20570p20652.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> 





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Limit-Query-Performance-Suggestion-tp20570p20657.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Limit Query Performance Suggestion

2017-01-18 Thread wangzhenhua (G)
How about this:
1. we can make LocalLimit shuffle to multiple partitions, i.e. create a new
partitioner to uniformly dispatch the data

class LimitUniformPartitioner(partitions: Int) extends Partitioner {

  def numPartitions: Int = partitions
  
  var num = 0

  def getPartition(key: Any): Int = {
num = num + 1
num % partitions
  }

  override def equals(other: Any): Boolean = other match {
case h: HashPartitioner =>
  h.numPartitions == numPartitions
case _ =>
  false
  }

  override def hashCode: Int = numPartitions
}

2. then in GlobalLimit, we only take the first 
limit_number/num_of_shufflepartitions elements in each partition.
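
As a rough illustration of step 2 (not Spark's actual GlobalLimit operator, and
assuming the limit divides evenly across the shuffle partitions):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// After the uniform shuffle in step 1, each task keeps only the first
// limit / numPartitions rows of its partition.
def takePerPartition[T: ClassTag](rdd: RDD[T], limit: Int): RDD[T] = {
  val perPartition = limit / rdd.getNumPartitions  // assumes an even split
  rdd.mapPartitions(_.take(perPartition), preservesPartitioning = true)
}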

One issue left is how to decide shuffle partition number. 
We can have a config of the maximum number of elements for each GlobalLimit 
task to process,
then do a factorization to get a number most close to that config.
E.g. the config is 2000:
if limit=10000,  10000 = 2000 * 5, we shuffle to 5 partitions
if limit=,  =  * 9, we shuffle to 9 partitions
if limit is a prime number, we just fall back to single partition

best regards,
-zhenhua


-----Original Message-----
From: Liang-Chi Hsieh [mailto:vii...@gmail.com]
Sent: January 18, 2017 15:48
To: dev@spark.apache.org
Subject: Re: Limit Query Performance Suggestion


Hi Sujith,

I saw your updated post. Seems it makes sense to me now.

If you use a very big limit number, the shuffling before `GlobalLimit` would be
a bottleneck for performance, of course, even though it can eventually shuffle
enough data to the single partition.

Unlike `CollectLimit`, actually I think there is no reason `GlobalLimit` must 
shuffle all limited data from all partitions to one single machine with respect 
to query execution. In other words, I think we can avoid shuffling data in 
`GlobalLimit`.

I have an idea to improve this and may update here later if I can make it work.


sujith71955 wrote
> Dear Liang,
> 
> Thanks for your valuable feedback.
> 
> There was a mistake in the previous post; I corrected it. As you
> mentioned, in `GlobalLimit` we will only take the required number of
> rows from the input iterator, which really pulls data from local blocks
> and remote blocks.
> But if the limit value is very high (>= 1000), a shuffle exchange
> happens between `GlobalLimit` and `LocalLimit` to retrieve data from all
> partitions into one partition; since the limit value is very large, the
> performance bottleneck still exists.
>
> Soon, in the next post, I will publish a test report with sample data and
> also figure out a solution for this problem.
> 
> Please let me know for any clarifications or suggestions regarding 
> this issue.
> 
> Regards,
> Sujith





-
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Limit-Query-Performance-Suggestion-tp20570p20652.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.






[SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Maciej Szymkiewicz
Hi,

Can I ask for some clarifications regarding intended behavior of window
/ TimeWindow?

PySpark documentation states that "Windows in the order of months are
not supported". This is further confirmed by the checks in
TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).

Surprisingly enough, we can pass an interval much larger than a month by
expressing the interval in days or another unit of higher precision. So
this fails:

Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))

while the following is accepted:

Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))

with results which look sensible at first glance.

Is this a matter of faulty validation logic (months are assigned
only if there is a match against years or months, https://git.io/vMPdi),
or is it expected behavior and I simply misunderstood the intentions?

-- 
Best,
Maciej



Re: RpcEnv(Factory) is no longer pluggable? spark.rpc is gone, isn't it?

2017-01-18 Thread Jacek Laskowski
On Wed, Jan 18, 2017 at 8:57 AM, Jacek Laskowski  wrote:

> p.s. How to know when the deprecation was introduced? The last change
> is for executor blacklisting so git blame does not show what I want :(
> Any ideas?

Figured that out myself!

$ git log --topo-order --graph -u -L
641,641:core/src/main/scala/org/apache/spark/SparkConf.scala

which gave me 
https://github.com/apache/spark/commit/4f5a24d7e73104771f233af041eeba4f41675974.

If you knew it, you're the git pro!

Jacek




RpcEnv(Factory) is no longer pluggable? spark.rpc is gone, isn't it?

2017-01-18 Thread Jacek Laskowski
Hi,

Given [1]:

> DeprecatedConfig("spark.rpc", "2.0", "Not used any more.")

I believe the comment in [2]:

> A RpcEnv implementation must have a [[RpcEnvFactory]] implementation with an 
> empty constructor so that it can be created via Reflection.

Correct? Deserves a pull request to get rid of the seemingly incorrect scaladoc?

p.s. How to know when the deprecation was introduced? The last change
is for executor blacklisting so git blame does not show what I want :(
Any ideas?

[1] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L641
[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala#L32

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
