Re: [vote] Apache Spark 3.0 RC3

2020-06-07 Thread Yin Huai
Hello everyone,

I am wondering if it makes more sense to not count Saturday and Sunday. I
doubt that any serious testing work was done during this past weekend. Can
we only count business days in the voting process?

Thanks,

Yin

On Sun, Jun 7, 2020 at 3:24 PM Denny Lee  wrote:

> +1 (non-binding)
>
> On Sun, Jun 7, 2020 at 3:21 PM Jungtaek Lim 
> wrote:
>
>> I'm seeing the effort to include the correctness issue SPARK-28067 [1]
>> in 3.0.0 via SPARK-31894 [2]. That doesn't seem to be a regression, so
>> technically it doesn't block the release. While it'd be good to weigh its
>> worth (it requires some Structured Streaming users to discard their state,
>> which is less frightening when required as part of a major version
>> upgrade), including SPARK-28067 in 3.0.0 looks optional.
>>
>> Besides, I see all blockers look to be resolved, thanks all for the
>> amazing efforts!
>>
>> +1 (non-binding) if the decision of SPARK-28067 is "later".
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-28067
>> 2. https://issues.apache.org/jira/browse/SPARK-31894
>>
>> On Mon, Jun 8, 2020 at 5:23 AM Matei Zaharia 
>> wrote:
>>
>>> +1
>>>
>>> Matei
>>>
>>> On Jun 7, 2020, at 6:53 AM, Maxim Gekk 
>>> wrote:
>>>
>>> +1 (non-binding)
>>>
>>> On Sun, Jun 7, 2020 at 2:34 PM Takeshi Yamamuro 
>>> wrote:
>>>
 +1 (non-binding)

 I don't see any ongoing PR to fix critical bugs in my area.
 Bests,
 Takeshi

 On Sun, Jun 7, 2020 at 7:24 PM Mridul Muralidharan 
 wrote:

> +1
>
> Regards,
> Mridul
>
> On Sat, Jun 6, 2020 at 1:20 PM Reynold Xin 
> wrote:
>
>> Apologies for the mistake. The vote is open till 11:59pm Pacific time
>> on Mon June 9th.
>>
>> On Sat, Jun 6, 2020 at 1:08 PM Reynold Xin 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 3.0.0.
>>>
>>> The vote is open until [DUE DAY] and passes if a majority +1 PMC
>>> votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.0.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.0.0-rc3 (commit
>>> 3fdfce3120f307147244e5eaf46d61419a723d50):
>>> https://github.com/apache/spark/tree/v3.0.0-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found
>>> at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1350/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
>>>
>>> The list of bug fixes going into 3.0.0 can be found at the following
>>> URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>>
>>> This release is using the release script of the tag v3.0.0-rc3.
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload, running it on this release candidate, and
>>> reporting any regressions.
>>>
>>> If you're working in PySpark, you can set up a virtual env, install
>>> the current RC, and see if anything important breaks. In Java/Scala,
>>> you can add the staging repository to your project's resolvers and
>>> test with the RC (make sure to clean up the artifact cache
>>> before/after so you don't end up building with an out-of-date RC
>>> going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.0.0?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.0.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for
>>> "Target Version/s" = 3.0.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not 

Re: moving the spark jenkins job builder repo from dbricks --> spark

2018-10-17 Thread Yin Huai
Shane, thank you for initiating this work! Can we do an audit of jenkins
users and trim down the list?

Also, for packaging jobs, those branch snapshot jobs are active (for
example,
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
for publishing snapshot builds from the master branch). They still need
credentials. After we remove the encrypted credential file, are we planning
to use jenkins as the single place to manage those credentials, and just
refer to them in the jenkins job config?

On Wed, Oct 10, 2018 at 12:06 PM shane knapp  wrote:

> Not sure if that's what you meant; but it should be ok for the jenkins
>> servers to manually sync with master after you (or someone else) have
>> verified the changes. That should prevent inadvertent breakages since
>> I don't expect it to be easy to test those scripts without access to
>> some test jenkins server.
>>
>> JJB has some built-in lint and testing, so that'll be the first step in
> verifying the build configs.
>
> i still have a dream where i have a fully functioning jenkins staging
> deployment...  one day i will make that happen.  :)
>
> shane
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: python tests related to pandas are skipped in jenkins

2018-01-31 Thread Yin Huai
I created https://issues.apache.org/jira/browse/SPARK-23292 for this issue.

On Wed, Jan 31, 2018 at 8:17 PM, Yin Huai <yh...@databricks.com> wrote:

> btw, it seems we also have the same skipping logic for pyarrow. But I have
> not looked into whether tests related to pyarrow are skipped or not.
>
> On Wed, Jan 31, 2018 at 8:15 PM, Yin Huai <yh...@databricks.com> wrote:
>
>> Hello,
>>
>> I was running python tests and found that
>> pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types
>> <https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548>
>> does not run with Python 2 because the test uses "assertRaisesRegex"
>> (supported by Python 3) instead of "assertRaisesRegexp" (supported by
>> Python 2). However, spark jenkins does not fail because of this issue
>> (see the run history here
>> <https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/>).
>> After looking into this issue, it seems the test script will skip tests
>> related to pandas if pandas is not installed
>> <https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63>,
>> which means that jenkins does not have pandas installed.
>>
>> @Shane, can you help us check if jenkins workers have pandas installed?
>>
>> Thanks,
>>
>> Yin
>>
>
>


Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-31 Thread Yin Huai
It seems we are not running tests related to pandas in the pyspark tests
(see my email "python tests related to pandas are skipped in jenkins"). I
think we should fix this test issue and make sure all tests are good before
cutting RC3.

On Wed, Jan 31, 2018 at 10:12 AM, Sameer Agarwal 
wrote:

> Just a quick status update on RC3 -- SPARK-23274 was resolved yesterday
> and tests have been quite healthy throughout this week and the last. I'll
> cut the new RC as soon as the remaining blocker (SPARK-23202) is resolved.
>
>
> On 30 January 2018 at 10:12, Andrew Ash  wrote:
>
>> I'd like to nominate SPARK-23274 as a potential blocker for the 2.3.0
>> release as well, due to being a regression from 2.2.0. The ticket has a
>> simple repro included, showing a query that works in prior releases but
>> now fails with an exception in the catalyst optimizer.
>>
>> On Fri, Jan 26, 2018 at 10:41 AM, Sameer Agarwal 
>> wrote:
>>
>>> This vote has failed due to a number of aforementioned blockers. I'll
>>> follow up with RC3 as soon as the 2 remaining (non-QA) blockers are
>>> resolved: https://s.apache.org/oXKi
>>>
>>>
>>> On 25 January 2018 at 12:59, Sameer Agarwal 
>>> wrote:
>>>

 Most tests pass on RC2, except I'm still seeing the timeout caused by
> https://issues.apache.org/jira/browse/SPARK-23055 ; the tests never
> finish. I followed the thread a bit further and wasn't clear whether it 
> was
> subsequently re-fixed for 2.3.0 or not. It says it's resolved along with
> https://issues.apache.org/jira/browse/SPARK-22908 for 2.3.0 though I
> am still seeing these tests fail or hang:
>
> - subscribing topic by name from earliest offsets (failOnDataLoss:
> false)
> - subscribing topic by name from earliest offsets (failOnDataLoss:
> true)
>

 Sean, while some of these tests were timing out on RC1, we're not aware
 of any known issues in RC2. Both maven
 (https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/146/testReport/org.apache.spark.sql.kafka010/history/)
 and sbt
 (https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/123/testReport/org.apache.spark.sql.kafka010/history/)
 historical builds on jenkins for org.apache.spark.sql.kafka010 look fairly
 healthy. If you're still seeing timeouts in RC2, can you create a JIRA with
 any applicable build/env info?



> On Tue, Jan 23, 2018 at 9:01 AM Sean Owen  wrote:
>
>> I'm not seeing that same problem on OS X and /usr/bin/tar. I tried
>> unpacking it with 'xvzf' and also unzipping it first, and it untarred
>> without warnings in either case.
>>
>> I am encountering errors while running the tests, different ones each
>> time, so am still figuring out whether there is a real problem or just
>> flaky tests.
>>
>> These issues look like blockers, as they are inherently to be
>> completed before the 2.3 release. They are mostly not done. I suppose I'd
>> -1 on behalf of those who say this needs to be done first, though, we can
>> keep testing.
>>
>> SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella
>> SPARK-23114 Spark R 2.3 QA umbrella
>>
>> Here are the remaining items targeted for 2.3:
>>
>> SPARK-15689 Data source API v2
>> SPARK-20928 SPIP: Continuous Processing Mode for Structured Streaming
>> SPARK-21646 Add new type coercion rules to compatible with Hive
>> SPARK-22386 Data Source V2 improvements
>> SPARK-22731 Add a test for ROWID type to OracleIntegrationSuite
>> SPARK-22735 Add VectorSizeHint to ML features documentation
>> SPARK-22739 Additional Expression Support for Objects
>> SPARK-22809 pyspark is sensitive to imports with dots
>> SPARK-22820 Spark 2.3 SQL API audit
>>
>>
>> On Mon, Jan 22, 2018 at 7:09 PM Marcelo Vanzin 
>> wrote:
>>
>>> +0
>>>
>>> Signatures check out. Code compiles, although I see the errors in [1]
>>> when untarring the source archive; perhaps we should add "use GNU
>>> tar"
>>> to the RM checklist?
>>>
>>> Also ran our internal tests and they seem happy.
>>>
>>> My concern is the list of open bugs targeted at 2.3.0 (ignoring the
>>> documentation ones). It is not long, but it seems some of those need
>>> to be looked at. It would be nice for the committers who are involved
>>> in those bugs to take a look.
>>>
>>> [1] https://superuser.com/questions/318809/linux-os-x-tar-incomp
>>> 

Re: python tests related to pandas are skipped in jenkins

2018-01-31 Thread Yin Huai
btw, it seems we also have the same skipping logic for pyarrow. But I have
not looked into whether tests related to pyarrow are skipped or not.

On Wed, Jan 31, 2018 at 8:15 PM, Yin Huai <yh...@databricks.com> wrote:

> Hello,
>
> I was running python tests and found that
> pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types
> <https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548>
> does not run with Python 2 because the test uses "assertRaisesRegex"
> (supported by Python 3) instead of "assertRaisesRegexp" (supported by
> Python 2). However, spark jenkins does not fail because of this issue
> (see the run history here
> <https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/>).
> After looking into this issue, it seems the test script will skip tests
> related to pandas if pandas is not installed
> <https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63>,
> which means that jenkins does not have pandas installed.
>
> @Shane, can you help us check if jenkins workers have pandas installed?
>
> Thanks,
>
> Yin
>


python tests related to pandas are skipped in jenkins

2018-01-31 Thread Yin Huai
Hello,

I was running python tests and found that
pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types
does not run with Python 2 because the test uses "assertRaisesRegex"
(supported by Python 3) instead of "assertRaisesRegexp" (supported by
Python 2). However, spark jenkins does not fail because of this issue (see
the run history). After looking into this issue, it seems the test script
will skip tests related to pandas if pandas is not installed, which means
that jenkins does not have pandas installed.

@Shane, can you help us check if jenkins workers have pandas installed?

Thanks,

Yin


Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-11 Thread Yin Huai
+1

On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal 
wrote:

> +1 (non-binding)
>
> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler  wrote:
>
>> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
>> fine to work out the minor details of the API during review.
>>
>> Bryan
>>
>> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN 
>> wrote:
>>
>>> Hi all,
>>>
>>> Thank you for voting and suggestions.
>>>
>>> As Wenchen mentioned, and as we're also discussing on JIRA, we need to
>>> discuss the size hint for the 0-parameter UDF.
>>> But since I believe we have a consensus about the basic APIs except for
>>> the size hint, I'd like to submit a PR based on the current proposal and
>>> continue the discussion in its review.
>>>
>>> https://github.com/apache/spark/pull/19147
>>>
>>> I'd keep this vote open to wait for more opinions.
>>>
>>> Thanks.
>>>
>>>
>>> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan  wrote:
>>>
 +1 on the design and proposed API.

 One detail I'd like to discuss is the 0-parameter UDF, how we can
 specify the size hint. This can be done in the PR review though.

 On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung  wrote:

> +1 on this and like the suggestion of type in string form.
>
> Would it be correct to assume there will be a data type check, for
> example that the returned pandas data frame column data types match what
> is specified? We have seen quite a few issues/confusions with that in R.
>
> Would it make sense to have a more generic decorator name so that it
> could also be usable for other efficient vectorized formats in the future?
> Or do we anticipate the decorator to be format specific, with more added
> in the future?
>
> --
> *From:* Reynold Xin 
> *Sent:* Friday, September 1, 2017 5:16:11 AM
> *To:* Takuya UESHIN
> *Cc:* spark-dev
> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>
> Ok, thanks.
>
> +1 on the SPIP for scope etc
>
>
> On API details (will deal with in code reviews as well but leaving a
> note here in case I forget)
>
> 1. I would suggest having the API also accept data type specification
> in string form. It is usually simpler to say "long" than "LongType()".
>
> 2. Think about what error message to show when the row counts don't
> match at runtime.
>
>
> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN 
> wrote:
>
>> Yes, the aggregation is out of scope for now.
>> I think we should continue discussing the aggregation at JIRA and we
>> will be adding those later separately.
>>
>> Thanks.
>>
>>
>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin 
>> wrote:
>>
>>> Is the idea that aggregates are out of scope for the current effort and
>>> we will be adding those later?
>>>
>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN 
>>> wrote:
>>>
 Hi all,

 We've been discussing support for vectorized UDFs in Python and we have
 almost reached a consensus about the APIs, so I'd like to summarize and
 call for a vote.

 Note that this vote should focus on APIs for vectorized UDFs, not
 APIs for vectorized UDAFs or Window operations.

 https://issues.apache.org/jira/browse/SPARK-21190


 *Proposed API*

 We introduce a @pandas_udf decorator (or annotation) to define
 vectorized UDFs, which take one or more pandas.Series, or one
 integer value meaning the length of the input for 0-parameter UDFs.
 The return value should be a pandas.Series of the specified type,
 and its length should be the same as the input's.

 We can define vectorized UDFs as:

   @pandas_udf(DoubleType())
   def plus(v1, v2):
       return v1 + v2

 or we can define them as:

   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())

 We can use them similarly to row-by-row UDFs:

   df.withColumn('sum', plus(df.v1, df.v2))

 As for 0-parameter UDFs, we can define and use them as:

   @pandas_udf(LongType())
   def f0(size):
       return pd.Series(1).repeat(size)

   df.select(f0())



 The vote will be up for the next 72 hours. Please reply with your
 vote:

 +1: Yeah, let's go forward and implement the SPIP.
 +0: Don't really care.
 -1: I don't think this is a good idea because of the following 
 technical
 

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-06 Thread Yin Huai
+1

On Thu, Jul 6, 2017 at 8:40 PM, Hyukjin Kwon  wrote:

> +1
>
> 2017-07-07 6:41 GMT+09:00 Reynold Xin :
>
>> +1
>>
>>
>> On Fri, Jun 30, 2017 at 6:44 PM, Michael Armbrust > > wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.2.0-rc6
>>> (a2c7b2133cfee7fa9abfaa2bfbfb637155466783)
>>>
>>> List of JIRA tickets resolved can be found with this filter.
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1245/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1.
>>>
>>
>>
>


Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hello Jacek,

Actually, Reynold is still the release manager and I am just sending this
message for him :) Sorry. I should have made it clear in my original email.

Thanks,

Yin

On Thu, Dec 29, 2016 at 10:58 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Yin,
>
> I was surprised at first when I noticed rxin stepped back and a
> new release manager stepped in. Congrats on your first ANNOUNCE!
>
> I can only expect even more great stuff coming into Spark from the dev
> team after Reynold spared some time 
>
> Can't wait to read the changes...
>
> Jacek
>
> On 29 Dec 2016 5:03 p.m., "Yin Huai" <yh...@databricks.com> wrote:
>
>> Hi all,
>>
>> Apache Spark 2.1.0 is the second release of Spark 2.x line. This release
>> makes significant strides in the production readiness of Structured
>> Streaming, with added support for event time watermarks
>> <https://spark.apache.org/docs/2.1.0/structured-streaming-programming-guide.html#handling-late-data-and-watermarking>
>> and Kafka 0.10 support
>> <https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html>.
>> In addition, this release focuses more on usability, stability, and polish,
>> resolving over 1200 tickets.
>>
>> We'd like to thank our contributors and users for their contributions and
>> early feedback to this release. This release would not have been possible
>> without you.
>>
>> To download Spark 2.1.0, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2-1-0.html
>>
>> (note: If you see any issues with the release notes, webpage or published
>> artifacts, please contact me directly off-list)
>>
>>
>>


[ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hi all,

Apache Spark 2.1.0 is the second release of Spark 2.x line. This release
makes significant strides in the production readiness of Structured
Streaming, with added support for event time watermarks

and Kafka 0.10 support
.
In addition, this release focuses more on usability, stability, and polish,
resolving over 1200 tickets.

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 2.1.0, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-1-0.html

(note: If you see any issues with the release notes, webpage or published
artifacts, please contact me directly off-list)


Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-12 Thread Yin Huai
-1

I hit https://issues.apache.org/jira/browse/SPARK-18816, which prevents the
executor page from showing the log links if an application does not have
executors initially.

On Mon, Dec 12, 2016 at 3:02 PM, Marcelo Vanzin  wrote:

> Actually this is not a simple pom change. The code in
> UDFRegistration.scala calls this method:
>
>   if (returnType == null) {
>     returnType = JavaTypeInference.inferDataType(TypeToken.of(udfReturnType))._1
>   }
>
> Because we shade guava, it's generally not very safe to call methods
> in different modules that expose shaded APIs. Can this code be
> modified to call the variant that just takes a java.lang.Class instead
> of Guava's TypeToken? It seems like that would work, since that method
> basically just wraps the argument with "TypeToken.of".
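>
> A minimal sketch of what that change could look like (illustrative only;
> it assumes a Class-based inferDataType overload is available, as described
> above):
>
>   if (returnType == null) {
>     // Pass the plain java.lang.Class so no shaded Guava type crosses the
>     // module boundary; the TypeToken wrapping happens inside catalyst.
>     returnType = JavaTypeInference.inferDataType(udfReturnType)._1
>   }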
>
>
>
> On Mon, Dec 12, 2016 at 2:03 PM, Marcelo Vanzin 
> wrote:
> > I'm running into this when building / testing on 1.7 (haven't tried 1.8):
> >
> > udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.079
> > sec  <<< ERROR!
> > java.lang.NoSuchMethodError:
> > org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
> >at test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
> >
> >
> > Results :
> >
> > Tests in error:
> >  JavaUDFSuite.udf3Test:107 » NoSuchMethod
> > org.apache.spark.sql.catalyst.JavaTyp...
> >
> >
> > Given the error I'm mostly sure it's something easily fixable by
> > adding Guava explicitly in the pom, so probably shouldn't block
> > anything.
> >
> >
> > On Thu, Dec 8, 2016 at 12:39 AM, Reynold Xin 
> wrote:
> >> Please vote on releasing the following candidate as Apache Spark version
> >> 2.1.0. The vote is open until Sun, December 11, 2016 at 1:00 PT and
> passes
> >> if a majority of at least 3 +1 PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Spark 2.1.0
> >> [ ] -1 Do not release this package because ...
> >>
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >> The tag to be voted on is v2.1.0-rc2
> >> (080717497365b83bc202ab16812ced93eb1ea7bd)
> >>
> >> List of JIRA tickets resolved are:
> >> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/
> >>
> >> Release artifacts are signed with the following key:
> >> https://people.apache.org/keys/committer/pwendell.asc
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1217
> >>
> >> The documentation corresponding to this release can be found at:
> >> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/
> >>
> >>
> >> (Note that the docs and staging repo are still being uploaded and will
> be
> >> available soon)
> >>
> >>
> >> ===
> >> How can I help test this release?
> >> ===
> >> If you are a Spark user, you can help us test this release by taking an
> >> existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >> ===
> >> What should happen to JIRA tickets still targeting 2.1.0?
> >> ===
> >> Committers should look at those and triage. Extremely important bug
> fixes,
> >> documentation, and API tweaks that impact compatibility should be
> worked on
> >> immediately. Everything else please retarget to 2.1.1 or 2.2.0.
> >
> >
> >
> > --
> > Marcelo
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-09 Thread Yin Huai
+1

On Wed, Nov 9, 2016 at 1:14 PM, Yin Huai <yh...@databricks.com> wrote:

> +!
>
> On Wed, Nov 9, 2016 at 1:02 PM, Denny Lee <denny.g@gmail.com> wrote:
>
>> +1 (non binding)
>>
>>
>>
>> On Tue, Nov 8, 2016 at 10:14 PM vaquar khan <vaquar.k...@gmail.com>
>> wrote:
>>
>>> *+1 (non binding)*
>>>
>>> On Tue, Nov 8, 2016 at 10:21 PM, Weiqing Yang <yangweiqing...@gmail.com>
>>> wrote:
>>>
>>>  +1 (non binding)
>>>
>>>
>>> Environment: CentOS Linux release 7.0.1406 (Core) / openjdk version
>>> "1.8.0_111"
>>>
>>>
>>>
>>> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>>> -Dpyspark -Dsparkr -DskipTests clean package
>>>
>>> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>>> -Dpyspark -Dsparkr test
>>>
>>>
>>>
>>> On Tue, Nov 8, 2016 at 7:38 PM, Liwei Lin <lwl...@gmail.com> wrote:
>>>
>>> +1 (non-binding)
>>>
>>> Cheers,
>>> Liwei
>>>
>>> On Tue, Nov 8, 2016 at 9:50 PM, Ricardo Almeida <
>>> ricardo.alme...@actnowib.com> wrote:
>>>
>>> +1 (non-binding)
>>>
>>> over Ubuntu 16.10, Java 8 (OpenJDK 1.8.0_111) built with Hadoop 2.7.3,
>>> YARN, Hive
>>>
>>>
>>> On 8 November 2016 at 12:38, Herman van Hövell tot Westerflier <
>>> hvanhov...@databricks.com> wrote:
>>>
>>> +1
>>>
>>> On Tue, Nov 8, 2016 at 7:09 AM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
>>> a majority of at least 3+1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367ba694b0c34)
>>>
>>> This release candidate resolves 84 issues:
>>> https://s.apache.org/spark-2.0.2-jira
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>>
>>>
>>> Q: How can I help test this release?
>>> A: If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions from 2.0.1.
>>>
>>> Q: What justifies a -1 vote for this release?
>>> A: This is a maintenance release in the 2.0.x series. Bugs already
>>> present in 2.0.1, missing features, or bugs related to new features will
>>> not necessarily block this release.
>>>
>>> Q: What fix version should I use for patches merging into branch-2.0
>>> from now on?
>>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>>> +1 -224-436-0783
>>>
>>> IT Architect / Lead Consultant
>>> Greater Chicago
>>>
>>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-09 Thread Yin Huai
+!

On Wed, Nov 9, 2016 at 1:02 PM, Denny Lee  wrote:

> +1 (non binding)
>
>
>
> On Tue, Nov 8, 2016 at 10:14 PM vaquar khan  wrote:
>
>> *+1 (non binding)*
>>
>> On Tue, Nov 8, 2016 at 10:21 PM, Weiqing Yang 
>> wrote:
>>
>>  +1 (non binding)
>>
>>
>> Environment: CentOS Linux release 7.0.1406 (Core) / openjdk version
>> "1.8.0_111"
>>
>>
>>
>> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>> -Dpyspark -Dsparkr -DskipTests clean package
>>
>> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>> -Dpyspark -Dsparkr test
>>
>>
>>
>> On Tue, Nov 8, 2016 at 7:38 PM, Liwei Lin  wrote:
>>
>> +1 (non-binding)
>>
>> Cheers,
>> Liwei
>>
>> On Tue, Nov 8, 2016 at 9:50 PM, Ricardo Almeida <
>> ricardo.alme...@actnowib.com> wrote:
>>
>> +1 (non-binding)
>>
>> over Ubuntu 16.10, Java 8 (OpenJDK 1.8.0_111) built with Hadoop 2.7.3,
>> YARN, Hive
>>
>>
>> On 8 November 2016 at 12:38, Herman van Hövell tot Westerflier <
>> hvanhov...@databricks.com> wrote:
>>
>> +1
>>
>> On Tue, Nov 8, 2016 at 7:09 AM, Reynold Xin  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
>> a majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367ba694b0c34)
>>
>> This release candidate resolves 84 issues:
>> https://s.apache.org/spark-2.0.2-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.1.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series. Bugs already
>> present in 2.0.1, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Regards,
>> Vaquar Khan
>> +1 -224-436-0783
>>
>> IT Architect / Lead Consultant
>> Greater Chicago
>>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-04 Thread Yin Huai
+1

On Tue, Nov 1, 2016 at 9:51 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Fri, Nov 4, 2016 at 22:00 PDT and passes if a
> majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc2 (a6abe1ee22141931614bf27a4f371c46d8379e33)
>
> This release candidate resolves 84 issues:
> https://s.apache.org/spark-2.0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1210/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc2-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC3) is cut, I will change the fix version of those patches to 2.0.2.
>


Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Yin Huai
+1

On Thu, Nov 3, 2016 at 12:57 PM, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

> +1
>
> On Thu, Nov 3, 2016 at 6:58 PM, Michael Armbrust 
> wrote:
>
>> +1
>>
>> On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a
>>> majority of at least 3+1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.6.3
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v1.6.3-rc2 (1e860747458d74a4ccbd081103a0542a2367b14b)
>>>
>>> This release candidate addresses 52 JIRA tickets:
>>> https://s.apache.org/spark-1.6.3-jira
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1212/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-docs/
>>>
>>>
>>> ===
>>> == How can I help test this release?
>>> ===
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions from 1.6.2.
>>>
>>> 
>>> == What justifies a -1 vote for this release?
>>> 
>>> This is a maintenance release in the 1.6.x series.  Bugs already present
>>> in 1.6.2, missing features, or bugs related to new features will not
>>> necessarily block this release.
>>>
>>>
>>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Yin Huai
+1

On Thu, Sep 29, 2016 at 4:07 PM, Luciano Resende 
wrote:

> +1 (non-binding)
>
> On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a
>> majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.1-rc4 (933d2c1ea4e5f5c4ec8d375b5ccaa4577ba4be38)
>>
>> This release candidate resolves 301 issues:
>> https://s.apache.org/spark-2.0.1-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1203/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.0.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series.  Bugs already
>> present in 2.0.0, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
>> (i.e. RC5) is cut, I will change the fix version of those patches to 2.0.1.
>>
>>
>>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-25 Thread Yin Huai
+1

On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun  wrote:

> +1 (non binding)
>
> RC3 is compiled and tested on the following two systems, too. All tests
> passed.
>
> * CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1
>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
> -Dsparkr
> * CentOS 7.2 / Open JDK 1.8.0_102
>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>
> Cheers,
> Dongjoon
>
>
>
> On Saturday, September 24, 2016, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and passes if
>> a majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.1-rc3 (9d28cc10357a8afcfb2fa2e6eecb5c2cc2730d17)
>>
>> This release candidate resolves 290 issues:
>> https://s.apache.org/spark-2.0.1-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1201/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.0.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series.  Bugs already
>> present in 2.0.0, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.1.
>>
>>
>>


Re: [master] ERROR RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)

2016-08-17 Thread Yin Huai
Yea. Please create a jira. Thanks!

On Tue, Aug 16, 2016 at 11:06 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> On Tue, Aug 16, 2016 at 10:51 PM, Yin Huai <yh...@databricks.com> wrote:
>
> > Do you want to try it?
>
> Yes, indeed! I'd be more than happy. Guide me if you don't mind. Thanks.
>
> Should I create a JIRA for this?
>
> Jacek
>


Re: [master] ERROR RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)

2016-08-16 Thread Yin Huai
Hi Jacek,

We will try to create the default database if it does not exist. Hive
actually relies on that AlreadyExistsException to determine if a db already
exists, and ignores the error to implement the logic of "CREATE DATABASE IF
NOT EXISTS". So, that message does not mean anything bad happened. I think
we can avoid having this error log by changing
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala#L84-L91.
Basically, we would check if the default db already exists and only call
create database if it does not. Do you want to try it?
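
A rough sketch of the idea (illustrative only; the method and variable names
here are approximate, not the exact code):

  // Only attempt to create the default database when the metastore does not
  // already have it, so Hive never hits the AlreadyExistsException path and
  // never logs that error.
  if (!externalCatalog.databaseExists(defaultDbName)) {
    externalCatalog.createDatabase(defaultDbDefinition, ignoreIfExists = true)
  }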

Thanks,

Yin

On Tue, Aug 16, 2016 at 7:33 PM, Jacek Laskowski  wrote:

> Hi,
>
> I'm working with today's build and am facing the issue:
>
> scala> Seq(A(4)).toDS
> 16/08/16 19:26:26 ERROR RetryingHMSHandler:
> AlreadyExistsException(message:Database default already exists)
> at org.apache.hadoop.hive.metastore.HiveMetaStore$
> HMSHandler.create_database(HiveMetaStore.java:891)
> ...
>
> res1: org.apache.spark.sql.Dataset[A] = [id: int]
>
> scala> spark.version
> res2: String = 2.1.0-SNAPSHOT
>
> See the complete stack trace at
> https://gist.github.com/jaceklaskowski/a969fdd5c2c9cdb736bf647b01257a3e.
>
> I'm quite positive that it didn't happen a day or two ago.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-23 Thread Yin Huai
-1 because of https://issues.apache.org/jira/browse/SPARK-16121.

This jira was resolved after 2.0.0-RC1 was cut. Without the fix, Spark
SQL effectively only uses the driver to list files when loading datasets
and the driver-side file listing is very slow for datasets having many
files and partitions. Since this bug causes a serious performance
regression, I am giving -1.

On Thu, Jun 23, 2016 at 1:25 AM, Pete Robbins  wrote:

> I'm also seeing some of these same failures:
>
> - spilling with compression *** FAILED ***
> I have seen this occasionally
>
> - to UTC timestamp *** FAILED ***
> This was fixed yesterday in branch-2.0 (
> https://issues.apache.org/jira/browse/SPARK-16078)
>
> - offset recovery *** FAILED ***
> Haven't seen this for a while and thought the flaky test was fixed but it
> popped up again in one of our builds.
>
> StateStoreSuite:
> - maintenance *** FAILED ***
> Just seen this has been failing for last 2 days on one build machine
> (linux amd64)
>
>
> On 23 June 2016 at 08:51, Sean Owen  wrote:
>
>> First pass of feedback on the RC: all the sigs, hashes, etc are fine.
>> Licensing is up to date to the best of my knowledge.
>>
>> I'm hitting test failures, some of which may be spurious. Just putting
>> them out there to see if they ring bells. This is Java 8 on Ubuntu 16.
>>
>>
>> - spilling with compression *** FAILED ***
>>   java.lang.Exception: Test failed with compression using codec
>> org.apache.spark.io.SnappyCompressionCodec:
>> assertion failed: expected cogroup to spill, but did not
>>   at scala.Predef$.assert(Predef.scala:170)
>>   at org.apache.spark.TestUtils$.assertSpilled(TestUtils.scala:170)
>>   at org.apache.spark.util.collection.ExternalAppendOnlyMapSuite.org
>> $apache$spark$util$collection$ExternalAppendOnlyMapSuite$$testSimpleSpilling(ExternalAppendOnlyMapSuite.scala:263)
>> ...
>>
>> I feel like I've seen this before, and see some possibly relevant
>> fixes, but they're in 2.0.0 already:
>> https://github.com/apache/spark/pull/10990
>> Is this something where a native library needs to be installed or
>> something?
>>
>>
>> - to UTC timestamp *** FAILED ***
>>   "2016-03-13 [02]:00:00.0" did not equal "2016-03-13 [10]:00:00.0"
>> (DateTimeUtilsSuite.scala:506)
>>
>> I know, we talked about this for the 1.6.2 RC, but I reproduced this
>> locally too. I will investigate, could still be spurious.
>>
>>
>> StateStoreSuite:
>> - maintenance *** FAILED ***
>>   The code passed to eventually never returned normally. Attempted 627
>> times over 10.000180116 seconds. Last failure message:
>> StateStoreSuite.this.fileExists(provider, 1L, false) was true earliest
>> file not deleted. (StateStoreSuite.scala:395)
>>
>> No idea.
>>
>>
>> - offset recovery *** FAILED ***
>>   The code passed to eventually never returned normally. Attempted 197
>> times over 10.040864806 seconds. Last failure message:
>> strings.forall({
>> ((x$1: Any) => DirectKafkaStreamSuite.collectedData.contains(x$1))
>>   }) was false. (DirectKafkaStreamSuite.scala:250)
>>
>> Also something that was possibly fixed already for 2.0.0 and that I
>> just back-ported into 1.6. Could be just a very similar failure.
>>
>> On Wed, Jun 22, 2016 at 2:26 AM, Reynold Xin  wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 2.0.0. The vote is open until Friday, June 24, 2016 at 19:00 PDT and
>> passes
>> > if a majority of at least 3+1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.0.0
>> > [ ] -1 Do not release this package because ...
>> >
>> >
>> > The tag to be voted on is v2.0.0-rc1
>> > (0c66ca41afade6db73c9aeddd5aed6e5dcea90df).
>> >
>> > This release candidate resolves ~2400 issues:
>> > https://s.apache.org/spark-2.0.0-rc1-jira
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1187/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/
>> >
>> >
>> > ===
>> > == How can I help test this release? ==
>> > ===
>> > If you are a Spark user, you can help us test this release by taking an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions from 1.x.
>> >
>> > 
>> > == What justifies a -1 vote for this release? ==
>> > 
>> > Critical bugs impacting major functionalities.
>> >
>> > Bugs already present in 1.x, 

Re: Inconsistent joinWith behavior?

2016-06-20 Thread Yin Huai
Hello Richard,

Looks like the Dataset is a Dataset[(Int, Int)]. I guess that in the case of
"ds.joinWith(other, expr, Outer).map({ case (t, u) => (Option(t),
Option(u)) })", we are trying to use null to create a "(Int, Int)" and it
somehow ends up as a tuple2 with default values.

Can you create a jira? We will investigate the issue.
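
For reference, a hypothetical minimal repro sketch of the behavior described
above (names are made up; assumes a SparkSession named spark):

  import spark.implicits._

  val left  = Seq((1, 10)).toDS()
  val right = Seq((2, 20)).toDS()

  // Full outer join, so each row has an unmatched side.
  val joined = left.joinWith(right, left("_1") === right("_1"), "full_outer")

  // Mapping through the (Int, Int) encoder can surface default values
  // (0, 0) for the missing side instead of null, so Option(u) becomes
  // Some((0, 0)) rather than None.
  val wrapped = joined.map { case (t, u) => (Option(t), Option(u)) }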

Thanks!

Yin

On Mon, Jun 20, 2016 at 8:21 AM, Richard Marscher 
wrote:

> I know recently outer join was changed to preserve actual nulls through
> the join in https://github.com/apache/spark/pull/13425. I am seeing what
> seems like inconsistent behavior though based on how the join is interacted
> with. In one case the default datatype values are still used instead of
> nulls whereas the other case passes the nulls through. I have a small
> databricks notebook showing the case against 2.0 preview:
>
>
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/4268263383756277/673639177603143/latest.html
>
> --
> *Richard Marscher*
> Senior Software Engineer
> Localytics
> Localytics.com  | Our Blog
>  | Twitter  |
> Facebook  | LinkedIn
> 
>


Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Yin Huai
+1

On Wed, May 18, 2016 at 10:49 AM, Reynold Xin  wrote:

> Hi Ovidiu-Cristian ,
>
> The best source of truth is change the filter with target version to
> 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we
> get closer to 2.0 release, more will be retargeted at 2.1.0.
>
>
>
> On Wed, May 18, 2016 at 10:43 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
>> Yes, I can filter..
>> Did that and for example:
>>
>> https://issues.apache.org/jira/browse/SPARK-15370?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20affectedVersion%20%3D%202.0.0
>> 
>>
>> To rephrase: for 2.0, do you have specific issues that are not a priority
>> and will be released maybe with 2.1, for example?
>>
>> Keep up the good work!
>>
>> On 18 May 2016, at 18:19, Reynold Xin  wrote:
>>
>> You can find that by changing the filter to target version = 2.0.0.
>> Cheers.
>>
>> On Wed, May 18, 2016 at 9:00 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> +1 Great, I see the list of resolved issues; do you have a list of known
>>> issues you plan to ship with this release?
>>>
>>> with
>>> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Phive
>>> -Phive-thriftserver -DskipTests clean package
>>>
>>> mvn -version
>>> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5;
>>> 2015-11-10T17:41:47+01:00)
>>> Maven home: /Users/omarcu/tools/apache-maven-3.3.9
>>> Java version: 1.7.0_80, vendor: Oracle Corporation
>>> Java home:
>>> /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home/jre
>>> Default locale: en_US, platform encoding: UTF-8
>>> OS name: "mac os x", version: "10.11.5", arch: "x86_64", family: “mac"
>>>
>>> [INFO] Reactor Summary:
>>> [INFO]
>>> [INFO] Spark Project Parent POM ... SUCCESS [
>>> 2.635 s]
>>> [INFO] Spark Project Tags . SUCCESS [
>>> 1.896 s]
>>> [INFO] Spark Project Sketch ... SUCCESS [
>>> 2.560 s]
>>> [INFO] Spark Project Networking ... SUCCESS [
>>> 6.533 s]
>>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>>> 4.176 s]
>>> [INFO] Spark Project Unsafe ... SUCCESS [
>>> 4.809 s]
>>> [INFO] Spark Project Launcher . SUCCESS [
>>> 6.242 s]
>>> [INFO] Spark Project Core . SUCCESS
>>> [01:20 min]
>>> [INFO] Spark Project GraphX ... SUCCESS [
>>> 9.148 s]
>>> [INFO] Spark Project Streaming  SUCCESS [
>>> 22.760 s]
>>> [INFO] Spark Project Catalyst . SUCCESS [
>>> 50.783 s]
>>> [INFO] Spark Project SQL .. SUCCESS
>>> [01:05 min]
>>> [INFO] Spark Project ML Local Library . SUCCESS [
>>> 4.281 s]
>>> [INFO] Spark Project ML Library ... SUCCESS [
>>> 54.537 s]
>>> [INFO] Spark Project Tools  SUCCESS [
>>> 0.747 s]
>>> [INFO] Spark Project Hive . SUCCESS [
>>> 33.032 s]
>>> [INFO] Spark Project HiveContext Compatibility  SUCCESS [
>>> 3.198 s]
>>> [INFO] Spark Project REPL . SUCCESS [
>>> 3.573 s]
>>> [INFO] Spark Project YARN Shuffle Service . SUCCESS [
>>> 4.617 s]
>>> [INFO] Spark Project YARN . SUCCESS [
>>> 7.321 s]
>>> [INFO] Spark Project Hive Thrift Server ... SUCCESS [
>>> 16.496 s]
>>> [INFO] Spark Project Assembly . SUCCESS [
>>> 2.300 s]
>>> [INFO] Spark Project External Flume Sink .. SUCCESS [
>>> 4.219 s]
>>> [INFO] Spark Project External Flume ... SUCCESS [
>>> 6.987 s]
>>> [INFO] Spark Project External Flume Assembly .. SUCCESS [
>>> 1.465 s]
>>> [INFO] Spark Integration for Kafka 0.8  SUCCESS [
>>> 6.891 s]
>>> [INFO] Spark Project Examples . SUCCESS [
>>> 13.465 s]
>>> [INFO] Spark Project External Kafka Assembly .. SUCCESS [
>>> 2.815 s]
>>> [INFO]
>>> 
>>> [INFO] BUILD SUCCESS
>>> [INFO]
>>> 
>>> [INFO] Total time: 07:04 min
>>> [INFO] Finished at: 2016-05-18T17:55:33+02:00
>>> [INFO] Final Memory: 90M/824M
>>> [INFO]
>>> 
>>>
>>> On 18 May 2016, at 16:28, Sean Owen  wrote:
>>>
>>> I think it's a good idea. Although releases have been preceded before
>>> by release candidates for developers, it would be 

Re: HiveContext.refreshTable() missing in spark 2.0

2016-05-17 Thread Yin Huai
Hi Yang,

I think it was deleted accidentally while we were working on the API
migration. We will add it back
(https://issues.apache.org/jira/browse/SPARK-15367).

Thanks,

Yin

On Fri, May 13, 2016 at 2:47 AM, 汪洋  wrote:

> Hi all,
>
> I noticed that HiveContext used to have a refreshTable() method, but it
> doesn't in branch-2.0.
>
> Did we drop it intentionally? If yes, how do we achieve similar
> functionality?
>
> Thanks.
>
> Yang
>


Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-08 Thread Yin Huai
+1

On Mon, Mar 7, 2016 at 12:39 PM, Reynold Xin  wrote:

> +1 (binding)
>
>
> On Sun, Mar 6, 2016 at 12:08 PM, Egor Pahomov 
> wrote:
>
>> +1
>>
>> Spark ODBC server is fine, SQL is fine.
>>
>> 2016-03-03 12:09 GMT-08:00 Yin Yang :
>>
>>> Skipping docker tests, the rest are green:
>>>
>>> [INFO] Spark Project External Kafka ... SUCCESS
>>> [01:28 min]
>>> [INFO] Spark Project Examples . SUCCESS
>>> [02:59 min]
>>> [INFO] Spark Project External Kafka Assembly .. SUCCESS [
>>> 11.680 s]
>>> [INFO]
>>> 
>>> [INFO] BUILD SUCCESS
>>> [INFO]
>>> 
>>> [INFO] Total time: 02:16 h
>>> [INFO] Finished at: 2016-03-03T11:17:07-08:00
>>> [INFO] Final Memory: 152M/4062M
>>>
>>> On Thu, Mar 3, 2016 at 8:55 AM, Yin Yang  wrote:
>>>
 When I ran test suite using the following command:

 build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
 -Dhadoop.version=2.7.0 package

 I got failure in Spark Project Docker Integration Tests :

 16/03/02 17:36:46 INFO RemoteActorRefProvider$RemotingTerminator:
 Remote daemon shut down; proceeding with flushing remote transports.
 *** RUN ABORTED ***
   com.spotify.docker.client.DockerException: java.util.concurrent.ExecutionException: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: java.io.IOException: No such file or directory
   at com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)
   at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)
   at com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
   at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
   at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
   ...
   Cause: java.util.concurrent.ExecutionException: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: java.io.IOException: No such file or directory
   at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
   at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
   at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
   at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)
   at com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
   at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
   ...
   Cause: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: java.io.IOException: No such file or directory
   at org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481)
   at org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at jersey.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299)
   at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
   at
 

Re: spark hivethriftserver problem on 1.5.0 -> 1.6.0 upgrade

2016-01-26 Thread Yin Huai
Can you post more logs, especially the lines around "Initializing execution
hive ..." (this is an internally used fake metastore backed by Derby) and
"Initializing HiveMetastoreConnection version ..." (this is the real
metastore; it should be your remote one)? Also, temp tables are stored in
memory and are associated with a particular HiveContext. If you cannot see
the temp tables, it usually means the HiveContext you used over JDBC was
different from the one used to create them. However, in your case you are
using HiveThriftServer2.startWithContext(hiveContext), so it would be good
to see more logs to figure out what happened.
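
For reference, a minimal sketch (Spark 1.6 APIs; table and path names are
placeholders) of the pattern where the temp table and the Thrift server share
the same HiveContext, which is what JDBC clients need in order to see it:

```
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)                  // sc: existing SparkContext
val df = hiveContext.read.parquet("hdfs:///some/path") // placeholder path
df.registerTempTable("thing")                          // lives only in this HiveContext
hiveContext.sql("SHOW TABLES").show()                  // "thing" should be listed here
HiveThriftServer2.startWithContext(hiveContext)        // JDBC sessions reuse this context
```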

Thanks,

Yin

On Tue, Jan 26, 2016 at 1:33 AM, james.gre...@baesystems.com <
james.gre...@baesystems.com> wrote:

> Hi
>
> I posted this on the user list yesterday; I am posting it here now
> because on further investigation I am pretty sure this is a bug:
>
>
> On upgrade from 1.5.0 to 1.6.0 I have a problem with the
> hivethriftserver2, I have this code:
>
> val hiveContext = new HiveContext(SparkContext.getOrCreate(conf));
>
> val thing = hiveContext.read.parquet("hdfs://
> dkclusterm1.imp.net:8020/user/jegreen1/ex208")
>
> thing.registerTempTable("thing")
>
> HiveThriftServer2.startWithContext(hiveContext)
>
>
> When I start things up on the cluster my hive-site.xml is found – I can
> see that the metastore connects:
>
>
> INFO  metastore - Trying to connect to metastore with URI thrift://
> dkclusterm2.imp.net:9083
> INFO  metastore - Connected to metastore.
>
>
> But then later on the thrift server seems not to connect to the remote
> hive metastore but to start a derby instance instead:
>
> INFO  AbstractService - Service:CLIService is started.
> INFO  ObjectStore - ObjectStore, initialize called
> INFO  Query - Reading in results for query
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used
> is closing
> INFO  MetaStoreDirectSql - Using direct SQL, underlying DB is DERBY
> INFO  ObjectStore - Initialized ObjectStore
> INFO  HiveMetaStore - 0: get_databases: default
> INFO  audit - ugi=jegreen1  ip=unknown-ip-addr  cmd=get_databases:
> default
> INFO  HiveMetaStore - 0: Shutting down the object store...
> INFO  audit - ugi=jegreen1  ip=unknown-ip-addr  cmd=Shutting down
> the object store...
> INFO  HiveMetaStore - 0: Metastore shutdown complete.
> INFO  audit - ugi=jegreen1  ip=unknown-ip-addr  cmd=Metastore
> shutdown complete.
> INFO  AbstractService - Service:ThriftBinaryCLIService is started.
> INFO  AbstractService - Service:HiveServer2 is started.
>
> On 1.5.0 the same bit of the log reads:
>
> INFO  AbstractService - Service:CLIService is started.
> INFO  metastore - Trying to connect to metastore with URI thrift://
> dkclusterm2.imp.net:9083  *** ie 1.5.0 connects to remote hive
> INFO  metastore - Connected to metastore.
> INFO  AbstractService - Service:ThriftBinaryCLIService is started.
> INFO  AbstractService - Service:HiveServer2 is started.
> INFO  ThriftCLIService - Starting ThriftBinaryCLIService on port 1
> with 5...500 worker threads
>
>
>
> So if I connect to this with JDBC I can see all the tables on the hive
> server – but not anything temporary – I guess they are going to derby.
>
> I see someone on the databricks website is also having this problem.
>
>
> Thanks
>
> James
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Yin Huai
+1

On Tue, Dec 22, 2015 at 8:10 PM, Denny Lee  wrote:

> +1
>
> On Tue, Dec 22, 2015 at 7:05 PM Aaron Davidson  wrote:
>
>> +1
>>
>> On Tue, Dec 22, 2015 at 7:01 PM, Josh Rosen 
>> wrote:
>>
>>> +1
>>>
>>> On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang  wrote:
>>>
 +1

 On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra 
 wrote:

> +1
>
> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <
> mich...@databricks.com> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark
>> version 1.6.0!
>>
>> The vote is open until Friday, December 25, 2015 at 18:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc4
>> (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
>> *
>>
>> The release files, including signatures, digests, etc. can be found
>> at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1176/
>>
>> The test repository (versioned as v1.6.0-rc4) for this release can be
>> found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1175/
>>
>> The documentation corresponding to this release can be found at:
>>
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
>>
>> ===
>> == How can I help test this release? ==
>> ===
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> 
>> == What justifies a -1 vote for this release? ==
>> 
>> This vote is happening towards the end of the 1.6 QA period, so -1
>> votes should only occur for significant regressions from 1.5. Bugs 
>> already
>> present in 1.5, minor regressions, or bugs related to new features will 
>> not
>> block this release.
>>
>> ===
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===
>> 1. It is OK for documentation patches to target 1.6.0 and still go
>> into branch-1.6, since documentations will be published separately from 
>> the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>> target version.
>>
>>
>> ==
>> == Major changes to help you focus your testing ==
>> ==
>>
>> Notable changes since 1.6 RC3
>>
>>   - SPARK-12404 - Fix serialization error for Datasets with
>> Timestamps/Arrays/Decimal
>>   - SPARK-12218 - Fix incorrect pushdown of filters to parquet
>>   - SPARK-12395 - Fix join columns of outer join for DataFrame using
>>   - SPARK-12413 - Fix mesos HA
>>
>> Notable changes since 1.6 RC2
>> - SPARK_VERSION has been set correctly
>> - SPARK-12199 ML Docs are publishing correctly
>> - SPARK-12345 Mesos cluster mode has been fixed
>>
>> Notable changes since 1.6 RC1
>> Spark Streaming
>>
>>- SPARK-2629  
>>trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>>- SPARK-12165 
>>SPARK-12189  Fix
>>bugs in eviction of storage memory by execution.
>>- SPARK-12258  
>> correct
>>passing null into ScalaUDF
>>
>> Notable Features Since 1.5
>> Spark SQL
>>
>>- SPARK-11787  
>> Parquet
>>Performance - Improve Parquet scan performance when using flat
>>schemas.
>>- SPARK-10810 
>>Session Management - Isolated 

Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

2015-12-20 Thread Yin Huai
Hi Jerry,

Looks like https://issues.apache.org/jira/browse/SPARK-11739 covers the
issue you described; it has been fixed in 1.6. With that change, when you
call SQLContext.getOrCreate(sc2), we first check whether the SparkContext
backing the cached SQLContext (sc in your example) has been stopped. If so,
we create a new SQLContext using sc2.
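
A short sketch of the post-fix behavior (Spark 1.6; the SparkConf is a
placeholder):

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("getOrCreate-example")
val sc = new SparkContext(conf)
val sqlContext = SQLContext.getOrCreate(sc)
sc.stop()

val sc2 = new SparkContext(conf)
// With SPARK-11739, getOrCreate notices the cached SQLContext's SparkContext
// has been stopped and creates a new SQLContext backed by sc2.
val sqlContext2 = SQLContext.getOrCreate(sc2)
```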

Thanks,

Yin

On Sun, Dec 20, 2015 at 2:59 PM, Jerry Lam  wrote:

> Hi Spark developers,
>
> I found that SQLContext.getOrCreate(sc: SparkContext) does not behave
> correctly when a different spark context is provided.
>
> ```
> val sc = new SparkContext
> val sqlContext = SQLContext.getOrCreate(sc)
> sc.stop
> ...
>
> val sc2 = new SparkContext
> val sqlContext2 = SQLContext.getOrCreate(sc2)
> sc2.stop
> ```
>
> The sqlContext2 will reference sc instead of sc2 and therefore, the
> program will not work because sc has been stopped.
>
> Best Regards,
>
> Jerry
>


Re: ​Spark 1.6 - H​ive remote metastore not working

2015-12-16 Thread Yin Huai
Oh, I see. In your log, you should be able to find a line like "Initializing
execution hive, version". The lines you showed are actually associated with
the execution hive, which is a fake metastore used by Spark SQL internally.
Logs related to the real metastore (the one storing table metadata, etc.)
start from the line "Initializing HiveMetastoreConnection version 1.2.1
using Spark classes."

Hope this is helpful.





On Wed, Dec 16, 2015 at 12:05 PM, syepes  wrote:

> Thanks for the reply.
>
> The thing is that with 1.5 it never showed messages like the following:
>
> 15/12/16 00:06:11 INFO MetaStoreDirectSql: Using direct SQL, underlying DB
> is DERBY
> 15/12/16 00:06:11 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
>
> This is a bit misleading.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-6-H-ive-remote-metastore-not-working-tp15634p15656.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: ​Spark 1.6 - H​ive remote metastore not working

2015-12-16 Thread Yin Huai
I see

15/12/16 00:06:13 INFO metastore: Trying to connect to metastore with URI
thrift://remoteNode:9083
15/12/16 00:06:14 INFO metastore: Connected to metastore.

Looks like you were connected to your remote metastore.

On Tue, Dec 15, 2015 at 3:31 PM, syepes  wrote:

> ​Hello,
>
> I am testing out the 1.6 branch (#08aa3b4) and I have just noticed that
> spark-shell "HiveContext" is no longer able to connect to my remote
> metastore.
> Using the same build options and configuration files with 1.5 (#0fdf554) it
> works.
>
> Does anyone know if there have been any mayor changes on this component or
> any new config that's needed to make this work?
>
> spark-shell:
> --
> ...
> 15/12/16 00:06:06 INFO Persistence: Property
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 15/12/16 00:06:06 INFO Persistence: Property datanucleus.cache.level2
> unknown - will be ignored
> 15/12/16 00:06:06 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
> 15/12/16 00:06:06 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
> 15/12/16 00:06:08 INFO ObjectStore: Setting MetaStore object pin classes with
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 15/12/16 00:06:09 INFO Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
> "embedded-only" so does not have its own datastore table.
> 15/12/16 00:06:09 INFO Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only"
> so does not have its own datastore table.
> 15/12/16 00:06:11 INFO Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
> "embedded-only" so does not have its own datastore table.
> 15/12/16 00:06:11 INFO Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only"
> so does not have its own datastore table.
> 15/12/16 00:06:11 INFO MetaStoreDirectSql: Using direct SQL, underlying DB
> is DERBY
> 15/12/16 00:06:11 INFO ObjectStore: Initialized ObjectStore
> 15/12/16 00:06:11 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
> 15/12/16 00:06:11 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> 15/12/16 00:06:11 INFO HiveMetaStore: Added admin role in metastore
> ..
> ..
> 15/12/16 00:06:12 INFO HiveContext: Initializing HiveMetastoreConnection
> version 1.2.1 using Spark classes.
> 15/12/16 00:06:12 INFO ClientWrapper: Inspected Hadoop version: 2.7.1
> 15/12/16 00:06:12 INFO ClientWrapper: Loaded
> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.1
> 15/12/16 00:06:13 INFO metastore: Trying to connect to metastore with URI
> thrift://remoteNode:9083
> 15/12/16 00:06:14 INFO metastore: Connected to metastore.
> 15/12/16 00:06:14 INFO SessionState: Created local directory:
> /tmp/c3a4afbb-e4cf-4d20-85a0-01a53074efc8_resources
> 15/12/16 00:06:14 INFO SessionState: Created HDFS directory:
> /tmp/hive/syepes/c3a4afbb-e4cf-4d20-85a0-01a53074efc8
> 15/12/16 00:06:14 INFO SessionState: Created local directory:
> /tmp/root/c3a4afbb-e4cf-4d20-85a0-01a53074efc8
> 15/12/16 00:06:14 INFO SessionState: Created HDFS directory:
> /tmp/hive/syepes/c3a4afbb-e4cf-4d20-85a0-01a53074efc8/_tmp_space.db
> 15/12/16 00:06:14 INFO SparkILoop: Created sql context (with Hive
> support)..
> SQL context available as sqlContext.
> ..
> --
>
> hive-site.xml
> ---
> <configuration>
>   <property>
>     <name>hive.metastore.uris</name>
>     <value>thrift://remoteNode:9083</value>
>   </property>
> </configuration>
> ---
>
>
> Regards and thanks in advance for any info,
> Sebastian
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-6-H-ive-remote-metastore-not-working-tp15634.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-16 Thread Yin Huai
+1

On Wed, Dec 16, 2015 at 7:19 PM, Patrick Wendell  wrote:

> +1
>
> On Wed, Dec 16, 2015 at 6:15 PM, Ted Yu  wrote:
>
>> Ran test suite (minus docker-integration-tests)
>> All passed
>>
>> +1
>>
>> [INFO] Spark Project External ZeroMQ .. SUCCESS [
>> 13.647 s]
>> [INFO] Spark Project External Kafka ... SUCCESS [
>> 45.424 s]
>> [INFO] Spark Project Examples . SUCCESS
>> [02:06 min]
>> [INFO] Spark Project External Kafka Assembly .. SUCCESS [
>> 11.280 s]
>> [INFO]
>> 
>> [INFO] BUILD SUCCESS
>> [INFO]
>> 
>> [INFO] Total time: 01:49 h
>> [INFO] Finished at: 2015-12-16T17:06:58-08:00
>>
>> On Wed, Dec 16, 2015 at 4:37 PM, Andrew Or  wrote:
>>
>>> +1
>>>
>>> Mesos cluster mode regression in RC2 is now fixed (SPARK-12345
>>>  / PR10332
>>> ).
>>>
>>> Also tested on standalone client and cluster mode. No problems.
>>>
>>> 2015-12-16 15:16 GMT-08:00 Rad Gruchalski :
>>>
 I also noticed that spark.replClassServer.host and
 spark.replClassServer.port aren’t used anymore. The transport now happens
 over the main RpcEnv.

 Kind regards,
 Radek Gruchalski
 ra...@gruchalski.com 
 de.linkedin.com/in/radgruchalski/



 On Wednesday, 16 December 2015 at 23:43, Marcelo Vanzin wrote:

 I was going to say that spark.executor.port is not used anymore in
 1.6, but damn, there's still that akka backend hanging around there
 even when netty is being used... we should fix this, should be a
 simple one-liner.

 On Wed, Dec 16, 2015 at 2:35 PM, singinpirate <
 thesinginpir...@gmail.com> wrote:

 -0 (non-binding)

 I have observed that when we set spark.executor.port in 1.6, an NPE is
 thrown in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2.
 Is anyone else seeing this?


 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



>>>
>>
>


Re: [build system] brief downtime right now

2015-12-14 Thread Yin Huai
Hi Shane,

It seems Spark's lint-r check started failing as of
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/4260/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/console.
Is this related to the R upgrade work?

Thanks,

Yin

On Mon, Dec 14, 2015 at 11:55 AM, shane knapp  wrote:

> ...and we're back.  we were getting reverse proxy timeouts, which seem
> to have been caused by jenkins churning and doing a lot of IO.  i'll
> dig in to the logs and see if i can find out what happened.
>
> weird.
>
> shane
>
> On Mon, Dec 14, 2015 at 11:51 AM, shane knapp  wrote:
> > something is up w/apache.  looking.
> >
> > On Mon, Dec 14, 2015 at 11:37 AM, shane knapp 
> wrote:
> >> after killing and restarting jenkins, things seem to be VERY slow.
> >> i'm gonna kick jenkins again and see if that helps.
> >>
> >>
> >>
> >> On Mon, Dec 14, 2015 at 11:26 AM, shane knapp 
> wrote:
> >>> ok, we're back up and building.
> >>>
> >>> On Mon, Dec 14, 2015 at 10:31 AM, shane knapp 
> wrote:
>  last week i forgot to downgrade R to 3.1.1, and since there's not much
>  activity right now, i'm going to take jenkins down and finish up the
>  ticket.
> 
>  https://issues.apache.org/jira/browse/SPARK-11255
> 
>  we should be back up and running within 30 minutes.
> 
>  thanks!
> 
>  shane
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-12 Thread Yin Huai
+1

Critical and blocker issues of SQL have been addressed.

On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust 
wrote:

> I'll kick off the voting with a +1.
>
> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc2
>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>> *
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>
>> The test repository (versioned as v1.6.0-rc2) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>
>> ===
>> == How can I help test this release? ==
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> 
>> == What justifies a -1 vote for this release? ==
>> 
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentations will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==
>> == Major changes to help you focus your testing ==
>> ==
>>
>> Spark 1.6.0 Preview
>>
>> Notable changes since 1.6 RC1
>> Spark Streaming
>>
>>- SPARK-2629  
>>trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>>- SPARK-12165 
>>SPARK-12189  Fix
>>bugs in eviction of storage memory by execution.
>>- SPARK-12258  correct
>>passing null into ScalaUDF
>>
>> Notable Features Since 1.5
>> Spark SQL
>>
>>- SPARK-11787  Parquet
>>Performance - Improve Parquet scan performance when using flat
>>schemas.
>>- SPARK-10810 
>>Session Management - Isolated default database (i.e. USE mydb) even on
>>shared clusters.
>>- SPARK-   Dataset
>>API - A type-safe API (similar to RDDs) that performs many operations
>>on serialized binary data and code generation (i.e. Project Tungsten).
>>- SPARK-1  Unified
>>Memory Management - Shared memory for execution and caching instead
>>of exclusive division of the regions.
>>- SPARK-11197  SQL
>>Queries on Files - Concise syntax for running SQL queries over files
>>of any supported format without registering a table.
>>- SPARK-11745  Reading
>>non-standard JSON files - Added options to read non-standard JSON
>>files (e.g. single-quotes, unquoted attributes)
>>- SPARK-10412  
>> Per-operator
>>Metrics for SQL Execution - Display statistics on a peroperator basis
>>for memory usage and spilled data size.
>>- SPARK-11329  Star
>>(*) expansion 

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-06 Thread Yin Huai
-1

Two blocker bugs have been found after this RC was cut.
https://issues.apache.org/jira/browse/SPARK-12089 can cause data corruption
when an external sorter spills data.
https://issues.apache.org/jira/browse/SPARK-12155 can prevent tasks from
acquiring memory even when the executor indeed can allocate memory by
evicting storage memory.

https://issues.apache.org/jira/browse/SPARK-12089 has been fixed. We are
still working on https://issues.apache.org/jira/browse/SPARK-12155.

On Fri, Dec 4, 2015 at 3:04 PM, Mark Hamstra 
wrote:

> 0
>
> Currently figuring out who is responsible for the regression that I am
> seeing in some user code ScalaUDFs that make use of Timestamps and where
> NULL from a CSV file read in via a TestHive#registerTestTable is now
> producing 1969-12-31 23:59:59.99 instead of null.
>
> On Thu, Dec 3, 2015 at 1:57 PM, Sean Owen  wrote:
>
>> Licenses and signature are all fine.
>>
>> Docker integration tests consistently fail for me with Java 7 / Ubuntu
>> and "-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver"
>>
>> *** RUN ABORTED ***
>>   java.lang.NoSuchMethodError:
>>
>> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>>   at
>> org.glassfish.jersey.apache.connector.ApacheConnector.(ApacheConnector.java:240)
>>   at
>> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>>   at
>> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>>   at
>> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>>   at
>> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>>   at
>> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>>
>> I also get this failure consistently:
>>
>> DirectKafkaStreamSuite
>> - offset recovery *** FAILED ***
>>   recoveredOffsetRanges.forall(((or: (org.apache.spark.streaming.Time,
>> Array[org.apache.spark.streaming.kafka.OffsetRange])) =>
>>
>> earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time,
>>
>> scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1,
>>
>> scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.OffsetRange](or._2).toSet[org.apache.spark.streaming.kafka.OffsetRange]
>> was false Recovered ranges are not the same as the ones generated
>> (DirectKafkaStreamSuite.scala:301)
>>
>> On Wed, Dec 2, 2015 at 8:26 PM, Michael Armbrust 
>> wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.6.0!
>> >
>> > The vote is open until Saturday, December 5, 2015 at 21:00 UTC and
>> passes if
>> > a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.6.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v1.6.0-rc1
>> > (bf525845cef159d2d4c9f4d64e158f037179b5c4)
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1165/
>> >
>> > The test repository (versioned as v1.6.0-rc1) for this release can be
>> found
>> > at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1164/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc1-docs/
>> >
>> >
>> > ===
>> > == How can I help test this release? ==
>> > ===
>> > If you are a Spark user, you can help us test this release by taking an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > 
>> > == What justifies a -1 vote for this release? ==
>> > 
>> > This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> > should only occur for significant regressions from 1.5. Bugs already
>> present
>> > in 1.5, minor regressions, or bugs related to new features 

Re: IntelliJ license for committers?

2015-12-02 Thread Yin Huai
I think they can renew your license. In
https://www.jetbrains.com/buy/opensource/?product=idea, you can find
"Update Open Source License".

On Wed, Dec 2, 2015 at 7:47 AM, Sean Owen  wrote:

> I'm aware that IntelliJ has (at least in the past) made licenses
> available to committers in bona fide open source projects, and I
> recall they did the same for Spark. I believe I'm using that license
> now, but it seems to have expired? If anyone knows the status of that
> (or of any renewals to the license), I wonder if you could share that
> with me, offline of course.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Seems jenkins is down (or very slow)?

2015-11-13 Thread Yin Huai
It was generally slow. But, after 5 or 10 minutes, it's all good.

On Fri, Nov 13, 2015 at 9:16 AM, shane knapp <skn...@berkeley.edu> wrote:

> were you hitting any particular URL when you noticed this, or was it
> generally slow?
>
> On Thu, Nov 12, 2015 at 6:21 PM, Yin Huai <yh...@databricks.com> wrote:
> > Hi Guys,
> >
> > Seems Jenkins is down or very slow? Does anyone else experience it or
> just
> > me?
> >
> > Thanks,
> >
> > Yin
>


Re: Seems jenkins is down (or very slow)?

2015-11-12 Thread Yin Huai
Seems it is back.

On Thu, Nov 12, 2015 at 6:21 PM, Yin Huai <yh...@databricks.com> wrote:

> Hi Guys,
>
> Seems Jenkins is down or very slow? Does anyone else experience it or just
> me?
>
> Thanks,
>
> Yin
>


Seems jenkins is down (or very slow)?

2015-11-12 Thread Yin Huai
Hi Guys,

Seems Jenkins is down or very slow? Does anyone else experience it or just
me?

Thanks,

Yin


Re: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Yin Huai
Hello Ewan,

Adding a JSON-specific option makes sense. Can you open a JIRA for this?
Also, sending out a PR would be great. For JSONRelation, I think we can pass
all user-specified options to it (see
org.apache.spark.sql.execution.datasources.json.DefaultSource's
createRelation), just like what we do for ParquetRelation. Then, inside
JSONRelation, we figure out which options have been specified.
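
As a hedged illustration of the user-facing side (the option name
primitivesAsString below is an assumption for this sketch, not a committed
API):

```
// Sketch: passing a JSON-specific option through the data source options map,
// so JSONRelation could choose to infer all primitive values as StringType.
val df = sqlContext.read
  .format("json")
  .option("primitivesAsString", "true")   // hypothetical option name
  .load("events.json")                    // placeholder path
```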

Thanks,

Yin

On Mon, Oct 5, 2015 at 9:04 AM, Ewan Leith <ewan.le...@realitymine.com>
wrote:

> I’ve done some digging today and, as a quick and ugly fix, altering the
> case statement of the JSON inferField function in InferSchema.scala
>
>
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
>
>
>
> to have
>
>
>
> case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT | VALUE_TRUE |
> VALUE_FALSE => StringType
>
>
>
> rather than the rules for each type works as we’d want.
>
>
>
> If we were to wrap this up in a configuration setting in JSONRelation like
> the samplingRatio setting, with the default being to behave as it currently
> works, does anyone think a pull request would plausibly get into the Spark
> main codebase?
>
>
>
> Thanks,
>
> Ewan
>
>
>
>
>
>
>
> *From:* Ewan Leith [mailto:ewan.le...@realitymine.com]
> *Sent:* 02 October 2015 01:57
> *To:* yh...@databricks.com
>
> *Cc:* r...@databricks.com; dev@spark.apache.org
> *Subject:* Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
>
> Exactly, that's a much better way to put it.
>
>
>
> Thanks,
>
> Ewan
>
>
>
> -- Original message--
>
> *From: *Yin Huai
>
> *Date: *Thu, 1 Oct 2015 23:54
>
> *To: *Ewan Leith;
>
> *Cc: *r...@databricks.com;dev@spark.apache.org;
>
> *Subject:*Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
>
> Hi Ewan,
>
>
>
> For your use case, you only need the schema inference to pick up the
> structure of your data (basically you want spark sql to infer the type of
> complex values like arrays and structs but keep the type of primitive
> values as strings), right?
>
>
>
> Thanks,
>
>
>
> Yin
>
>
>
> On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith <ewan.le...@realitymine.com>
> wrote:
>
> We could, but if a client sends some unexpected records in the schema
> (which happens more than I'd like, our schema seems to constantly evolve),
> its fantastic how Spark picks up on that data and includes it.
>
>
>
> Passing in a fixed schema loses that nice additional ability, though it's
> what we'll probably have to adopt if we can't come up with a way to keep
> the inference working.
>
>
>
> Thanks,
>
> Ewan
>
>
>
> -- Original message--
>
> *From: *Reynold Xin
>
> *Date: *Thu, 1 Oct 2015 22:12
>
> *To: *Ewan Leith;
>
> *Cc: *dev@spark.apache.org;
>
> *Subject:*Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
>
> You can pass the schema into json directly, can't you?
>
>
>
> On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith <ewan.le...@realitymine.com>
> wrote:
>
> Hi all,
>
>
>
> We really like the ability to infer a schema from JSON contained in an
> RDD, but when we’re using Spark Streaming on small batches of data, we
> sometimes find that Spark infers a more specific type than it should use,
> for example if the json in that small batch only contains integer values
> for a String field, it’ll class the field as an Integer type on one
> Streaming batch, then a String on the next one.
>
>
>
> Instead, we’d rather match every value as a String type, then handle any
> casting to a desired type later in the process.
>
>
>
> I don’t think there’s currently any simple way to avoid this that I can
> see, but we could add the functionality in the JacksonParser.scala file,
> probably in convertField.
>
>
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
>
>
>
> Does anyone know an easier and cleaner way to do this?
>
>
>
> Thanks,
>
> Ewan
>
>
>
>
>


Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Yin Huai
Hi Ewan,

For your use case, you only need the schema inference to pick up the
structure of your data (basically you want Spark SQL to infer the types of
complex values like arrays and structs but keep primitive values as
strings), right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith 
wrote:

> We could, but if a client sends some unexpected records in the schema
> (which happens more than I'd like, our schema seems to constantly evolve),
> its fantastic how Spark picks up on that data and includes it.
>
>
> Passing in a fixed schema loses that nice additional ability, though it's
> what we'll probably have to adopt if we can't come up with a way to keep
> the inference working.
>
>
> Thanks,
>
> Ewan
>
>
> -- Original message--
>
> *From: *Reynold Xin
>
> *Date: *Thu, 1 Oct 2015 22:12
>
> *To: *Ewan Leith;
>
> *Cc: *dev@spark.apache.org;
>
> *Subject:*Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
> You can pass the schema into json directly, can't you?
>
> On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
> wrote:
>
>> Hi all,
>>
>>
>>
>> We really like the ability to infer a schema from JSON contained in an
>> RDD, but when we’re using Spark Streaming on small batches of data, we
>> sometimes find that Spark infers a more specific type than it should use,
>> for example if the json in that small batch only contains integer values
>> for a String field, it’ll class the field as an Integer type on one
>> Streaming batch, then a String on the next one.
>>
>>
>>
>> Instead, we’d rather match every value as a String type, then handle any
>> casting to a desired type later in the process.
>>
>>
>>
>> I don’t think there’s currently any simple way to avoid this that I can
>> see, but we could add the functionality in the JacksonParser.scala file,
>> probably in convertField.
>>
>>
>>
>>
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
>>
>>
>>
>> Does anyone know an easier and cleaner way to do this?
>>
>>
>>
>> Thanks,
>>
>> Ewan
>>
>
>


Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-27 Thread Yin Huai
+1

Tested 1.5.1 SQL blockers.

On Sat, Sep 26, 2015 at 1:36 PM, robineast  wrote:

> +1
>
>
> build/mvn clean package -DskipTests -Pyarn -Phadoop-2.6
> OK
> Basic graph tests
>   Load graph using edgeListFile...SUCCESS
>   Run PageRank...SUCCESS
> Minimum Spanning Tree Algorithm
>   Run basic Minimum Spanning Tree algorithm...SUCCESS
>   Run Minimum Spanning Tree taxonomy creation...SUCCESS
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-1-RC1-tp14310p14380.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Looks like the problem is that df.rdd does not work very well with limit. In
Scala, df.limit(1).rdd will also trigger the issue you observed. I will add
this to the JIRA.
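
For reference, a small sketch of the plan shapes discussed in this thread
(Scala; the path is a placeholder):

```
val df = sqlContext.read.parquet("someparquetfiles")

// Slow path discussed above: going through the RDD after a limit triggers
// the full-scan plan with two Limit stages.
df.limit(1).rdd.collect()

// Alternatives Jerry reported as much faster:
df.show(1)
df.rdd.take(1)
```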

On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam <chiling...@gmail.com> wrote:

> I just noticed you found 1.4 has the same issue. I added that as well in
> the ticket.
>
> On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Yin,
>>
>> You are right! I just tried the scala version with the above lines, it
>> works as expected.
>> I'm not sure if it happens also in 1.4 for pyspark but I thought the
>> pyspark code just calls the scala code via py4j. I didn't expect that this
>> bug is pyspark specific. That surprises me actually a bit. I created a
>> ticket for this (SPARK-10731
>> <https://issues.apache.org/jira/browse/SPARK-10731>).
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai <yh...@databricks.com> wrote:
>>
>>> btw, does 1.4 has the same problem?
>>>
>>> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:
>>>
>>>> Hi Jerry,
>>>>
>>>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>>>
>>>> Thanks,
>>>>
>>>> Yin
>>>>
>>>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Spark Developers,
>>>>>
>>>>> I just ran some very simple operations on a dataset. I was surprise by
>>>>> the execution plan of take(1), head() or first().
>>>>>
>>>>> For your reference, this is what I did in pyspark 1.5:
>>>>> df=sqlContext.read.parquet("someparquetfiles")
>>>>> df.head()
>>>>>
>>>>> The above lines take over 15 minutes. I was frustrated because I can
>>>>> do better without using spark :) Since I like spark, so I tried to figure
>>>>> out why. It seems the dataframe requires 3 stages to give me the first 
>>>>> row.
>>>>> It reads all data (which is about 1 billion rows) and run Limit twice.
>>>>>
>>>>> Instead of head(), show(1) runs much faster. Not to mention that if I
>>>>> do:
>>>>>
>>>>> df.rdd.take(1) //runs much faster.
>>>>>
>>>>> Is this expected? Why head/first/take is so slow for dataframe? Is it
>>>>> a bug in the optimizer? or I did something wrong?
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Jerry
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Seems 1.4 has the same issue.

On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:

> btw, does 1.4 has the same problem?
>
> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:
>
>> Hi Jerry,
>>
>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>
>> Thanks,
>>
>> Yin
>>
>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Spark Developers,
>>>
>>> I just ran some very simple operations on a dataset. I was surprise by
>>> the execution plan of take(1), head() or first().
>>>
>>> For your reference, this is what I did in pyspark 1.5:
>>> df=sqlContext.read.parquet("someparquetfiles")
>>> df.head()
>>>
>>> The above lines take over 15 minutes. I was frustrated because I can do
>>> better without using spark :) Since I like spark, so I tried to figure out
>>> why. It seems the dataframe requires 3 stages to give me the first row. It
>>> reads all data (which is about 1 billion rows) and run Limit twice.
>>>
>>> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>>>
>>> df.rdd.take(1) //runs much faster.
>>>
>>> Is this expected? Why head/first/take is so slow for dataframe? Is it a
>>> bug in the optimizer? or I did something wrong?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>
>>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
btw, does 1.4 has the same problem?

On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:

> Hi Jerry,
>
> Looks like it is a Python-specific issue. Can you create a JIRA?
>
> Thanks,
>
> Yin
>
> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Spark Developers,
>>
>> I just ran some very simple operations on a dataset. I was surprise by
>> the execution plan of take(1), head() or first().
>>
>> For your reference, this is what I did in pyspark 1.5:
>> df=sqlContext.read.parquet("someparquetfiles")
>> df.head()
>>
>> The above lines take over 15 minutes. I was frustrated because I can do
>> better without using spark :) Since I like spark, so I tried to figure out
>> why. It seems the dataframe requires 3 stages to give me the first row. It
>> reads all data (which is about 1 billion rows) and run Limit twice.
>>
>> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>>
>> df.rdd.take(1) //runs much faster.
>>
>> Is this expected? Why head/first/take is so slow for dataframe? Is it a
>> bug in the optimizer? or I did something wrong?
>>
>> Best Regards,
>>
>> Jerry
>>
>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Hi Jerry,

Looks like it is a Python-specific issue. Can you create a JIRA?

Thanks,

Yin

On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam  wrote:

> Hi Spark Developers,
>
> I just ran some very simple operations on a dataset. I was surprised by the
> execution plan of take(1), head() or first().
>
> For your reference, this is what I did in pyspark 1.5:
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
>
> The above lines take over 15 minutes. I was frustrated because I can do
> better without using spark :) Since I like spark, I tried to figure out
> why. It seems the dataframe requires 3 stages to give me the first row. It
> reads all data (which is about 1 billion rows) and runs Limit twice.
>
> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>
> df.rdd.take(1) //runs much faster.
>
> Is this expected? Why are head/first/take so slow for a dataframe? Is it a
> bug in the optimizer? Or did I do something wrong?
>
> Best Regards,
>
> Jerry
>


Re: HyperLogLogUDT

2015-09-13 Thread Yin Huai
The user implementing a UDAF does not need to consider what the underlying
buffer is. Our aggregate operator will figure out whether the buffer data
types of all aggregate functions used by a query are supported by UnsafeRow.
If so, we will use UnsafeRow as the buffer.

Regarding performance, a UDAF is not as efficient as our built-in aggregate
functions, mainly because (1) users implement UDAFs with JVM data types, not
SQL data types (e.g., in a UDAF you use String, not UTF8String, which is our
internal SQL data type), and (2) UDAFs do not support code generation.

For handling different data types for an argument, having multiple UDAF
classes is the way to go for now. We will consider the right way to support
specifying multiple possible data types for an argument.
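
For concreteness, a minimal sketch of the public UDAF interface in Spark
1.5/1.6 (the function itself, a count of non-null string values, is only an
illustration):

```
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CountNonNull extends UserDefinedAggregateFunction {
  // JVM types (e.g. String), not internal SQL types such as UTF8String.
  def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("count", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + 1L
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}
```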

Thanks,

Yin

On Sat, Sep 12, 2015 at 11:01 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Thanks Yin
>
> So how does one ensure a UDAF works with Tungsten and UnsafeRow buffers?
> Or is this something that will be included in the UDAF interface in future?
>
> Is there a performance difference between Extending UDAF vs Aggregate2?
>
> It's also not clear to me how to handle inputs of different types? What if
> my UDAF can handle String and Long for example? Do I need to specify
> AnyType or is there a way to specify multiple types possible for a single
> input column?
>
> If no performance difference and UDAF can work with Tungsten, then Herman
> does it perhaps make sense to use UDAF (but without a UDT as you've done
> for performance)? As it would then be easy to extend that UDAF and adjust
> the output types as needed. It also provides a really nice example of how
> to use the interface for something advanced and high performance.
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Sun, Sep 13, 2015 at 12:09 AM, Yin Huai <yh...@databricks.com> wrote:
>
>> Hi Nick,
>>
>> The buffer exposed to UDAF interface is just a view of underlying buffer
>> (this underlying buffer is shared by different aggregate functions and
>> every function takes one or multiple slots). If you need a UDAF, extending
>> UserDefinedAggregationFunction is the preferred
>> approach. AggregateFunction2 is used for built-in aggregate function.
>>
>> Thanks,
>>
>> Yin
>>
>> On Sat, Sep 12, 2015 at 10:40 AM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> Ok, that makes sense. So this is (a) more efficient, since as far as I
>>> can see it is updating the HLL registers directly in the buffer for each
>>> value, and (b) would be "Tungsten-compatible" as it can work against
>>> UnsafeRow? Is it currently possible to specify an UnsafeRow as a buffer in
>>> a UDAF?
>>>
>>> So is extending AggregateFunction2 the preferred approach over the
>>> UserDefinedAggregationFunction interface? Or it is that internal only?
>>>
>>> I see one of the main use cases for things like HLL / CMS and other
>>> approximate data structure being the fact that you can store them as
>>> columns representing distinct counts in an aggregation. And then do further
>>> arbitrary aggregations on that data as required. e.g. store hourly
>>> aggregate data, and compute daily or monthly aggregates from that, while
>>> still keeping the ability to have distinct counts on certain fields.
>>>
>>> So exposing the serialized HLL as Array[Byte] say, so that it can be
>>> further aggregated in a later DF operation, or saved to an external data
>>> source, would be super useful.
>>>
>>>
>>>
>>> On Sat, Sep 12, 2015 at 6:06 PM, Herman van Hövell tot Westerflier <
>>> hvanhov...@questtec.nl> wrote:
>>>
>>>> I am typically all for code re-use. The reason for writing this is to
>>>> prevent the indirection of a UDT and work directly against memory. A UDT
>>>> will work fine at the moment because we still use
>>>> GenericMutableRow/SpecificMutableRow as aggregation buffers. However if you
>>>> would use an UnsafeRow as an AggregationBuffer (which is attractive when
>>>> you have a lot of groups during aggregation) the use of an UDT is either
>>>> impossible or it would become very slow because it would require us to
>>>> deserialize/serialize a UDT on every update.
>>>>
>>>> As for compatibility, the implementation produces exactly the same
>>>> results as the ClearSpring implementation. You could easily export the
>>>> HLL++ register values to the current ClearSpring implementation and export
>>>> those.
>&g

Re: HyperLogLogUDT

2015-09-12 Thread Yin Huai
Hi Nick,

The buffer exposed to the UDAF interface is just a view of the underlying
buffer (that underlying buffer is shared by the different aggregate functions,
and each function takes one or more slots). If you need a UDAF, extending
UserDefinedAggregateFunction is the preferred approach; AggregateFunction2 is
used for built-in aggregate functions.

Thanks,

Yin

On Sat, Sep 12, 2015 at 10:40 AM, Nick Pentreath 
wrote:

> Ok, that makes sense. So this is (a) more efficient, since as far as I can
> see it is updating the HLL registers directly in the buffer for each value,
> and (b) would be "Tungsten-compatible" as it can work against UnsafeRow? Is
> it currently possible to specify an UnsafeRow as a buffer in a UDAF?
>
> So is extending AggregateFunction2 the preferred approach over the
> UserDefinedAggregationFunction interface? Or it is that internal only?
>
> I see one of the main use cases for things like HLL / CMS and other
> approximate data structure being the fact that you can store them as
> columns representing distinct counts in an aggregation. And then do further
> arbitrary aggregations on that data as required. e.g. store hourly
> aggregate data, and compute daily or monthly aggregates from that, while
> still keeping the ability to have distinct counts on certain fields.
>
> So exposing the serialized HLL as Array[Byte] say, so that it can be
> further aggregated in a later DF operation, or saved to an external data
> source, would be super useful.
>
>
>
> On Sat, Sep 12, 2015 at 6:06 PM, Herman van Hövell tot Westerflier <
> hvanhov...@questtec.nl> wrote:
>
>> I am typically all for code re-use. The reason for writing this is to
>> prevent the indirection of a UDT and work directly against memory. A UDT
>> will work fine at the moment because we still use
>> GenericMutableRow/SpecificMutableRow as aggregation buffers. However if you
>> would use an UnsafeRow as an AggregationBuffer (which is attractive when
>> you have a lot of groups during aggregation) the use of an UDT is either
>> impossible or it would become very slow because it would require us to
>> deserialize/serialize a UDT on every update.
>>
>> As for compatibility, the implementation produces exactly the same
>> results as the ClearSpring implementation. You could easily export the
>> HLL++ register values to the current ClearSpring implementation and export
>> those.
>>
>> Met vriendelijke groet/Kind regards,
>>
>> Herman van Hövell tot Westerflier
>>
>> QuestTec B.V.
>> Torenwacht 98
>> 2353 DC Leiderdorp
>> hvanhov...@questtec.nl
>> +599 9 521 4402
>>
>>
>> 2015-09-12 11:06 GMT+02:00 Nick Pentreath :
>>
>>> I should add that surely the idea behind UDT is exactly that it can (a)
>>> fit automatically into DFs and Tungsten and (b) that it can be used
>>> efficiently in writing ones own UDTs and UDAFs?
>>>
>>>
>>> On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
 Can I ask why you've done this as a custom implementation rather than
 using StreamLib, which is already implemented and widely used? It seems
 more portable to me to use a library - for example, I'd like to export the
 grouped data with raw HLLs to say Elasticsearch, and then do further
 on-demand aggregation in ES and visualization in Kibana etc.

 Others may want to do something similar into Hive, Cassandra, HBase or
 whatever they are using. In this case they'd need to use this particular
 implementation from Spark which may be tricky to include in a dependency
 etc.

 If there are enhancements, does it not make sense to do a PR to
 StreamLib? Or does this interact in some better way with Tungsten?

 I am unclear on how the interop with Tungsten raw memory works - some
 pointers on that and where to look in the Spark code would be helpful.

 On Sat, Sep 12, 2015 at 10:45 AM, Herman van Hövell tot Westerflier <
 hvanhov...@questtec.nl> wrote:

> Hello Nick,
>
> I have been working on a (UDT-less) implementation of HLL++. You can
> find the PR here: https://github.com/apache/spark/pull/8362. This
> current implements the dense version of HLL++, which is a further
> development of HLL. It returns a Long, but it shouldn't be to hard to
> return a Row containing the cardinality and/or the HLL registers (the
> binary data).
>
> I am curious what the stance is on using UDTs in the new UDAF
> interface. Is this still viable? This wouldn't work with UnsafeRow for
> instance. The OpenHashSetUDT for instance would be a nice building block
> for CollectSet and all Distinct Aggregate operators. Are there any 
> opinions
> on this?
>
> Kind regards,
>
> Herman van Hövell tot Westerflier
>
> QuestTec B.V.
> Torenwacht 98
> 2353 DC Leiderdorp
> hvanhov...@questtec.nl
> +599 9 521 4402
>
>
> 

Re: [SparkSQL]Could not alter table in Spark 1.5 use HiveContext

2015-09-10 Thread Yin Huai
Yes, Spark 1.5 uses Hive 1.2's metastore client by default. You can change
it by putting the following settings in your Spark conf:

spark.sql.hive.metastore.version = 0.13.1
spark.sql.hive.metastore.jars = maven, or the path of your Hive 0.13 jars
and Hadoop jars

For spark.sql.hive.metastore.jars, basically, it tells Spark SQL where to
find the metastore client classes for Hive 0.13.1. If you set it to maven,
we will download the needed jars directly (an easy way to do testing).
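
A hedged example of setting these from code (the jar path is a placeholder;
the same keys work in spark-defaults.conf):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.sql.hive.metastore.version", "0.13.1")
  // Either "maven" (download the needed jars; convenient for testing) or a
  // classpath containing your Hive 0.13.1 and Hadoop jars:
  .set("spark.sql.hive.metastore.jars", "/opt/hive-0.13.1/lib/*:/opt/hadoop/share/hadoop/common/*")
```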

On Thu, Sep 10, 2015 at 7:45 PM, StanZhai  wrote:

> Thank you for the swift reply!
>
> The version of my hive metastore server is 0.13.1, I've build spark use sbt
> like this:
> build/sbt -Pyarn -Phadoop-2.4 -Phive -Phive-thriftserver assembly
>
> Is spark 1.5 bind the hive client version of 1.2 by default?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSQL-Could-not-alter-table-in-Spark-1-5-use-HiveContext-tp14029p14044.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-04 Thread Yin Huai
Hi Krishna,

Can you share your code to reproduce the memory allocation issue?

Thanks,

Yin

On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar  wrote:

> Thanks Tom.  Interestingly it happened between RC2 and RC3.
> Now my vote is +1/2 unless the memory error is known and has a workaround.
>
> Cheers
> 
>
>
> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves  wrote:
>
>> The upper/lower case thing is known.
>> https://issues.apache.org/jira/browse/SPARK-9550
>> I assume it was decided to be ok and its going to be in the release notes
>>  but Reynold or Josh can probably speak to it more.
>>
>> Tom
>>
>>
>>
>> On Thursday, September 3, 2015 10:21 PM, Krishna Sankar <
>> ksanka...@gmail.com> wrote:
>>
>>
>> +?
>>
>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
>>  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>> 2. Tested pyspark, mllib
>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>> 2.2. Linear/Ridge/Laso Regression OK
>> 2.3. Decision Tree, Naive Bayes OK
>> 2.4. KMeans OK
>>Center And Scale OK
>> 2.5. RDD operations OK
>>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>Model evaluation/optimization (rank, numIter, lambda) with
>> itertools OK
>> 3. Scala - MLlib
>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>> 3.2. LinearRegressionWithSGD OK
>> 3.3. Decision Tree OK
>> 3.4. KMeans OK
>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>> 3.6. saveAsParquetFile OK
>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
>> registerTempTable, sql OK
>> 3.8. result = sqlContext.sql("SELECT
>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>> 4.0. Spark SQL from Python OK
>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
>> 5.0. Packages
>> 5.1. com.databricks.spark.csv - read/write OK
>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
>> com.databricks:spark-csv_2.11:1.2.0 worked)
>> 6.0. DataFrames
>> 6.1. cast,dtypes OK
>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>> 6.3. All joins,sql,set operations,udf OK
>>
>> Two Problems:
>>
>> 1. The synthetic column names are lowercase ( i.e. now ‘sum(OrderPrice)’;
>> previously ‘SUM(OrderPrice)’, now ‘avg(Total)’; previously 'AVG(Total)').
>> So programs that depend on the case of the synthetic column names would
>> fail.
>> 2. orders_3.groupBy("Year","Month").sum('Total').show()
>> fails with the error ‘java.io.IOException: Unable to acquire 4194304
>> bytes of memory’
>> orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails
>> with the same error
>> Is this a known bug ?
>> Cheers
>> 
>> P.S: Sorry for the spam, forgot Reply All
>>
>> On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.5.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>>
>> The tag to be voted on is v1.5.0-rc3:
>>
>> https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release (published as 1.5.0-rc3) can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1143/
>>
>> The staging repository for this release (published as 1.5.0) can be found
>> at:
>> https://repository.apache.org/content/repositories/orgapachespark-1142/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>>
>> 
>> What justifies a -1 vote for this release?
>> 
>> This vote is happening towards the end of the 1.5 QA period, so -1 votes
>> should only occur for significant regressions from 1.4. Bugs already
>> present in 1.4, minor regressions, or bugs related to new features will not
>> block this release.
>>
>>
>> 

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Yin Huai
-1

Found a problem on reading partitioned table. Right now, we may create a
SQL project/filter operator for every partition. When we have thousands of
partitions, there will be a huge number of SQLMetrics (accumulators), which
causes high memory pressure to the driver and then takes down the cluster
(long GC time causes different kinds of timeouts).

https://issues.apache.org/jira/browse/SPARK-10339

Will have a fix soon.

On Fri, Aug 28, 2015 at 3:18 PM, Jon Bender jonathan.ben...@gmail.com
wrote:

 Marcelo,

 Thanks for replying -- after looking at my test again, I misinterpreted
 another issue I'm seeing, which is unrelated (note I'm not using a pre-built
 binary; rather, I had to build my own with YARN/Hive support, as I want to use
 it on an older cluster (CDH 5.1.0)).

 I can start up a pyspark app on YARN, so I don't want to block this.  +1

 Best,
 Jonathan

 On Fri, Aug 28, 2015 at 2:34 PM, Marcelo Vanzin van...@cloudera.com
 wrote:

 Hi Jonathan,

 Can you be more specific about what problem you're running into?

 SPARK-6869 fixed the issue of pyspark vs. assembly jar by shipping the
 pyspark archives separately to YARN. With that fix in place, pyspark
 doesn't need to get anything from the Spark assembly, so it has no
 problems running on YARN. I just downloaded
 spark-1.5.0-bin-hadoop2.6.tgz and tried that out, and pyspark works
 fine on YARN for me.


 On Fri, Aug 28, 2015 at 2:22 PM, Jonathan Bender
 jonathan.ben...@gmail.com wrote:
  -1 for regression on PySpark + YARN support
 
  It seems like this JIRA
 https://issues.apache.org/jira/browse/SPARK-7733
  added a requirement for Java 7 in the build process.  Due to some quirks
  with the Java archive format changes between Java 6 and 7, using PySpark
  with a YARN uberjar seems to break when compiled with anything after
 Java 6
  (see https://issues.apache.org/jira/browse/SPARK-1920 for reference).
 
 
 
  --
  View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC2-tp13826p13890.html
  Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 



 --
 Marcelo





Re: SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread Yin Huai
The JSON support in Spark SQL handles a file with one JSON object per line
or one JSON array of objects per line. What is the format of your file? Does
it only contain a single line?
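
For example (a minimal sketch; it assumes a spark-shell session where
sqlContext is already defined, and the file path and fields are made up):

  // people.json -- one self-contained JSON object per line ("JSON Lines"):
  //   {"name": "Alice", "age": 30}
  //   {"name": "Bob", "age": 25}
  val df = sqlContext.read.json("hdfs://master:9000/path/people.json")
  df.printSchema()
  df.show()

A single pretty-printed document, or one giant single-line file, is not split
into records this way, which would also explain the "Too many bytes before
newline" error in the log below.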

On Wed, Aug 26, 2015 at 6:47 AM, gsvic victora...@gmail.com wrote:

 Hi,

 I have the following issue. I am trying to load a 2.5G JSON file from a
 10-node Hadoop Cluster. Actually, I am trying to create a DataFrame, using
 sqlContext.read.json(hdfs://master:9000/path/file.json).

 The JSON file contains a parsed table(relation) from the TPCH benchmark.

 After finishing some tasks, the job fails by throwing several
 java.io.IOExceptions. For smaller files (eg 700M it works fine). I am
 posting a part of the log and the whole stack trace below:

 15/08/26 16:31:44 INFO TaskSetManager: Starting task 10.1 in stage 1.0 (TID
 47, 192.168.5.146, ANY, 1416 bytes)
 15/08/26 16:31:44 INFO TaskSetManager: Starting task 11.1 in stage 1.0 (TID
 48, 192.168.5.150, ANY, 1416 bytes)
 15/08/26 16:31:44 INFO TaskSetManager: Starting task 4.1 in stage 1.0 (TID
 49, 192.168.5.149, ANY, 1416 bytes)
 15/08/26 16:31:44 INFO TaskSetManager: Starting task 8.1 in stage 1.0 (TID
 50, 192.168.5.246, ANY, 1416 bytes)
 15/08/26 16:31:53 INFO TaskSetManager: Finished task 10.0 in stage 1.0 (TID
 17) in 104681 ms on 192.168.5.243 (27/35)
 15/08/26 16:31:53 INFO TaskSetManager: Finished task 8.0 in stage 1.0 (TID
 15) in 105541 ms on 192.168.5.193 (28/35)
 15/08/26 16:31:55 INFO TaskSetManager: Finished task 11.0 in stage 1.0 (TID
 18) in 107122 ms on 192.168.5.167 (29/35)
 15/08/26 16:31:57 INFO TaskSetManager: Finished task 5.0 in stage 1.0 (TID
 12) in 109583 ms on 192.168.5.245 (30/35)
 15/08/26 16:32:08 INFO TaskSetManager: Finished task 4.1 in stage 1.0 (TID
 49) in 24135 ms on 192.168.5.149 (31/35)
 15/08/26 16:32:13 WARN TaskSetManager: Lost task 2.0 in stage 1.0 (TID 9,
 192.168.5.246): java.io.IOException: Too many bytes before newline:
 2147483648
 at
 org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
 at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
 at
 org.apache.hadoop.mapred.LineRecordReader.init(LineRecordReader.java:134)
 at

 org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
 at
 org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:239)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/SQLContext-read-json-path-throws-java-io-IOException-tp13841.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.4.1

2015-06-29 Thread Yin Huai
+1. I tested those SQL blocker bugs in my laptop and they have been fixed.

On Mon, Jun 29, 2015 at 6:51 AM, Sean Owen so...@cloudera.com wrote:

 +1 sigs, license, etc check out.

 All tests pass for me in the Hadoop 2.6 + Hive configuration on Ubuntu.
 (I still get those pesky cosmetic UDF test failures in Java 8, but
 they are clearly just test issues.)

 I'll follow up on retargeting 1.4.1 issues afterwards as needed, but
 again feel free to move those you're sure won't be in this release.

 On Wed, Jun 24, 2015 at 6:37 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark version
 1.4.1!
 
  This release fixes a handful of known issues in Spark 1.4.0, listed here:
  http://s.apache.org/spark-1.4.1
 
  The tag to be voted on is v1.4.1-rc1 (commit 60e08e5):
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
  60e08e50751fe3929156de956d62faea79f5b801
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-bin/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  [published as version: 1.4.1]
  https://repository.apache.org/content/repositories/orgapachespark-1118/
  [published as version: 1.4.1-rc1]
  https://repository.apache.org/content/repositories/orgapachespark-1119/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-docs/
 
  Please vote on releasing this package as Apache Spark 1.4.1!
 
  The vote is open until Saturday, June 27, at 06:32 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.4.1
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Hive 0.12 support in 1.4.0 ?

2015-06-22 Thread Yin Huai
Hi Tom,

In Spark 1.4, we have de-coupled the support of Hive's metastore and other
parts (parser, Hive udfs, and Hive SerDes). The execution engine of Spark
SQL in 1.4 will always use Hive 0.13.1. For the metastore connection part,
you can connect to either Hive 0.12 or 0.13.1's metastore. We have removed
old shims and profiles of specifying the Hive version (since execution
engine is always using Hive 0.13.1 and metastore client part can be
configured to use either Hive 0.12 or 0.13.1's metastore).

You can take a look at
https://spark.apache.org/docs/latest/sql-programming-guide.html#interacting-with-different-versions-of-hive-metastore
for connecting to Hive 0.12's metastore.
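
As a rough sketch (the two config keys below are the ones described in that
guide; the app name and the "maven" value are only illustrative -- you can also
point spark.sql.hive.metastore.jars at a classpath containing your Hive 0.12
jars):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val conf = new SparkConf()
    .setAppName("hive-0.12-metastore")
    // Talk to a Hive 0.12 metastore while execution still uses Hive 0.13.1.
    .set("spark.sql.hive.metastore.version", "0.12.0")
    .set("spark.sql.hive.metastore.jars", "maven")

  val sc = new SparkContext(conf)
  val hiveContext = new HiveContext(sc)
  hiveContext.sql("SHOW TABLES").show()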

Let me know if you have any questions.

Thanks,

Yin

On Wed, Jun 17, 2015 at 4:18 PM, Thomas Dudziak tom...@gmail.com wrote:

 So I'm a little confused, has Hive 0.12 support disappeared in 1.4.0 ? The
 release notes didn't mention anything, but the documentation doesn't list a
 way to build for 0.12 anymore (
 http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support,
 in fact it doesn't list anything other than 0.13), and I don't see any
 maven profiles nor code for 0.12.

 Tom




Re: Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Yin Huai
Is it the full stack trace?

On Thu, Jun 18, 2015 at 6:39 AM, Sea 261810...@qq.com wrote:

 Hi, all:

 I want to run spark sql on yarn (yarn-client), but ... I already set
 spark.yarn.jar and spark.jars in conf/spark-defaults.conf.

 ./bin/spark-sql -f game.sql --executor-memory 2g --num-executors 100  
 game.txt

 Exception in thread "main" java.lang.NoClassDefFoundError:
 org/apache/spark/deploy/yarn/ExecutorLauncher
 Caused by: java.lang.ClassNotFoundException:
 org.apache.spark.deploy.yarn.ExecutorLauncher
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 Could not find the main class:
 org.apache.spark.deploy.yarn.ExecutorLauncher.  Program will exit.


 Anyone can help?



Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-05 Thread Yin Huai
Sean,

Can you add -Phive -Phive-thriftserver and try those Hive tests?

Thanks,

Yin

On Fri, Jun 5, 2015 at 5:19 AM, Sean Owen so...@cloudera.com wrote:

 Everything checks out again, and the tests pass for me on Ubuntu +
 Java 7 with '-Pyarn -Phadoop-2.6', except that I always get
 SparkSubmitSuite errors like ...

 - success sanity check *** FAILED ***
   java.lang.RuntimeException: [download failed:
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed:
 commons-net#commons-net;3.1!commons-net.jar]
   at
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)
   at
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
   ...

 I also can't get hive tests to pass. Is anyone else seeing anything
 like this? if not I'll assume this is something specific to the env --
 or that I don't have the build invocation just right. It's puzzling
 since it's so consistent, but I presume others' tests pass and Jenkins
 does.


 On Wed, Jun 3, 2015 at 5:53 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark version
 1.4.0!
 
  The tag to be voted on is v1.4.0-rc3 (commit 22596c5):
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
  22596c534a38cfdda91aef18aa9037ab101e4251
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  [published as version: 1.4.0]
  https://repository.apache.org/content/repositories/orgapachespark-/
  [published as version: 1.4.0-rc4]
  https://repository.apache.org/content/repositories/orgapachespark-1112/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/
 
  Please vote on releasing this package as Apache Spark 1.4.0!
 
  The vote is open until Saturday, June 06, at 05:00 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.4.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == What has changed since RC3 ==
  In addition to many smaller fixes, three blocker issues were fixed:
  4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make
  metadataHive get constructed too early
  6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
  78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton
 
  == How can I help test this release? ==
  If you are a Spark user, you can help us test this release by
  taking a Spark 1.3 workload and running on this release candidate,
  then reporting any regressions.
 
  == What justifies a -1 vote for this release? ==
  This vote is happening towards the end of the 1.4 QA period,
  so -1 votes should only occur for significant regressions from 1.3.1.
  Bugs already present in 1.3.X, minor regressions, or bugs related
  to new features will not block this release.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Yin Huai
Hi Peter,

Based on your error message, it seems you were not using RC3. For the error
thrown at HiveContext's line 206, we changed the message to this one
https://github.com/apache/spark/blob/v1.4.0-rc3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L205-207
just before RC3; basically, we no longer print out the class loader name. Can
you check whether an older version of the 1.4 branch got used? Have you
published an RC3 to your local Maven repo? Can you clean your local repo cache
and try again?

Thanks,

Yin

On Mon, Jun 1, 2015 at 10:45 AM, Peter Rudenko petro.rude...@gmail.com
wrote:

  Still have problem using HiveContext from sbt. Here’s an example of
 dependencies:

  val sparkVersion = "1.4.0-rc3"

  lazy val root = Project(id = "spark-hive", base = file("."),
    settings = Project.defaultSettings ++ Seq(
      name := "spark-1.4-hive",
      scalaVersion := "2.10.5",
      scalaBinaryVersion := "2.10",
      resolvers += "Spark RC" at
        "https://repository.apache.org/content/repositories/orgapachespark-1110/",
      libraryDependencies ++= Seq(
        "org.apache.spark" %% "spark-core"  % sparkVersion,
        "org.apache.spark" %% "spark-mllib" % sparkVersion,
        "org.apache.spark" %% "spark-hive"  % sparkVersion,
        "org.apache.spark" %% "spark-sql"   % sparkVersion
      )
    ))

 Launching sbt console with it and running:

 val conf = new SparkConf().setMaster("local[4]").setAppName("test")
 val sc = new SparkContext(conf)
 val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 val data = sc.parallelize(1 to 1)
 import sqlContext.implicits._
 scala> data.toDF
 java.lang.IllegalArgumentException: Unable to locate hive jars to connect to 
 metastore using classloader 
 scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. Please set 
 spark.sql.hive.metastore.jars
 at 
 org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206)
 at 
 org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
 at 
 org.apache.spark.sql.hive.HiveContext$anon$2.init(HiveContext.scala:367)
 at 
 org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367)
 at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366)
 at 
 org.apache.spark.sql.hive.HiveContext$anon$1.init(HiveContext.scala:379)
 at 
 org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:379)
 at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:378)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901)
 at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134)
 at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
 at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:474)
 at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:456)
 at 
 org.apache.spark.sql.SQLContext$implicits$.intRddToDataFrameHolder(SQLContext.scala:345)

 Thanks,
 Peter Rudenko

 On 2015-06-01 05:04, Guoqiang Li wrote:

   +1 (non-binding)


  -- Original --
  From: Sandy Ryza sandy.r...@cloudera.com
  Date: Mon, Jun 1, 2015 07:34 AM
  To: Krishna Sankar ksanka...@gmail.com
  Cc: Patrick Wendell pwend...@gmail.com; dev@spark.apache.org
  Subject: Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

  +1 (non-binding)

  Launched against a pseudo-distributed YARN cluster running Hadoop 2.6.0
 and ran some jobs.

  -Sandy

 On Sat, May 30, 2015 at 3:44 PM, Krishna Sankar ksanka...@gmail.com wrote:

  +1 (non-binding, of course)

  1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min
  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
 -Dhadoop.version=2.6.0 -DskipTests
 2. Tested pyspark, mlib - running as well as compare results with 1.3.1
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Lasso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. Scala - MLlib
 3.1. statistics (min,max,mean,Pearson,Spearman) OK
 3.2. LinearRegressionWithSGD OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
 3.6. saveAsParquetFile OK
 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
 registerTempTable, sql OK
 3.8. result = sqlContext.sql("SELECT
 OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
 JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID")

Re: ClosureCleaner slowing down Spark SQL queries

2015-05-29 Thread Yin Huai
For Spark SQL internal operations, probably we can just
create MapPartitionsRDD directly (like
https://github.com/apache/spark/commit/5287eec5a6948c0c6e0baaebf35f512324c0679a
).

On Fri, May 29, 2015 at 11:04 AM, Josh Rosen rosenvi...@gmail.com wrote:

 Hey, want to file a JIRA for this?  This will make it easier to track
 progress on this issue.  Definitely upload the profiler screenshots there,
 too, since that's helpful information.

 https://issues.apache.org/jira/browse/SPARK



 On Wed, May 27, 2015 at 11:12 AM, Nitin Goyal nitin2go...@gmail.com
 wrote:

 Hi Ted,

 Thanks a lot for replying. First of all, moving to 1.4.0 RC2 is not easy
 for us, as the migration cost is big since a lot has changed in Spark SQL
 since 1.2.

 Regarding SPARK-7233, I had already looked at it a few hours back and it
 solves the problem for concurrent queries, but my problem is just for a
 single query. I also looked at the fix's code diff and it wasn't related to
 the problem, which seems to exist in the ClosureCleaner code.

 Thanks
 -Nitin



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tp12466p12468.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-28 Thread Yin Huai
Justin,

If you are creating multiple HiveContexts in tests, you need to assign a
temporary metastore location for every HiveContext (like what we do at here
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L527-L543).
Otherwise, they all try to connect to the metastore in the current dir
(look at metastore_db).
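
A rough sketch of that idea (modeled on what TestHive does; the helper name,
temp-dir handling, and exact conf values are illustrative -- set them before
the context touches the metastore):

  import java.nio.file.Files
  import org.apache.spark.SparkContext
  import org.apache.spark.sql.hive.HiveContext

  def newTestHiveContext(sc: SparkContext): HiveContext = {
    // Give every test its own Derby metastore and warehouse directory, so two
    // HiveContexts never fight over ./metastore_db in the current directory.
    val metastoreDir = Files.createTempDirectory("metastore").toFile.getAbsolutePath
    val warehouseDir = Files.createTempDirectory("warehouse").toFile.getAbsolutePath
    val ctx = new HiveContext(sc)
    ctx.setConf("javax.jdo.option.ConnectionURL",
      s"jdbc:derby:;databaseName=$metastoreDir/metastore_db;create=true")
    ctx.setConf("hive.metastore.warehouse.dir", warehouseDir)
    ctx
  }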

Peter,

Do you also have the same use case as Justin (creating multiple
HiveContexts in tests)? Can you explain what you meant by "all tests"? I am
probably missing some context here.

Thanks,

Yin


On Thu, May 28, 2015 at 11:28 AM, Peter Rudenko petro.rude...@gmail.com
wrote:

  Also have the same issue - all tests fail because of HiveContext / derby
 lock.

 Cause: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection 
 to the given database. JDBC url = 
 jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
 Terminating connection pool (set lazyInit to true if you expect to start your 
 database after your app). Original Exception: --
 [info] java.sql.SQLException: Failed to start database 'metastore_db' with 
 class loader 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@8066e0e, see 
 the next exception for details.

 Also, is there a build for hadoop2.6? I don't see it here:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/

 Thanks,
 Peter Rudenko

 On 2015-05-22 22:56, Justin Uang wrote:

   I'm working on one of the Palantir teams using Spark, and here is our
 feedback:

  We have encountered three issues when upgrading to spark 1.4.0. I'm not
 sure they qualify as a -1, as they come from using non-public APIs and
 multiple spark contexts for the purposes of testing, but I do want to bring
 them up for awareness =)

1. Our UDT was serializing to a StringType, but now strings are
represented internally as UTF8String, so we had to change our UDT to use
UTF8String.apply() and UTF8String.toString() to convert back to String.
2. createDataFrame when using UDTs used to accept things in the
serialized catalyst form. Now, they're supposed to be in the UDT java class
form (I think this change would've affected us in 1.3.1 already, since we
were in 1.3.0)
3. derby database lifecycle management issue with HiveContext. We have
been using a SparkContextResource JUnit Rule that we wrote, and it sets up
then tears down a SparkContext and HiveContext between unit test runs
within the same process (possibly the same thread as well). Multiple
contexts are not being used at once. It used to work in 1.3.0, but now when
we try to create the HiveContext for the second unit test, then it
complains with the following exception. I have a feeling it might have
something to do with the Hive object being thread local, and us not
explicitly closing the HiveContext and everything it holds. The full stack
trace is here:
 https://gist.github.com/justinuang/0403d49cdeedf91727cd

  Caused by: java.sql.SQLException: Failed to start database 'metastore_db' 
 with class loader 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$anon$1@5dea2446, see 
 the next exception for details.
   at 
 org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)


 On Wed, May 20, 2015 at 10:35 AM Imran Rashid iras...@cloudera.com
 wrote:

 -1

 discovered I accidentally removed the master & worker JSON endpoints, will
 restore
  https://issues.apache.org/jira/browse/SPARK-7760

 On Tue, May 19, 2015 at 11:10 AM, Patrick Wendell pwend...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.4.0!

 The tag to be voted on is v1.4.0-rc1 (commit 777a081):

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.4.0-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1092/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.4.0!

 The vote is open until Friday, May 22, at 17:03 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.3 workload and 

[Spark SQL] Generating new golden answer files for HiveComparisonTest

2015-04-25 Thread Yin Huai
Spark SQL developers,

If you are trying to add new tests based on HiveComparisonTest and want to
generate golden answer files with Hive 0.13.1, unfortunately, the setup
work is quite different from that for Hive 0.12. We have updated the SQL
readme to include the new instructions for Hive 0.13.1. You can find them in
the section "Other dependencies for developers":
https://github.com/apache/spark/tree/master/sql.

Please let me know if you still see any issues after setting up your
environment based on these instructions.

Thanks,

Yin


Re: dataframe can not find fields after loading from hive

2015-04-19 Thread Yin Huai
Hi Cesar,

Can you try 1.3.1 (
https://spark.apache.org/releases/spark-release-1-3-1.html) and see if it
still shows the error?

Thanks,

Yin

On Fri, Apr 17, 2015 at 1:58 PM, Reynold Xin r...@databricks.com wrote:

 This is strange. cc the dev list since it might be a bug.



 On Thu, Apr 16, 2015 at 3:18 PM, Cesar Flores ces...@gmail.com wrote:

 Never mind. I found the solution:

 val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd,
 hiveLoadedDataFrame.schema)

 which translate to convert the data frame to rdd and back again to data
 frame. Not the prettiest solution, but at least it solves my problems.


 Thanks,
 Cesar Flores



 On Thu, Apr 16, 2015 at 11:17 AM, Cesar Flores ces...@gmail.com wrote:


 I have a data frame in which I load data from a hive table. And my issue
 is that the data frame is missing the columns that I need to query.

 For example:

 val newdataset = dataset.where(dataset("label") === 1)

 gives me an error like the following:

 ERROR yarn.ApplicationMaster: User class threw exception: resolved
 attributes label missing from label, user_id, ...(the rest of the fields of
 my table
 org.apache.spark.sql.AnalysisException: resolved attributes label
 missing from label, user_id, ... (the rest of the fields of my table)

 where we can see that the label field actually exists. I managed to solve
 this issue by updating my syntax to:

 val newdataset = dataset.where($"label" === 1)

 which works. However, I cannot use this trick in all my queries. For
 example, when I try to do a unionAll of two subsets of the same data
 frame, the error I am getting is that all my fields are missing.

 Can someone tell me if I need to do some post processing after loading
 from hive in order to avoid this kind of errors?


 Thanks
 --
 Cesar Flores




 --
 Cesar Flores





Re: Spark SQL ExternalSorter not stopped

2015-03-20 Thread Yin Huai
Hi Michael,

Thanks for reporting it. Yes, it is a bug. I have created
https://issues.apache.org/jira/browse/SPARK-6437 to track it.

Thanks,

Yin

On Thu, Mar 19, 2015 at 10:51 AM, Michael Allman mich...@videoamp.com
wrote:

 I've examined the experimental support for ExternalSorter in Spark SQL,
 and it does not appear that the external sorter is ever stopped
 (ExternalSorter.stop). According to the API documentation, this suggests a
 resource leak. Before I file a bug report in Jira, can someone familiar
 with the codebase confirm this is indeed a bug?

 Thanks,

 Michael
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Spark 1.3 SQL Type Parser Changes?

2015-03-10 Thread Yin Huai
Hi Nitay,

Can you try using backticks to quote the column name? Like
org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<`int`:bigint>")?

Thanks,

Yin

On Tue, Mar 10, 2015 at 2:43 PM, Michael Armbrust mich...@databricks.com
wrote:

 Thanks for reporting.  This was a result of a change to our DDL parser
 that resulted in types becoming reserved words.  I've filled a JIRA and
 will investigate if this is something we can fix.
 https://issues.apache.org/jira/browse/SPARK-6250

 On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe ni...@actioniq.co wrote:

 In Spark 1.2 I used to be able to do this:

 scala> org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<int:bigint>")
 res30: org.apache.spark.sql.catalyst.types.DataType =
 StructType(List(StructField(int,LongType,true)))

 That is, the name of a column can be a keyword like int. This is no
 longer the case in 1.3:

 data-pipeline-shell> HiveTypeHelper.toDataType("struct<int:bigint>")
 org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8]
 failure: ``'' expected but `int' found

 struct<int:bigint>
        ^
 at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52)
 at
 org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785)
 at
 org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9)

 Note HiveTypeHelper is simply an object I load in to expose
 HiveMetastoreTypes since it was made private. See
 https://gist.github.com/nitay/460b41ed5fd7608507f5

 This is actually a pretty big problem for us as we have a bunch of legacy
 tables with column names like timestamp. They work fine in 1.2, but now
 everything throws in 1.3.

 Any thoughts?

 Thanks,
 - Nitay
 Founder & CTO





Re: org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.1] failure: ``varchar'' expected but identifier char found in spark-sql

2015-02-17 Thread Yin Huai
Hi Qiuzhuang,

Right now, char is not supported in DDL. Can you try varchar or string?
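
For example, a hypothetical definition that avoids the unsupported type (the
table and column names are invented; the point is declaring the id column as
STRING or VARCHAR(n) instead of CHAR(32), run from a HiveContext or as the
equivalent statements in the spark-sql shell):

  hiveContext.sql("""
    CREATE TABLE IF NOT EXISTS assistants_fixed (
      id           STRING,        -- was char(32) in the original Hive table
      assistant_no VARCHAR(20),
      grade        INT
    )""")
  hiveContext.sql("DESCRIBE assistants_fixed").collect().foreach(println)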

Thanks,

Yin

On Mon, Feb 16, 2015 at 10:39 PM, Qiuzhuang Lian qiuzhuang.l...@gmail.com
wrote:

 Hi,

 I am not sure whether this has been reported already or not; I ran into this
 error under the spark-sql shell as built from the newest Spark git trunk:

 spark-sql> describe qiuzhuang_hcatlog_import;
 15/02/17 14:38:36 ERROR SparkSQLDriver: Failed in [describe
 qiuzhuang_hcatlog_import]
 org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.1]
 failure: ``varchar'' expected but identifier char found

 char(32)
 ^
 at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52)
 at

 org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:664)
 at

 org.apache.spark.sql.hive.MetastoreRelation$$anonfun$23.apply(HiveMetastoreCatalog.scala:674)
 at

 org.apache.spark.sql.hive.MetastoreRelation$$anonfun$23.apply(HiveMetastoreCatalog.scala:674)
 at

 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at

 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at

 org.apache.spark.sql.hive.MetastoreRelation.init(HiveMetastoreCatalog.scala:674)
 at

 org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:185)
 at org.apache.spark.sql.hive.HiveContext$$anon$2.org

 $apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:234)

 As in hive 0.131, console, this commands works,

 hive> describe qiuzhuang_hcatlog_import;
 OK
 id  char(32)
 assistant_novarchar(20)
 assistant_name  varchar(32)
 assistant_type  int
 grade   int
 shop_no varchar(20)
 shop_name   varchar(64)
 organ_novarchar(20)
 organ_name  varchar(20)
 entry_date  string
 education   int
 commission  decimal(8,2)
 tel varchar(20)
 address varchar(100)
 identity_card   varchar(25)
 sex int
 birthdaystring
 employee_type   int
 status  int
 remark  varchar(255)
 create_user_no  varchar(20)
 create_user varchar(32)
 create_time string
 update_user_no  varchar(20)
 update_user varchar(32)
 update_time string
 Time taken: 0.49 seconds, Fetched: 26 row(s)
 hive>


 Regards,
 Qiuzhuang



Re: Join implementation in SparkSQL

2015-01-16 Thread Yin Huai
Hi Alex,

Can you attach the output of sql("explain extended your
query").collect.foreach(println)?

Thanks,

Yin

On Fri, Jan 16, 2015 at 1:54 PM, Alessandro Baretta alexbare...@gmail.com
wrote:

 Reynold,

 The source file you are directing me to is a little too terse for me to
 understand what exactly is going on. Let me tell you what I'm trying to do
 and what problems I'm encountering, so that you might be able to better
 direct my investigation of the SparkSQL codebase.

 I am computing the join of three tables, sharing the same primary key,
 composed of three fields, and having several other fields. My first attempt
 at computing this join was in SQL, with a query much like this slightly
 simplified one:

  SELECT
   a.key1 key1, a.key2 key2, a.key3 key3,
   a.data1   adata1,a.data2adata2,...
   b.data1   bdata1,b.data2bdata2,...
   c.data1   cdata1,c.data2cdata2,...
 FROM a, b, c
 WHERE
   a.key1 = b.key1 AND a.key2 = b.key2 AND a.key3 = b.key3 AND
   b.key1 = c.key1 AND b.key2 = c.key2 AND b.key3 = c.key3

 This code yielded a SparkSQL job containing 40,000 stages, which failed
 after filling up all available disk space on the worker nodes.

 I then wrote this join as a plain mapreduce join. The code looks roughly
 like this:
 val a_ = a.map(row => (key(row), ("a", row)))
 val b_ = b.map(row => (key(row), ("b", row)))
 val c_ = c.map(row => (key(row), ("c", row)))
 val join = new UnionRDD(sc, List(a_, b_, c_)).groupByKey

 This implementation yields approximately 1600 stages and completes in a few
 minutes on a 256 core cluster. The huge difference in scale of the two jobs
 makes me think that SparkSQL is implementing my join as a cartesian product.
 This is the query plan--I'm not sure I can read it, but it does seem to
 imply that the filter conditions are not being pushed far down enough:

  'Project [...]
  'Filter ((('a.key1 = 'b.key1) && ('a.key2 = 'b.key2)) && ...)
   'Join Inner, None
    'Join Inner, None

 Is SparkSQL maybe unable to push join conditions down from the WHERE clause
 into the join itself?
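
 For reference, a sketch of the same join written with explicit JOIN ... ON
 clauses (column lists abbreviated, table names as above), together with the
 explain-extended check Yin asked for:

   val query = """
     SELECT a.key1, a.key2, a.key3, a.data1, b.data1, c.data1
     FROM a
     JOIN b ON a.key1 = b.key1 AND a.key2 = b.key2 AND a.key3 = b.key3
     JOIN c ON b.key1 = c.key1 AND b.key2 = c.key2 AND b.key3 = c.key3"""

   // Print the parsed/analyzed/optimized/physical plans to see whether an
   // equi-join is chosen instead of a cartesian product plus a filter.
   sqlContext.sql("explain extended " + query).collect.foreach(println)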

 Alex

 On Thu, Jan 15, 2015 at 10:36 AM, Reynold Xin r...@databricks.com wrote:

  It's a bunch of strategies defined here:
 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
 
  In most common use cases (e.g. inner equi join), filters are pushed below
  the join or into the join. Doing a cartesian product followed by a filter
  is too expensive.
 
 
  On Thu, Jan 15, 2015 at 7:39 AM, Alessandro Baretta 
 alexbare...@gmail.com
   wrote:
 
  Hello,
 
  Where can I find docs about how joins are implemented in SparkSQL? In
  particular, I'd like to know whether they are implemented according to
  their relational algebra definition as filters on top of a cartesian
  product.
 
  Thanks,
 
  Alex
 
 
 



Re: scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-08 Thread Yin Huai
Seems you hit https://issues.apache.org/jira/browse/SPARK-4245. It was
fixed in 1.2.

Thanks,

Yin

On Wed, Dec 3, 2014 at 11:50 AM, invkrh inv...@gmail.com wrote:

 Hi,

 I am using SparkSQL on 1.1.0 branch.

 The following code leads to a scala.MatchError
 at

 org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)

 val scm = StructType(inputRDD.schema.fields.init :+
   StructField("list",
     ArrayType(
       StructType(Seq(
         StructField("date", StringType, nullable = false),
         StructField("nbPurchase", IntegerType, nullable = false)))),
     nullable = false))

 // purchaseRDD is an RDD[sql.Row] whose schema corresponds to scm. It is
 // transformed from inputRDD
 val schemaRDD = hiveContext.applySchema(purchaseRDD, scm)
 schemaRDD.registerTempTable("t_purchase")

 Here's the stackTrace:
 scala.MatchError: ArrayType(StructType(List(StructField(date,StringType,
 true), StructField(n_reachat,IntegerType,true))),true) (of class
 org.apache.spark.sql.catalyst.types.ArrayType)
 at

 org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
 at
 org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
 at
 org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
 at

 org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
 at

 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
 at

 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org
 $apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
 at

 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 at

 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)

 The strange thing is that nullable of the date and nbPurchase fields is
 set to true while it was false in the code. If I set both to true, it
 works. But, in fact, they should not be nullable.

 Here's what I find at Cast.scala:247 on 1.1.0 branch

   private[this] lazy val cast: Any => Any = dataType match {
 case StringType => castToString
 case BinaryType => castToBinary
 case DecimalType => castToDecimal
 case TimestampType => castToTimestamp
 case BooleanType => castToBoolean
 case ByteType => castToByte
 case ShortType => castToShort
 case IntegerType => castToInt
 case FloatType => castToFloat
 case LongType => castToLong
 case DoubleType => castToDouble
   }

 Any idea? Thank you.

 Hao



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/scala-MatchError-on-SparkSQL-when-creating-ArrayType-of-StructType-tp9623.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Get attempt number in a closure

2014-10-20 Thread Yin Huai
Hello,

Is there any way to get the attempt number in a closure? Seems
TaskContext.attemptId actually returns the taskId of a task (see this
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L181
 and this
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L47).
It looks like a bug.

Thanks,

Yin


Re: Get attempt number in a closure

2014-10-20 Thread Yin Huai
Yeah, seems we need to pass the attempt id to executors through
TaskDescription. I have created
https://issues.apache.org/jira/browse/SPARK-4014.

On Mon, Oct 20, 2014 at 1:57 PM, Reynold Xin r...@databricks.com wrote:

 I also ran into this earlier. It is a bug. Do you want to file a jira?

 I think part of the problem is that we don't actually have the attempt id
 on the executors. If we do, that's great. If not, we'd need to propagate
 that over.

 On Mon, Oct 20, 2014 at 7:17 AM, Yin Huai huaiyin@gmail.com wrote:

 Hello,

 Is there any way to get the attempt number in a closure? Seems
 TaskContext.attemptId actually returns the taskId of a task (see this
 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L181
 
  and this
 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L47
 ).
 It looks like a bug.

 Thanks,

 Yin





Re: Get attempt number in a closure

2014-10-20 Thread Yin Huai
Yes, it is for (2). I was confused because the doc of TaskContext.attemptId
(release 1.1)
http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.TaskContext
describes it as "the number of attempts to execute this task". It seems the
per-task attempt id used to populate the attempt field in the UI is maintained
by TaskSetManager and its value is assigned in resourceOffer.

On Mon, Oct 20, 2014 at 4:56 PM, Reynold Xin r...@databricks.com wrote:

 Yes, as I understand it this is for (2).

 Imagine a use case in which I want to save some output. In order to make
 this atomic, the program uses part_[index]_[attempt].dat, and once it
 finishes writing, it renames this to part_[index].dat.

 Right now [attempt] is just the TID, which could show up like (assuming
 this is not the first stage):

 part_0_1000
 part_1_1001
 part_0_1002 (some retry)
 ...

 This is fairly confusing. The natural thing to expect is

 part_0_0
 part_1_0
 part_0_1
 ...
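
 A sketch of that write-then-rename pattern using the Hadoop FileSystem API
 (the helper, paths, and config handling are illustrative, not Spark code):

   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileSystem, Path}

   def writeAtomically(dir: String, index: Int, attempt: Int, bytes: Array[Byte]): Unit = {
     val fs = FileSystem.get(new Configuration())
     val tmp = new Path(dir, s"part_${index}_$attempt.dat")
     val dst = new Path(dir, s"part_$index.dat")
     val out = fs.create(tmp, true) // each attempt writes its own temp file
     try out.write(bytes) finally out.close()
     // The rename is the "commit": only one attempt's output becomes part_[index].dat.
     fs.rename(tmp, dst)
   }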



 On Mon, Oct 20, 2014 at 1:47 PM, Kay Ousterhout k...@eecs.berkeley.edu
 wrote:

 Sorry to clarify, there are two issues here:

 (1) attemptId has different meanings in the codebase
 (2) we currently don't propagate the 0-based per-task attempt identifier
 to the executors.

 (1) should definitely be fixed.  It sounds like Yin's original email was
 requesting that we add (2).

 On Mon, Oct 20, 2014 at 1:45 PM, Kay Ousterhout k...@eecs.berkeley.edu
 wrote:

 Are you guys sure this is a bug?  In the task scheduler, we keep two
 identifiers for each task: the index, which uniquely identifies the
 computation+partition, and the taskId which is unique across all tasks
 for that Spark context (See
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L439).
 If multiple attempts of one task are run, they will have the same index,
 but different taskIds.  Historically, we have used taskId and
 taskAttemptId interchangeably (which arose from naming in Mesos, which
 uses similar naming).

 This was complicated when Mr. Xin added the attempt field to TaskInfo,
 which we show in the UI.  This field uniquely identifies attempts for a
 particular task, but is not unique across different task indexes (it always
 starts at 0 for a given task).  I'm guessing the right fix is to rename
 Task.taskAttemptId to Task.taskId to resolve this inconsistency -- does
 that sound right to you Reynold?

 -Kay

 On Mon, Oct 20, 2014 at 1:29 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 There is a deeper issue here which is AFAIK we don't even store a
 notion of attempt inside of Spark, we just use a new taskId with the
 same index.

 On Mon, Oct 20, 2014 at 12:38 PM, Yin Huai huaiyin@gmail.com
 wrote:
  Yeah, seems we need to pass the attempt id to executors through
  TaskDescription. I have created
  https://issues.apache.org/jira/browse/SPARK-4014.
 
  On Mon, Oct 20, 2014 at 1:57 PM, Reynold Xin r...@databricks.com
 wrote:
 
  I also ran into this earlier. It is a bug. Do you want to file a
 jira?
 
  I think part of the problem is that we don't actually have the
 attempt id
  on the executors. If we do, that's great. If not, we'd need to
 propagate
  that over.
 
  On Mon, Oct 20, 2014 at 7:17 AM, Yin Huai huaiyin@gmail.com
 wrote:
 
  Hello,
 
  Is there any way to get the attempt number in a closure? Seems
  TaskContext.attemptId actually returns the taskId of a task (see
 this
  
 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L181
  
   and this
  
 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L47
  ).
  It looks like a bug.
 
  Thanks,
 
  Yin
 
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org







Re: Spark SQL Query and join different data sources.

2014-09-02 Thread Yin Huai
Actually, with HiveContext, you can join hive tables with registered
temporary tables.
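
A minimal sketch (the Hive table name, JSON path, and columns are invented):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)

  // Register some non-Hive data (here a JSON file) as a temporary table.
  val customers = hiveContext.jsonFile("hdfs:///data/customers.json")
  customers.registerTempTable("customers")

  // Join the temporary table against an existing Hive table in one query.
  val joined = hiveContext.sql("""
    SELECT o.order_id, c.name
    FROM orders o JOIN customers c ON o.customer_id = c.id""")
  joined.collect().foreach(println)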


On Fri, Aug 22, 2014 at 9:07 PM, chutium teng@gmail.com wrote:

 oops, thanks Yan, you are right, i got

 scala> sqlContext.sql("select * from a join b").take(10)
 java.lang.RuntimeException: Table Not Found: b
 at scala.sys.package$.error(package.scala:27)
 at

 org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
 at

 org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
 at scala.Option.getOrElse(Option.scala:120)
 at

 org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:90)

 and with hql

 scala> hiveContext.hql("select * from a join b").take(10)
 warning: there were 1 deprecation warning(s); re-run with -deprecation for
 details
 14/08/22 14:48:45 INFO parse.ParseDriver: Parsing command: select * from a
 join b
 14/08/22 14:48:45 INFO parse.ParseDriver: Parse Completed
 14/08/22 14:48:45 ERROR metadata.Hive:
 NoSuchObjectException(message:default.a table not found)
 at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27129)
 at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27097)
 at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:27028)
 at
 org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
 at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:936)
 at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:922)
 at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at

 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
 at com.sun.proxy.$Proxy17.getTable(Unknown Source)
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924)
 at

 org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:59)


 so sqlContext is looking up the table from
 org.apache.spark.sql.catalyst.analysis.SimpleCatalog (Catalog.scala), while
 hiveContext looks it up from org.apache.spark.sql.hive.HiveMetastoreCatalog
 (HiveMetastoreCatalog.scala)

 maybe we can do something in sqlContext to register a hive table as a
 Spark-SQL table; we would need to read column info, partition info, location,
 SerDe, Input/OutputFormat, and maybe the StorageHandler too, from the hive
 metastore...




 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Query-and-join-different-data-sources-tp7914p7955.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org