Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2016-01-12 Thread Cheolsoo Park
Alex, see this jira-
https://issues.apache.org/jira/browse/SPARK-9926
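
For reference, Hadoop properties like this one can be handed to a Spark job through the `spark.hadoop.*` configuration prefix, which copies them into the Hadoop `Configuration` used by HadoopRDD (a sketch; whether the listing path actually honors the value is exactly what the jira above tracks, and 25 is just the thread count used in this thread):

```properties
# spark-defaults.conf (or --conf on spark-submit)
spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads  25
```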

On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:

> Ran into this need myself. Does Spark have an equivalent of
> "mapreduce.input.fileinputformat.list-status.num-threads"?
>
> Thanks.
>
> On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park 
> wrote:
>
>> Hi,
>>
>> I am wondering if anyone has successfully enabled
>> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
>> usually set this property to 25 to speed up file listing in MR jobs (Hive
>> and Pig). But for some reason, this property does not take effect in Spark's
>> HadoopRDD, resulting in a serious delay in file listing.
>>
>> I verified that the property is indeed set in HadoopRDD by logging the
>> value of the property in the getPartitions() function. I also tried to
>> attach VisualVM to Spark and Pig clients, which look as follows-
>>
>> In Pig, I can see 25 threads running in parallel for file listing-
>> [image: Inline image 1]
>>
>> In Spark, I only see 2 threads running in parallel for file listing-
>> [image: Inline image 2]
>>
>> What's strange is that the # of concurrent threads in Spark is throttled
>> no matter how high I
>> set "mapreduce.input.fileinputformat.list-status.num-threads".
>>
>> Is anyone using Spark with this property enabled? If so, can you please
>> share how you do it?
>>
>> Thanks!
>> Cheolsoo
>>
>
>


Re: Flaky test in DAGSchedulerSuite?

2015-09-04 Thread Cheolsoo Park
Thank you Pete!

On Fri, Sep 4, 2015 at 1:40 PM, Pete Robbins  wrote:

> raised https://issues.apache.org/jira/browse/SPARK-10454 and PR
>
> On 4 September 2015 at 21:24, Pete Robbins  wrote:
>
>> I've also just hit this and was about to raise a JIRA for this if there
>> isn't one already. I have a simple fix.
>>
>> On 4 September 2015 at 19:09, Cheolsoo Park  wrote:
>>
>>> Hi devs,
>>>
>>> I noticed this test case fails intermittently in Jenkins.
>>>
>>> For eg, see the following builds-
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/
>>>
>>> The test failed in different PRs, and the failure looks unrelated to
>>> changes in the PRs. Looks like the test was added by the following commit-
>>>
>>> commit 80e2568b25780a7094199239da8ad6cfb6efc9f7
>>> Author: Imran Rashid 
>>> Date:   Mon Jul 20 10:28:32 2015 -0700
>>> [SPARK-8103][core] DAGScheduler should not submit multiple
>>> concurrent attempts for a stage
>>>
>>> Thanks!
>>> Cheolsoo
>>>
>>
>>
>


Flaky test in DAGSchedulerSuite?

2015-09-04 Thread Cheolsoo Park
Hi devs,

I noticed this test case fails intermittently in Jenkins.

For eg, see the following builds-
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/

The test failed in different PRs, and the failure looks unrelated to
changes in the PRs. Looks like the test was added by the following commit-

commit 80e2568b25780a7094199239da8ad6cfb6efc9f7
Author: Imran Rashid 
Date:   Mon Jul 20 10:28:32 2015 -0700
[SPARK-8103][core] DAGScheduler should not submit multiple concurrent
attempts for a stage

Thanks!
Cheolsoo


Re: Jenkins having issues?

2015-08-18 Thread Cheolsoo Park
Thank you for looking into it.

On Tue, Aug 18, 2015 at 4:26 PM, shane knapp  wrote:

> hey all...  so this has been happening intermittently and i'm not sure
> what's causing it.
>
> sometimes directories under the target/tmp/ dir get created w/o the
> owner write bit set, so that they look like this:
> dr-xr-xr-x.  2 jenkins jenkins  4096 Aug  9 01:28
>
> /home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-67a08260-3318-42ec-b12d-65c700f8f220
>
> this means that the next build that runs in that directory space fails
> when 'git clean -fdx' encounters a directory that it can't remove.
>
> for now, i've added the following lines to the pull request builder
> run script (before the git clean command) to fix the target/ dir:
>
> echo "fixing target dir permissions"
> chmod -R +w target/*
>
> other than that, i'm looking around the codebase and some older builds,
> seeing if i can't find the culprit.
> -- Forwarded message --
> From: Cheolsoo Park 
> Date: Fri, Aug 14, 2015 at 4:11 PM
> Subject: Jenkins having issues?
> To: Dev 
>
>
> Hi devs,
>
> Jenkins failed twice in my PR for an unknown error-
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/40930/console
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/40931/console
>
> Can you help?
>
> Thank you!
> Cheolsoo
>
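
The failure mode shane describes can be reproduced in a throwaway directory (a sketch in Python; the real builder fix is the two-line chmod snippet quoted above, and the `spark-demo` name here is made up):

```python
import os
import shutil
import stat
import tempfile

root = tempfile.mkdtemp()                      # throwaway workspace
bad = os.path.join(root, "target", "tmp", "spark-demo")
os.makedirs(bad)
os.chmod(bad, 0o555)                           # dr-xr-xr-x: owner write bit missing

# find directories the next build's 'git clean -fdx' would fail to remove
offenders = [os.path.join(r, d)
             for r, dirs, _ in os.walk(root)
             for d in dirs
             if not os.stat(os.path.join(r, d)).st_mode & stat.S_IWUSR]

# the same idea as the builder script's fix: restore the write bit, then clean
for p in offenders:
    os.chmod(p, os.stat(p).st_mode | stat.S_IWUSR)
shutil.rmtree(root)
```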


Jenkins having issues?

2015-08-14 Thread Cheolsoo Park
Hi devs,

Jenkins failed twice in my PR for an unknown error-

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/40930/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/40931/console

Can you help?

Thank you!
Cheolsoo


Re: pyspark.sql.tests: is test_time_with_timezone a flaky test?

2015-07-13 Thread Cheolsoo Park
Thank you!

On Sun, Jul 12, 2015 at 10:59 PM, Davies Liu  wrote:

> Will be fixed by https://github.com/apache/spark/pull/7363
>
> On Sun, Jul 12, 2015 at 7:45 PM, Davies Liu  wrote:
> > Thanks for reporting this, I'm working on it. It turned out that it's
> > a bug when run with Python 3.4; I will send out a fix soon.
> >
> > On Sun, Jul 12, 2015 at 1:33 PM, Cheolsoo Park 
> wrote:
> >> Hi devs,
> >>
> >> For some reason, I keep getting this test failure (3 out of 4 builds)
> in my
> >> PR-
> >>
> >> ==
> >> FAIL: test_time_with_timezone (__main__.SQLTests)
> >> --
> >> Traceback (most recent call last):
> >>   File
> >>
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests.py",
> >> line 718, in test_time_with_timezone
> >> self.assertEqual(now, now1)
> >> AssertionError: datetime.datetime(2015, 7, 12, 13, 18, 46, 504366) !=
> >> datetime.datetime(2015, 7, 12, 13, 18, 46, 504365)
> >>
> >> Jenkins builds-
> >>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37100/console
> >>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37092/console
> >>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37081/console
> >>
> >> I am aware that there was a hot fix for this test case, and I already
> have
> >> it in the commit log-
> >>
> >> commit 05ac023dc8d9004a27c2f06ee875b0ff3743ccdd
> >>
> >> Author: Davies Liu 
> >> Date:   Fri Jul 10 13:05:23 2015 -0700
> >> [HOTFIX] fix flaky test in PySpark SQL
> >>
> >> I looked at the test code, and it seems that precision in microseconds
> is
> >> lost somewhere in a round trip from Python to DataFrame. Can someone
> please
> >> help me debug this error?
> >>
> >> Thanks!
> >> Cheolsoo
> >>
> >>
>


pyspark.sql.tests: is test_time_with_timezone a flaky test?

2015-07-12 Thread Cheolsoo Park
Hi devs,

For some reason, I keep getting this test failure (3 out of 4 builds) in my
PR -

==
FAIL: test_time_with_timezone (__main__.SQLTests)
--
Traceback (most recent call last):
  File
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests.py",
line 718, in test_time_with_timezone
self.assertEqual(now, now1)
AssertionError: datetime.datetime(2015, 7, 12, 13, 18, 46, 504366) !=
datetime.datetime(2015, 7, 12, 13, 18, 46, 504365)

Jenkins builds-
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37100/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37092/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37081/console

I am aware that there was a hot fix for this test case, and I already have
it in the commit log-

commit 05ac023dc8d9004a27c2f06ee875b0ff3743ccdd

Author: Davies Liu 
Date:   Fri Jul 10 13:05:23 2015 -0700
[HOTFIX] fix flaky test in PySpark SQL

I looked at the test code, and it seems that microsecond precision is
lost somewhere in a round trip from Python to a DataFrame. Can someone please
help me debug this error?
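
Since the two timestamps in the failure differ by exactly one microsecond, one defensive pattern for such a test (a sketch, not the actual hotfix) is to compare with a small tolerance instead of exact equality:

```python
from datetime import datetime, timedelta

def within_tolerance(a, b, micros=1):
    """Compare two datetimes allowing a small drift, e.g. from a
    float-seconds round trip that can be off by one microsecond."""
    return abs(a - b) <= timedelta(microseconds=micros)

# the exact values from the Jenkins failure above
now = datetime(2015, 7, 12, 13, 18, 46, 504366)
now1 = datetime(2015, 7, 12, 13, 18, 46, 504365)
assert within_tolerance(now, now1)
```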

Thanks!
Cheolsoo


Re: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

2015-05-24 Thread Cheolsoo Park
Thank you Hao for the confirmation!

I filed two jiras as follows-
https://issues.apache.org/jira/browse/SPARK-7850 (removing hive-0.12.0
profile from pom)
https://issues.apache.org/jira/browse/SPARK-7851 (thrift error with hive
metastore 0.12)


On Sun, May 24, 2015 at 8:18 PM, Cheng, Hao  wrote:

>  Thanks for reporting this.
>
>
>
> We intend to support the multiple metastore versions in a single
> build(hive-0.13.1) by introducing the IsolatedClientLoader, but probably
> you’re hitting the bug, please file a jira issue for this.
>
>
>
> I will keep investigating on this also.
>
>
>
> Hao
>
>
>
>
>
> *From:* Mark Hamstra [mailto:m...@clearstorydata.com]
> *Sent:* Sunday, May 24, 2015 9:06 PM
> *To:* Cheolsoo Park
> *Cc:* u...@spark.apache.org; dev@spark.apache.org
> *Subject:* Re: SparkSQL errors in 1.4 rc when using with Hive 0.12
> metastore
>
>
>
> This discussion belongs on the dev list.  Please post any replies there.
>
>
>
> On Sat, May 23, 2015 at 10:19 PM, Cheolsoo Park 
> wrote:
>
>  Hi,
>
>
>
> I've been testing SparkSQL in 1.4 rc and found two issues. I wanted to
> confirm whether these are bugs or not before opening a jira.
>
>
> *1)* I can no longer compile SparkSQL with -Phive-0.12.0. I noticed that
> in 1.4, IsolatedClientLoader was introduced so that different versions of Hive
> metastore jars can be loaded at runtime. But as a result, SparkSQL no longer
> compiles with Hive 0.12.0.
>
>
>
> My question is, is this intended? If so, shouldn't the hive-0.12.0 profile
> in POM be removed?
>
>
>
> *2)* After compiling SparkSQL with -Phive-0.13.1, I ran into my 2nd
> problem. Since I have Hive 0.12 metastore in production, I have to use it
> for now. But even if I set "spark.sql.hive.metastore.version" and
> "spark.sql.hive.metastore.jars", SparkSQL cli throws an error as follows-
>
>
>
> 15/05/24 05:03:29 WARN RetryingMetaStoreClient: MetaStoreClient lost
> connection. Attempting to reconnect.
>
> org.apache.thrift.TApplicationException: Invalid method name:
> 'get_functions'
>
> at
> org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
>
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
>
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_functions(ThriftHiveMetastore.java:2886)
>
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_functions(ThriftHiveMetastore.java:2872)
>
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunctions(HiveMetaStoreClient.java:1727)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
>
> at com.sun.proxy.$Proxy12.getFunctions(Unknown Source)
>
> at org.apache.hadoop.hive.ql.metadata.Hive.getFunctions(Hive.java:2670)
>
> at
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:674)
>
> at
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:662)
>
> at
> org.apache.hadoop.hive.cli.CliDriver.getCommandCompletor(CliDriver.java:540)
>
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:175)
>
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>
>
>
> What's happening is that when SparkSQL Cli starts up, it tries to fetch
> permanent udfs from Hive metastore (due to HIVE-6330
> <https://issues.apache.org/jira/browse/HIVE-6330>, which was introduced
> in Hive 0.13). But then, it ends up invoking an incompatible thrift
> function that doesn't exist in Hive 0.12. To work around this error, I have
> to comment out the following line of code for now-
>
> https://goo.gl/wcfnH1
>
>
>
> My question is, is SparkSQL that is compiled against Hive 0.13 supposed to
> work with Hive 0.12 metastore (by setting
> "spark.sql.hive.metastore.version" and "spark.sql.hive.metastore.jars")? It
> only works if I comment out the above line of code.
>
>
>
> Thanks,
>
> Cheolsoo
>
>
>


Re: Spark Sql reading hive partitioned tables?

2015-04-14 Thread Cheolsoo Park
Is there a plan to fix this? I also ran into this issue with a "select *
from tbl where ... limit 10" query. Spark SQL is 100x slower than Presto in
the worst case (a table with 1.6M partitions). This is a serious blocker for
us since we have many tables with near (and over) 1M partitions, and any
query against these big tables wastes 5 minutes just fetching partition info.

I briefly looked at the code, and it looks like resolving metastore
relations is the first thing that the analyzer does prior to any other
optimization rules such as partition pruning. So in the Hive metastore
client, it ends up calling getAllPartitions() with no filter expression. I
am wondering how much work will be involved to fix this issue. Can you
please advise what you think should be done?
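
To illustrate the cost difference described above (a toy model, not Spark code; names and the partition layout are made up): with getAllPartitions() every partition's metadata crosses the wire before pruning, whereas a metastore-side filter would return only the matching entries.

```python
# 30 daily partitions; with no pushdown, all 30 are fetched from the metastore
all_partitions = [f"ds=2015-04-{d:02d}" for d in range(1, 31)]

def client_side_prune(parts, predicate):
    # prune only AFTER fetching everything -- what the analyzer effectively
    # does today; a pushed-down filter would fetch just the matches
    return [p for p in parts if predicate(p)]

pruned = client_side_prune(all_partitions, lambda p: p >= "ds=2015-04-28")
print(len(all_partitions), "partitions fetched,", len(pruned), "actually needed")
```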


On Mon, Apr 13, 2015 at 3:27 PM, Michael Armbrust 
wrote:

> Yeah, we don't currently push down predicates into the metastore.  Though,
> we do prune partitions based on predicates (so we don't read the data).
>
> On Mon, Apr 13, 2015 at 2:53 PM, Tom Graves 
> wrote:
>
> > Hey,
> > I was trying out spark sql using the HiveContext and doing a select on a
> > partitioned table with lots of partitions (16,000+). It took over 6
> minutes
> > before it even started the job. It looks like it was querying the Hive
> > metastore and got a good chunk of data back.  Which I'm guessing is info
> on
> > the partitions.  Running the same query using hive takes 45 seconds for
> the
> > entire job.
> > I know spark sql doesn't support all the hive optimization.  Is this a
> > known limitation currently?
> > Thanks,Tom
>