[dev-platform] Intent to prototype: Symbols as WeakMap keys (ECMA 262)

2023-09-27 Thread Yoshi Cheng-Hao Huang
*Summary:*
This proposal allows Symbols to be used as keys in WeakMap/WeakSet, and as
the target in WeakRef/FinalizationRegistry.
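
For illustration (this sketch is not part of the original proposal text), a
non-registered symbol can be held weakly as a WeakMap key and as a
FinalizationRegistry target, while registered symbols created with
Symbol.for() remain excluded:

```
// Hypothetical names, for illustration only.
const weakData = new WeakMap();

let token = Symbol("session token");            // unique, non-registered symbol
weakData.set(token, { createdAt: Date.now() });
console.log(weakData.has(token));               // true

const registry = new FinalizationRegistry((held) => {
  console.log(`cleaned up: ${held}`);
});
registry.register(token, "session token");

// Once `token` becomes unreachable, the WeakMap entry can be collected
// and the registry callback may eventually run.
token = null;

// Registered symbols cannot be held weakly:
// new WeakMap().set(Symbol.for("shared"), 1) throws a TypeError.
```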

*Bug:*
https://bugzilla.mozilla.org/show_bug.cgi?id=1828144

*Standards body:*
https://tc39.es/proposal-symbols-as-weakmap-keys/

This proposal has been merged into ECMA-262:
https://tc39.es/ecma262/multipage/executable-code-and-execution-contexts.html#sec-canbeheldweakly

*Platform coverage:*
All

*Preference:*
N/A yet

*Other browsers:*
Chrome has shipped this since 108 behind the flag
"--harmony-symbol-as-weakmap-key"
https://groups.google.com/a/chromium.org/g/blink-dev/c/E6pDZP_TiBA/m/ZcXLwiz8AAAJ
and it has been enabled by default since April 2023
https://chromium.googlesource.com/v8/v8/+/71ff68830279b7ad6719db066b21f0489e871596

Safari has shipped this since 16.4
https://developer.apple.com/documentation/safari-release-notes/safari-16_4-release-notes

*web-platform-tests:*
Tests are located in test262/
https://github.com/tc39/test262/pull/3678



[dev-platform] Intent to ship: Well-Formed Unicode Strings

2023-09-06 Thread Yoshi Cheng-Hao Huang
As of Firefox 119, I intend to turn on Well-Formed Unicode Strings (JS) by
default

Bug to turn on by default:
https://bugzilla.mozilla.org/show_bug.cgi?id=1850755

Standard:
https://tc39.es/ecma262/#sec-string.prototype.iswellformed
https://tc39.es/ecma262/#sec-string.prototype.towellformed

The feature was previously discussed in "Intent to prototype: Well-Formed
Unicode Strings"
https://groups.google.com/a/mozilla.org/g/dev-platform/c/lJercpdK6VY/m/swRCS6N3AQAJ



[dev-platform] Intent to ship: Module Workers

2023-05-02 Thread Yoshi Cheng-Hao Huang
[I am sending this on behalf of Yulia Startsev, who is currently on parental
leave]


Summary:

Module workers enable you to use ECMAScript modules in workers rather than
just classic scripts. This enables `import`/`export` syntax to be used in a
worker. It also enables dynamic import to run in workers, for both shared and
dedicated workers. You instantiate a module worker like so:

```
const worker = new Worker("", { type: "module" });
```
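
As a minimal sketch of what this enables (the file names below are
hypothetical, not taken from the original message), the worker script itself
can then use static `import`/`export` as well as dynamic import():

```
// worker.js: hypothetical module worker script, for illustration only
import { parse } from "./parser.js";   // static import works in a module worker

self.onmessage = async (event) => {
  const parsed = parse(event.data);
  // dynamic import() also works inside module workers
  const { transform } = await import("./transform.js");
  self.postMessage(transform(parsed));
};
```

On the main thread such a worker would be created with
`new Worker("worker.js", { type: "module" })`, as in the constructor call above.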

I intend to ship this feature in Firefox 114

Bugs:

   - https://bugzilla.mozilla.org/show_bug.cgi?id=1247687
   - https://bugzilla.mozilla.org/show_bug.cgi?id=1540913
   - https://bugzilla.mozilla.org/show_bug.cgi?id=1805676

Specification: 
https://html.spec.whatwg.org/multipage/workers.html#worker-processing-model

Platform Coverage: All

Preference: enabled by default under `dom.workers.modules.enabled` when Bug
1812591  lands.

Documentation: https://developer.mozilla.org/en-US/docs/Web/API/Worker

Other Browsers: Safari and Chrome have shipped.

Testing: Tested through our own tests as well as web-platform-tests.



[dev-platform] Intent to prototype: Well-Formed Unicode Strings

2023-03-28 Thread Yoshi Cheng-Hao Huang
*Summary:*
Add String.prototype.isWellFormed and String.prototype.toWellFormed methods.
isWellFormed() returns a boolean indicating whether the string is well-formed
UTF-16.
toWellFormed() returns a string with all unpaired surrogate code points
converted to U+FFFD (REPLACEMENT CHARACTER).

This is currently at stage 3 of the TC39 proposal process.
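
For illustration (this example is not part of the original intent email), the
behavior on a string containing a lone surrogate:

```
const ok  = "ab\uD83D\uDE00";  // well-formed: the two code units form a surrogate pair
const bad = "ab\uD800c";       // ill-formed: contains a lone (unpaired) high surrogate

ok.isWellFormed();             // true
bad.isWellFormed();            // false

bad.toWellFormed();            // "ab\uFFFDc", the lone surrogate becomes U+FFFD
ok.toWellFormed() === ok;      // true, already well-formed strings come back unchanged
```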

*Bug:*
https://bugzilla.mozilla.org/show_bug.cgi?id=1803523

*Standards body:*
https://tc39.es/proposal-is-usv-string/

*Platform coverage:*
All

*Preference:*
N/A yet

*Other browsers:*
Chrome has shipped this since 111
https://chromestatus.com/feature/5200195346759680
flag:--harmony-string-is-well-formed
which is enabled by default

Safari: https://bugs.webkit.org/show_bug.cgi?id=248588
This feature shipped in Safari Technology Preview 160
https://webkit.org/blog/13639/release-notes-for-safari-technology-preview-160/
flag: useStringWellFormed
which is enabled by default


*web-platform-tests:*
Tests are located in test262/
https://github.com/tc39/test262/tree/main/test/built-ins/String/prototype/toWellFormed
https://github.com/tc39/test262/tree/main/test/built-ins/String/prototype/isWellFormed



[dev-platform] Intent to Ship: Import maps

2022-10-17 Thread Yoshi Cheng-Hao Huang
As of Firefox 108, I intend to turn Import maps on by default.

Bug to turn on by default:
https://bugzilla.mozilla.org/show_bug.cgi?id=1795647

Standard:
https://html.spec.whatwg.org/multipage/webappapis.html#import-maps

The feature was previously discussed in the "Intent to prototype: Import
maps" thread:
https://groups.google.com/a/mozilla.org/g/dev-platform/c/tiReRwpIT30



[dev-platform] Intent to prototype: Import maps

2022-04-04 Thread Yoshi Cheng-Hao Huang
Summary:
Import maps allow control over what URLs get fetched by JavaScript import
statements and import() expressions.
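
For illustration (the specifier and URLs below are hypothetical, not taken
from the original message), an inline import map remaps a bare specifier
before any module is fetched; the same mapping applies to both static imports
and import() expressions:

```
<!-- The import map must be parsed before any module script that uses it. -->
<script type="importmap">
{
  "imports": {
    "lodash": "/vendor/lodash-es/lodash.js"
  }
}
</script>

<script type="module">
  // "lodash" resolves to /vendor/lodash-es/lodash.js via the map above.
  import _ from "lodash";
  console.log(_.chunk([1, 2, 3, 4], 2));
</script>
```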

Bug:
https://bugzilla.mozilla.org/show_bug.cgi?id=1688879

Standards body:
https://wicg.github.io/import-maps/

Link to standards positions:
https://mozilla.github.io/standards-positions/#import-maps

Platform coverage:
All

Preference:
dom.importMaps.enabled, currently disabled by default.

Other browsers:
Chrome has shipped it since 89
https://chromestatus.com/feature/5315286962012160

Web-platform-tests:
https://github.com/web-platform-tests/wpt/tree/master/import-maps



RE: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Cheng, Hao
Congratulations!! Jerry, you really deserve it.

Hao

-Original Message-
From: Mridul Muralidharan [mailto:mri...@gmail.com] 
Sent: Tuesday, August 29, 2017 12:04 PM
To: Matei Zaharia 
Cc: dev ; Saisai Shao 
Subject: Re: Welcoming Saisai (Jerry) Shao as a committer

Congratulations Jerry, well deserved !

Regards,
Mridul

On Mon, Aug 28, 2017 at 6:28 PM, Matei Zaharia  wrote:
> Hi everyone,
>
> The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai has 
> been contributing to many areas of the project for a long time, so it’s great 
> to see him join. Join me in thanking and congratulating him!
>
> Matei



RE: [VOTE] HADOOP-12756 - Aliyun OSS Support branch merge

2016-09-28 Thread Cheng, Hao
That will be a great help for developers in China, whose applications are based
on Aliyun OSS.

+1(non-binding)

Hao

-Original Message-
From: Zheng, Kai [mailto:kai.zh...@intel.com] 
Sent: Wednesday, September 28, 2016 10:35 AM
To: common-dev@hadoop.apache.org
Subject: [VOTE] HADOOP-12756 - Aliyun OSS Support branch merge

Hi all,

I would like to propose a merge vote for HADOOP-12756 branch to trunk. This 
branch develops support for Aliyun OSS (another cloud storage) in Hadoop.

The voting starts now and will run for 7 days till Oct 5, 2016 07:00 PM PDT.

Aliyun OSS is widely used among China's cloud users, and currently it is not
easy to access data in Aliyun OSS from Hadoop. The branch develops a new module,
hadoop-aliyun, which provides support for accessing data in Aliyun OSS cloud
storage; this will enable more use cases and bring a better user experience for
Hadoop users. Like the existing s3a support, AliyunOSSFileSystem, a new
implementation of FileSystem backed by Aliyun OSS, is provided. During the
implementation, the contributors referred to the s3a support, keeping the same
coding style and project structure.

- The updated architecture document is here:
  https://issues.apache.org/jira/secure/attachment/12829541/Aliyun-OSS-integration-v2.pdf

- The merge patch, a diff against trunk, is posted here; it builds cleanly, with
  manual testing results posted in HADOOP-13584:
  https://issues.apache.org/jira/secure/attachment/12829738/HADOOP-13584.004.patch

- The user documentation is also provided as part of the module.

HADOOP-12756 has a set of sub-tasks, and they are ordered in the same sequence
as they were committed to HADOOP-12756. Hopefully this will make it easier to
review.

What I want to emphasize is: this is a fundamental implementation aiming at
guaranteeing functionality and stability. The major functionality has been
running in production environments for some while. There are definitely
performance optimizations that we can do, like the community has done for the
existing s3a and azure support. Merging this to trunk would serve as a very
good starting point for the following optimizations, aligning with the related
efforts.

The new hadoop-aliyun module is made possible owing to many people. Thanks to
the contributors Mingfei Shi, Genmao Yu and Ling Zhou; thanks to Cheng Hao,
Steve Loughran, Chris Nauroth, Yi Liu, Lei (Eddy) Xu, Uma Maheswara Rao G and
Allen Wittenauer for their kind reviews and guidance. Also thanks to Arpit
Agarwal, Andrew Wang and Anu Engineer for the great process discussions to
bring this up.

Please kindly vote. Thanks in advance!

Regards,
Kai





[jira] [Commented] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces

2016-08-31 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451914#comment-15451914
 ] 

Cheng Hao commented on SPARK-17299:
---

Or come after SPARK-14878 ?

> TRIM/LTRIM/RTRIM strips characters other than spaces
> 
>
> Key: SPARK-17299
> URL: https://issues.apache.org/jira/browse/SPARK-17299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Jeremy Beard
>Priority: Minor
>
> TRIM/LTRIM/RTRIM docs state that they only strip spaces:
> http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column)
> But the implementation strips all characters of ASCII value 20 or less:
> https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470






[jira] [Commented] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces

2016-08-31 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451810#comment-15451810
 ] 

Cheng Hao commented on SPARK-17299:
---

Yes, that's my bad, I thought it should have the same behavior as
`String.trim()`. We should fix this bug. [~jbeard], can you please fix it?

> TRIM/LTRIM/RTRIM strips characters other than spaces
> 
>
> Key: SPARK-17299
> URL: https://issues.apache.org/jira/browse/SPARK-17299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Jeremy Beard
>Priority: Minor
>
> TRIM/LTRIM/RTRIM docs state that they only strip spaces:
> http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column)
> But the implementation strips all characters of ASCII value 20 or less:
> https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470






RE: [VOTE] Release of Apache Mnemonic-0.2.0-incubating [rc3]

2016-07-24 Thread Cheng, Hao
+1

Thanks,
Hao

-Original Message-
From: Gary [mailto:ga...@apache.org] 
Sent: Monday, July 25, 2016 10:29 AM
To: dev@mnemonic.incubator.apache.org
Subject: Re: [VOTE] Release of Apache Mnemonic-0.2.0-incubating [rc3]

+1

Thanks.
+Gary.


On 7/22/2016 3:08 PM, Gary wrote:
> Hi all,
>
> This is a call for a releasing Apache Mnemonic 0.2.0-incubating, 
> release candidate 3. This is the second release of Mnemonic 
> incubating.
>
> The source tarball, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/incubator/mnemonic/0.2.0-incuba
> ting-rc3/src/
>
> The tag to be voted upon is v0.2.0-incubating:
> https://git-wip-us.apache.org/repos/asf?p=incubator-mnemonic.git;a=sho
> rtlog;h=refs/tags/v0.2.0-incubating
>
> The release hash is 6203b6f06177a48c860d63f72afb78eaae556ebc:
> https://git-wip-us.apache.org/repos/asf?p=incubator-mnemonic.git;a=com
> mit;h=6203b6f06177a48c860d63f72afb78eaae556ebc
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/incubator/mnemonic/KEYS
>
> KEYS file available:
> https://dist.apache.org/repos/dist/dev/incubator/mnemonic/KEYS
>
> For information about the contents of this release, see:
> https://dist.apache.org/repos/dist/dev/incubator/mnemonic/0.2.0-incuba
> ting-rc3/CHANGES.txt
>
> The vote will be open for 72 hours.
> Please download the release candidate and evaluate the necessary items 
> including checking hashes, signatures, building from source, and testing.  
> Then please vote:
>
> [ ] +1 Release this package as apache-mnemonic-0.2.0-incubating
>
> [ ] +0 no opinion
>
> [ ] -1 Do not release this package because because...
>
> Thanks,
> Gary




RE: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-14 Thread Cheng, Hao
-1

This breaks existing applications that use Script Transformation in Spark SQL,
as the default record/column delimiter class changed since we no longer get
the default conf value from HiveConf; see SPARK-16515.

This is a regression.


From: Reynold Xin [mailto:r...@databricks.com]
Sent: Friday, July 15, 2016 7:26 AM
To: dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

Updated documentation at 
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs-updated/



On Thu, Jul 14, 2016 at 11:59 AM, Reynold Xin 
> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.0.0. 
The vote is open until Sunday, July 17, 2016 at 12:00 PDT and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.0
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.0-rc4 (e5f8c1117e0c48499f54d62b556bc693435afae0).

This release candidate resolves ~2500 issues: 
https://s.apache.org/spark-2.0.0-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1192/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/


=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions from 1.x.

==
What justifies a -1 vote for this release?
==
Critical bugs impacting major functionalities.

Bugs already present in 1.x, missing features, or bugs related to new features 
will not necessarily block this release. Note that historically Spark 
documentation has been published on the website separately from the main 
release so we do not need to block the release due to documentation errors 
either.


Note: There was a mistake made during "rc3" preparation, and as a result there 
is no "rc3", but only "rc4".




[jira] [Created] (SPARK-15859) Optimize the Partition Pruning with Disjunction

2016-06-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-15859:
-

 Summary: Optimize the Partition Pruning with Disjunction
 Key: SPARK-15859
 URL: https://issues.apache.org/jira/browse/SPARK-15859
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Critical


Currently we cannot optimize partition pruning with a disjunction, for example:

{{(part1=2 and col1='abc') or (part1=5 and col1='cde')}}






[jira] [Commented] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session

2016-06-07 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318654#comment-15318654
 ] 

Cheng Hao commented on SPARK-15730:
---

[~jameszhouyi], can you please verify this fix?

> [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take 
> effect in spark-sql session
> -
>
> Key: SPARK-15730
> URL: https://issues.apache.org/jira/browse/SPARK-15730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
> --executor-cores 5 --num-executors 31 --master yarn-client --conf 
> spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
> spark-sql> use test;
> 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test
> 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at 
> CliDriver.java:376
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at 
> CliDriver.java:376) with 1 output partitions
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 
> (processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 
> (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no 
> missing parents
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values 
> in memory (estimated size 3.2 KB, free 2.4 GB)
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as 
> bytes in memory (estimated size 1964.0 B, free 2.4 GB)
> 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB)
> 16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast 
> at DAGScheduler.scala:1012
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks
> 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes)
> 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 
> on executor id: 10 hostname: 192.168.3.13.
> 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB)
> 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
> 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1)
> 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose 
> tasks have all completed, from pool
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at 
> CliDriver.java:376) finished in 1.937 s
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at 
> CliDriver.java:376, took 1.962631 s
> Time taken: 2.027 seconds
> 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds
> spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
> 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE 
> IF EXISTS ${hiveconf:RESULT_TABLE}
> Error in query:
> mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', 
> 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 
> 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 
> 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 
> 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', 
> 'TABLESAMPLE', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 
> 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', 
> 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 
> 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 
> 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
> 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', 
> 'REFRESH', 'CLEAR

RE: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-25 Thread Cheng, Hao
+1

-Original Message-
From: Jacques Nadeau [mailto:jacq...@apache.org] 
Sent: Thursday, May 26, 2016 8:26 AM
To: general@incubator.apache.org
Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator

+1 (binding)

On Wed, May 25, 2016 at 4:04 PM, John D. Ament 
wrote:

> +1
>
> On Wed, May 25, 2016 at 4:41 PM Jean-Baptiste Onofré 
> wrote:
>
> > Hi all,
> >
> > following the discussion thread, I'm now calling a vote to accept 
> > CarbonData into the Incubator.
> >
> > ​[ ] +1 Accept CarbonData into the Apache Incubator [ ] +0 Abstain [ 
> > ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> >
> > This vote is open for 72 hours.
> >
> > The proposal follows, you can also access the wiki page:
> > https://wiki.apache.org/incubator/CarbonDataProposal
> >
> > Thanks !
> > Regards
> > JB
> >
> > = Apache CarbonData =
> >
> > == Abstract ==
> >
> > Apache CarbonData is a new Apache Hadoop native file format for 
> > faster interactive query using advanced columnar storage, index, 
> > compression and encoding techniques to improve computing efficiency, 
> > in turn it will help speedup queries an order of magnitude faster 
> > over PetaBytes of data.
> >
> > CarbonData github address: 
> > https://github.com/HuaweiBigData/carbondata
> >
> > == Background ==
> >
> > Huawei is an ICT solution provider, we are committed to enhancing 
> > customer experiences for telecom carriers, enterprises, and 
> > consumers on big data, In order to satisfy the following customer 
> > requirements, we created a new Hadoop native file format:
> >
> >   * Support interactive OLAP-style query over big data in seconds.
> >   * Support fast query on individual record which require touching 
> > all fields.
> >   * Fast data loading speed and support incremental load in period 
> > of minutes.
> >   * Support HDFS so that customer can leverage existing Hadoop cluster.
> >   * Support time based data retention.
> >
> > Based on these requirements, we investigated existing file formats 
> > in the Hadoop eco-system, but we could not find a suitable solution 
> > that satisfying requirements all at the same time, so we start 
> > designing CarbonData.
> >
> > == Rationale ==
> >
> > CarbonData contains multiple modules, which are classified into two
> > categories:
> >
> >   1. CarbonData File Format: which contains core implementation for 
> > file format such as 
> > columnar,index,dictionary,encoding+compression,API for reading/writing etc.
> >   2. CarbonData integration with big data processing framework such 
> > as Apache Spark, Apache Hive etc. Apache Beam is also planned to 
> > abstract the execution runtime.
> >
> > === CarbonData File Format ===
> >
> > CarbonData file format is a columnar store in HDFS, it has many 
> > features that a modern columnar format has, such as splittable, 
> > compression schema ,complex data type etc. And CarbonData has 
> > following unique
> > features:
> >
> >  Indexing 
> >
> > In order to support fast interactive query, CarbonData leverage 
> > indexing technology to reduce I/O scans. CarbonData files stores 
> > data along with index, the index is not stored separately but the 
> > CarbonData file itself contains the index. In current 
> > implementation, CarbonData supports 3 types of indexing:
> >
> > 1. Multi-dimensional Key (B+ Tree index)
> >   The Data block are written in sequence to the disk and within each 
> > data blocks each column block is written in sequence. Finally, the 
> > metadata block for the file is written with information about byte 
> > positions of each block in the file, Min-Max statistics index and 
> > the start and end MDK of each data block. Since, the entire data in 
> > the file is in sorted order, the start and end MDK of each data 
> > block can be used to construct a B+Tree and the file can be 
> > logically  represented as a
> > B+Tree with the data blocks as leaf nodes (on disk) and the 
> > B+remaining
> > non-leaf nodes in memory.
> > 2. Inverted index
> >   Inverted index is widely used in search engine. By using this 
> > index, it helps processing/query engine to do filtering inside one HDFS 
> > block.
> > Furthermore, query acceleration for count distinct like operation is 
> > made possible when combining bitmap and inverted index in query time.
> > 3. MinMax index
> >   For all columns, minmax index is created so that processing/query 
> > engine can skip scan that is not required.
> >
> >  Global Dictionary 
> >
> > Besides I/O reduction, CarbonData accelerates computation by using 
> > global dictionary, which enables processing/query engines to perform 
> > all processing on encoded data without having to convert the data 
> > (Late Materialization). We have observed dramatic performance 
> > improvement for OLAP analytic scenario where table contains many 
> > columns in string data type. The data is converted back to the user 
> > readable form 

[jira] [Commented] (SPARK-15034) Use the value of spark.sql.warehouse.dir as the warehouse location instead of using hive.metastore.warehouse.dir

2016-05-25 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300072#comment-15300072
 ] 

Cheng Hao commented on SPARK-15034:
---

[~yhuai], but it probably won't respect `hive-site.xml`, and lots of users
will be impacted by this configuration change; will that be acceptable?

> Use the value of spark.sql.warehouse.dir as the warehouse location instead of 
> using hive.metastore.warehouse.dir
> 
>
> Key: SPARK-15034
> URL: https://issues.apache.org/jira/browse/SPARK-15034
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> Starting from Spark 2.0, spark.sql.warehouse.dir will be the conf to set 
> warehouse location. We will not use hive.metastore.warehouse.dir.






[jira] [Commented] (SPARK-13894) SQLContext.range should return Dataset[Long]

2016-03-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195274#comment-15195274
 ] 

Cheng Hao commented on SPARK-13894:
---

The existing function SQLContext.range() returns the underlying schema with
the column name "id"; lots of unit test code would need to be updated if we
changed the column name to "value". How about keeping the name "id" unchanged?

> SQLContext.range should return Dataset[Long]
> 
>
> Key: SPARK-13894
> URL: https://issues.apache.org/jira/browse/SPARK-13894
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> Rather than returning DataFrame, it should return a Dataset[Long]. The 
> documentation should still make it clear that the underlying schema consists 
> of a single long column named "value".






[jira] [Commented] (SPARK-13326) Dataset in spark 2.0.0-SNAPSHOT missing columns

2016-03-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195022#comment-15195022
 ] 

Cheng Hao commented on SPARK-13326:
---

I cannot reproduce it anymore; can you try it again?

> Dataset in spark 2.0.0-SNAPSHOT missing columns
> ---
>
> Key: SPARK-13326
> URL: https://issues.apache.org/jira/browse/SPARK-13326
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: koert kuipers
>Priority: Minor
>
> i noticed some things stopped working on datasets in spark 2.0.0-SNAPSHOT, 
> and with a confusing error message (cannot resolved some column with input 
> columns []).
> for example in 1.6.0-SNAPSHOT:
> {noformat}
> scala> val ds = sc.parallelize(1 to 10).toDS
> ds: org.apache.spark.sql.Dataset[Int] = [value: int]
> scala> ds.map(x => Option(x))
> res0: org.apache.spark.sql.Dataset[Option[Int]] = [value: int]
> {noformat}
> and same commands in 2.0.0-SNAPSHOT:
> {noformat}
> scala> val ds = sc.parallelize(1 to 10).toDS
> ds: org.apache.spark.sql.Dataset[Int] = [value: int]
> scala> ds.map(x => Option(x))
> org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input 
> columns: [];
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:283)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:162)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:172)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:176)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:176)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:181)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(

RE: [VOTE] Accept Mnemonic into the Apache Incubator

2016-03-03 Thread Cheng, Hao
erested in innovative memory project to fit 
> > large sized persistent memory and storage devices. Various Apache 
> > projects such as Apache Spark™, Apache HBase™, Apache Phoenix™, 
> > Apache Flink™, Apache Cassandra™ etc. can take good advantage of 
> > this project to overcome serialization/de-serialization, Java GC, 
> > and caching issues. We expect a wide range of interest will be 
> > generated after we open source this project to Apache.
> >
> >  Reliance on Salaried Developers  All developers are paid by 
> > their employers to contribute to this project. We welcome all others 
> > to contribute to this project after it is open sourced.
> >
> >  Relationships with Other Apache Product  Relationship with 
> > Apache™ Arrow:
> > Arrow's columnar data layout allows great use of CPU caches & SIMD. 
> > It places all data that relevant to a column operation in a compact 
> > format in memory.
> >
> > Mnemonic directly puts the whole business object graphs on external 
> > heterogeneous storage media, e.g. off-heap, SSD. It is not necessary 
> > to normalize the structures of object graphs for caching, checkpoint 
> > or storing. It doesn’t require developers to normalize their data 
> > object graphs. Mnemonic applications can avoid indexing & join 
> > datasets compared to traditional approaches.
> >
> > Mnemonic can leverage Arrow to transparently re-layout qualified 
> > data objects or create special containers that is able to 
> > efficiently hold those data records in columnar form as one of major 
> > performance optimization constructs.
> >
> > Mnemonic can be integrated into various Big Data and Cloud 
> > frameworks and applications.
> > We are currently working on several Apache projects with Mnemonic:
> > For Apache Spark™ we are integrating Mnemonic to improve:
> > a) Local checkpoints
> > b) Memory management for caching
> > c) Persistent memory datasets input
> > d) Non-Volatile RDD operations
> > The best use case for Apache Spark™ computing is that the input data 
> > is stored in form of Mnemonic native storage to avoid caching its 
> > row data for iterative processing. Moreover, Spark applications can 
> > leverage Mnemonic to perform data transforming in persistent or 
> > non-persistent memory without SerDes.
> >
> > For Apache™ Hadoop®, we are integrating HDFS Caching with Mnemonic 
> > instead of mmap. This will take advantage of persistent memory 
> > related features. We also plan to evaluate to integrate in Namenode 
> > Editlog, FSImage persistent data into Mnemonic persistent memory area.
> >
> > For Apache HBase™, we are using Mnemonic for BucketCache and 
> > evaluating performance improvements.
> >
> > We expect Mnemonic will be further developed and integrated into 
> > many Apache BigData projects and so on, to enhance memory management 
> > solutions for much improved performance and reliability.
> >
> >  An Excessive Fascination with the Apache Brand  While we 
> > expect Apache brand helps to attract more contributors, our 
> > interests in starting this project is based on the factors mentioned 
> > in the Rationale section.
> >
> > We would like Mnemonic to become an Apache project to further foster 
> > a healthy community of contributors and consumers in BigData 
> > technology R areas. Since Mnemonic can directly benefit many 
> > Apache projects and solves major performance problems, we expect the 
> > Apache Software Foundation to increase interaction with the larger 
> > community as well.
> >
> > === Documentation ===
> > The documentation is currently available at Intel and will be posted
> > under: https://mnemonic.incubator.apache.org/docs
> >
> > === Initial Source ===
> > Initial source code is temporary hosted Github for general viewing:
> > https://github.com/NonVolatileComputing/Mnemonic.git
> > It will be moved to Apache http://git.apache.org/ after podling.
> >
> > The initial Source is written in Java code (88%) and mixed with JNI 
> > C code (11%) and shell script (1%) for underlying native allocation 
> > libraries.
> >
> > === Source and Intellectual Property Submission Plan === As soon as 
> > Mnemonic is approved to join the Incubator, the source code will be 
> > transitioned via the Software Grant Agreement onto ASF 
> > infrastructure and in turn made available under the Apache License, 
> > version 2.0.
> >
> > === External Dependencies ===
> > The required external dependencies are all

RE: Spark Streaming - graceful shutdown when stream has no more data

2016-02-24 Thread Cheng, Hao
This is very interesting: how to shut down the streaming job gracefully once
there has been no input data for some time.

A doable solution: you can count the input data by using an Accumulator, and
another thread (on the master node) can keep fetching the latest accumulator
value; if the value has not changed for some time, then shut down the
streaming job.

From: Daniel Siegmann [mailto:daniel.siegm...@teamaol.com]
Sent: Wednesday, February 24, 2016 12:30 AM
To: Ashutosh Kumar 
Cc: Hemant Bhanawat ; Ted Yu ; Femi 
Anthony ; user 
Subject: Re: Spark Streaming - graceful shutdown when stream has no more data

During testing you will typically be using some finite data. You want the 
stream to shut down automatically when that data has been consumed so your test 
shuts down gracefully.
Of course once the code is running in production you'll want it to keep waiting 
for new records. So whether the stream shuts down when there's no more data 
should be configurable.


On Tue, Feb 23, 2016 at 11:09 AM, Ashutosh Kumar 
> wrote:
Just out of curiosity, I would like to know why a streaming program should
shut down when no new data is arriving. I think it should keep waiting for the
arrival of new records.
Thanks
Ashutosh

On Tue, Feb 23, 2016 at 9:17 PM, Hemant Bhanawat 
> wrote:
A guess: parseRecord is returning None in some cases (probably empty lines),
and then entry.get is throwing the exception.
You may want to filter the None values from accessLogDStream before you run
the map function over it.
Hemant

Hemant Bhanawat
www.snappydata.io

On Tue, Feb 23, 2016 at 6:00 PM, Ted Yu 
> wrote:
Which line is line 42 in your code ?

When variable lines becomes empty, you can stop your program.

Cheers

On Feb 23, 2016, at 12:25 AM, Femi Anthony 
> wrote:

I am working on Spark Streaming API and I wish to stream a set of 
pre-downloaded web log files continuously to simulate a real-time stream. I 
wrote a script that gunzips the compressed logs and pipes the output to nc on 
port .

The script looks like this:

BASEDIR=/home/mysuer/data/datamining/internet_traffic_archive

zipped_files=`find $BASEDIR -name "*.gz"`



for zfile in $zipped_files

 do

  echo "Unzipping $zfile..."

  gunzip -c $zfile  | nc -l -p  -q 20



 done
I have streaming code written in Scala that processes the streams. It works 
well for the most part, but when its run out of files to stream I get the 
following error in Spark:



16/02/19 23:04:35 WARN ReceiverSupervisorImpl:

Restarting receiver with delay 2000 ms: Socket data stream had no more data

16/02/19 23:04:35 ERROR ReceiverTracker: Deregistered receiver for stream 0:

Restarting receiver with delay 2000ms: Socket data stream had no more data

16/02/19 23:04:35 WARN BlockManager: Block input-0-1455941075600 replicated to 
only 0 peer(s) instead of 1 peers



16/02/19 23:04:40 ERROR Executor: Exception in task 2.0 in stage 15.0 (TID 47)

java.util.NoSuchElementException: None.get

at scala.None$.get(Option.scala:313)

at scala.None$.get(Option.scala:311)

at 
com.femibyte.learningsparkaddexamples.scala.StreamingLogEnhanced$$anonfun$2.apply(StreamingLogEnhanced.scala:42)

at 
com.femibyte.learningsparkaddexamples.scala.StreamingLogEnhanced$$anonfun$2.apply(StreamingLogEnhanced.scala:42)

How do I implement a graceful shutdown so that the program exits gracefully
when it no longer detects any data in the stream?

My Spark Streaming code looks like this:

object StreamingLogEnhanced {

 def main(args: Array[String]) {

  val master = args(0)

  val conf = new

 SparkConf().setMaster(master).setAppName("StreamingLogEnhanced")

 // Create a StreamingContext with a n second batch size

  val ssc = new StreamingContext(conf, Seconds(10))

 // Create a DStream from all the input on port 

  val log = Logger.getLogger(getClass.getName)



  sys.ShutdownHookThread {

  log.info("Gracefully stopping Spark Streaming Application")

  ssc.stop(true, true)

  log.info("Application stopped")

  }

  val lines = ssc.socketTextStream("localhost", )

  // Create a count of log hits by ip

  var ipCounts=countByIp(lines)

  ipCounts.print()



  // start our streaming context and wait for it to "finish"

  ssc.start()

  // Wait for 600 seconds then exit

  ssc.awaitTermination(1*600)

  ssc.stop()

  }



 def countByIp(lines: DStream[String]) = {

   val parser = new AccessLogParser

   val accessLogDStream = lines.map(line => parser.parseRecord(line))

   val ipDStream = accessLogDStream.map(entry =>


[jira] [Commented] (HADOOP-12756) Incorporate Aliyun OSS file system implementation

2016-02-03 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131540#comment-15131540
 ] 

Cheng Hao commented on HADOOP-12756:


Thank you so much  [~ste...@apache.org], [~cnauroth] for the comments and 
suggestions, that's really helpful.

Aliyun OSS has a large number of users in China, and we can see it growing
fast in the future; we realize that being part of the Hadoop file system will
make life easier for OSS users, particularly application developers from
different ecosystems (Pig, Tez, Impala, Spark, Flink, HBase, Tachyon etc.).
Hadoop is probably the only common area that all of them are familiar with,
and that's the motive for us working on it. And as part of the collaboration
with Aliyun, Intel (and Aliyun) has a strong will to maintain the code and
keep it in high quality.

> Incorporate Aliyun OSS file system implementation
> -
>
> Key: HADOOP-12756
> URL: https://issues.apache.org/jira/browse/HADOOP-12756
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs
>Reporter: shimingfei
>Assignee: shimingfei
> Attachments: OSS integration.pdf
>
>
> Aliyun OSS is widely used among China’s cloud users, but currently it is not 
> easy to access data laid on OSS storage from user’s Hadoop/Spark application, 
> because of no original support for OSS in Hadoop.
> This work aims to integrate Aliyun OSS with Hadoop. By simple configuration, 
> Spark/Hadoop applications can read/write data from OSS without any code 
> change. Narrowing the gap between user’s APP and data storage, like what have 
> been done for S3 in Hadoop 





[jira] [Commented] (HADOOP-12756) Incorporate Aliyun OSS file system implementation

2016-02-01 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127645#comment-15127645
 ] 

Cheng Hao commented on HADOOP-12756:


+1 This is critical for AliYun users when integrated with MapReduce/Spark etc. 

> Incorporate Aliyun OSS file system implementation
> -
>
> Key: HADOOP-12756
> URL: https://issues.apache.org/jira/browse/HADOOP-12756
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs
>Reporter: shimingfei
>Assignee: shimingfei
> Attachments: OSS integration.pdf
>
>
> Aliyun OSS is widely used among China’s cloud users, but currently it is not 
> easy to access data laid on OSS storage from user’s Hadoop/Spark application, 
> because of no original support for OSS in Hadoop.
> This work aims to integrate Aliyun OSS with Hadoop. By simple configuration, 
> Spark/Hadoop applications can read/write data from OSS without any code 
> change. Narrowing the gap between user’s APP and data storage, like what have 
> been done for S3 in Hadoop 





RE: Spark SQL joins taking too long

2016-01-27 Thread Cheng, Hao
Another possibility is the parallelism: it is probably 1 or some other small
value, since the input data size is not that big.

If that is the case, you can try something like:

df1.repartition(10).registerTempTable("hospitals");
df2.repartition(10).registerTempTable("counties");
…
and then do the join.


From: Raghu Ganti [mailto:raghuki...@gmail.com]
Sent: Thursday, January 28, 2016 3:06 AM
To: Ted Yu; Дмитро Попович
Cc: user
Subject: Re: Spark SQL joins taking too long

The problem is with the way the Spark query plan is being created, IMO. What
was happening before is that the order of the tables mattered: when the larger
table is given first, it took a very long time (~53 mins to complete). I
changed the order of the tables, with the smaller one first (including
replacing the single-row table with the entire one), and modified the query to
look like this:

SELECT c.NAME, h.name FROM counties c, hospitals h WHERE c.NAME = 'Dutchess'
AND ST_Intersects(c.shape, h.location)

With the above query, things worked like a charm (<1 min to finish the entire
execution and join on 3141 polygons with 6.5k points).
Do let me know if you need more info in order to pin point the issue.
Regards,
Raghu

On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu 
> wrote:
What's the type of shape column ?

Can you disclose what SomeUDF does (by showing the code) ?

Cheers

On Tue, Jan 26, 2016 at 12:41 PM, raghukiran 
> wrote:
Hi,

I create two tables, one counties with just one row (it actually has 2k
rows, but I used only one) and another hospitals, which has 6k rows. The
join command I use is as follows, which takes way too long to run and has
never finished successfully (even after nearly 10mins). The following is
what I have:

DataFrame df1 = ...
df1.registerTempTable("hospitals");
DataFrame df2 = ...
df2.registerTempTable("counties"); // has only one row right now
DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
counties c ON SomeUDF(c.shape, h.location)");
long count = joinDf.count(); // this takes too long!

// whereas the following, which is the exact equivalent of the above, gets done
// very quickly!
DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals WHERE
SomeUDF('c.shape as string', h.location)");
long count = joinDf.count(); // gives me the correct answer of 8

Any suggestions on what I can do to optimize and debug this piece of code?

Regards,
Raghu







RE: JSON to SQL

2016-01-27 Thread Cheng, Hao
Have you ever tried the DataFrame API, like
sqlContext.read.json("/path/to/file.json")? Spark SQL will auto-infer the
type/schema for you.

And lateral view will help with the flattening issues,
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView, as
well as the "a.b[0].c" format of expression.


From: Andrés Ivaldi [mailto:iaiva...@gmail.com]
Sent: Thursday, January 28, 2016 3:39 AM
To: Sahil Sareen
Cc: Al Pivonka; user
Subject: Re: JSON to SQL

I'm really brand new to Scala, but if I'm defining a case class, then isn't it
because I already know what the JSON's structure is?

If I'm able to define a case class dynamically from the JSON structure, then
even with Spark I will be able to extract the data.

On Wed, Jan 27, 2016 at 4:01 PM, Sahil Sareen 
> wrote:
Isn't this just about defining a case class and using 
parse(json).extract[CaseClassName]  using Jackson?

-Sahil

On Wed, Jan 27, 2016 at 11:08 PM, Andrés Ivaldi 
> wrote:
We don't have Domain Objects; it's a service like a pipeline: data is read
from the source and saved into a relational database.

I can read the structure from DataFrames and do some transformations; I would
prefer to do it with Spark to be consistent with the process.


On Wed, Jan 27, 2016 at 12:25 PM, Al Pivonka 
> wrote:
Are you using a relational database?
If so, why not use a NoSQL DB and then pull from it into your relational one?

Or utilize a library that understands JSON structure, like Jackson, to obtain
the data from the JSON structure and then persist the Domain Objects?

On Wed, Jan 27, 2016 at 9:45 AM, Andrés Ivaldi 
> wrote:
Sure,
The job is like an ETL, but without an interface, so I decide the rules of how
the JSON will be saved into a SQL table.

I need to flatten the hierarchies where possible (in the case of lists, flatten
them as well); nested objects won't be processed for now.

{"a":1,"b":[2,3],"c":"Field", "d":[4,5,6,7,8] }
{"a":11,"b":[22,33],"c":"Field1", "d":[44,55,66,77,88] }
{"a":111,"b":[222,333],"c":"Field2", "d":[44,55,666,777,888] }

I would like something like this in my SQL table:

a    b        c       d
1    2,3      Field   4,5,6,7,8
11   22,33    Field1  44,55,66,77,88
111  222,333  Field2  444,555,666,777,888

Right now this is what I need.
I will later add more intelligence, like detection of lists or nested objects,
and create relations in other tables.



On Wed, Jan 27, 2016 at 11:25 AM, Al Pivonka 
> wrote:
More detail is needed.
Can you provide some context to the use-case ?

On Wed, Jan 27, 2016 at 8:33 AM, Andrés Ivaldi 
> wrote:
Hello, I'm trying to save a JSON file into a SQL table.

If I try to do this directly, an IllegalArgumentException is raised; I suppose
this is because JSON has a hierarchical structure, is that correct?

If that is the problem, how can I flatten the JSON structure? The JSON
structure to be processed would be unknown, so I need to do it programmatically.

regards
--
Ing. Ivaldi Andres



--
Those who say it can't be done, are usually interrupted by those doing it.


--
Ing. Ivaldi Andres



--
Those who say it can't be done, are usually interrupted by those doing it.


--
Ing. Ivaldi Andres




--
Ing. Ivaldi Andres


RE: HiBench as part of Bigtop

2016-01-26 Thread Cheng, Hao
Which way does the community prefer? Just packaging, or putting the entire
project into Bigtop?

-Original Message-
From: RJ Nowling [mailto:rnowl...@gmail.com] 
Sent: Wednesday, January 27, 2016 11:24 AM
To: dev@bigtop.apache.org
Subject: Re: HiBench as part of Bigtop

A few of us have been working on BigPetStore, a family of example applications 
(blueprints).  We've also been working on data generators to create relatively 
complex fake data for those demo apps.

On Tue, Jan 26, 2016 at 9:10 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> Hi RJ,
>
> Currently, we are targeting packaging HiBench in Bigtop; but is 
> there any other subproject that is entirely part of Bigtop?
>
> Thanks,
> Hao
>
> -Original Message-
> From: RJ Nowling [mailto:rnowl...@gmail.com]
> Sent: Wednesday, January 27, 2016 11:04 AM
> To: dev@bigtop.apache.org
> Subject: Re: HiBench as part of Bigtop
>
> Hi Hao,
>
> Are you interested in just packaging HiBench in Bigtop or contributing 
> the entire project to Bigtop?
>
> Is there someone who would be willing to take responsibility for 
> maintenance?
>
> We generally like to find a maintainer for each package and add them 
> to our maintainers list.  This way we know who to contact if a build 
> fails and blocks a release.  If there is no maintainer (and the build 
> has problems), we use that as grounds for removing the package from Bigtop.
>
> Thanks!
> RJ
>
>
> On Tue, Jan 26, 2016 at 1:15 AM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
> > Thank you Cos for the reply, we will take a look at YCSB. :)
> >
> > Hao
> >
> > -Original Message-
> > From: Konstantin Boudnik [mailto:c...@apache.org]
> > Sent: Tuesday, January 26, 2016 2:33 PM
> > To: dev@bigtop.apache.org
> > Subject: Re: HiBench as part of Bigtop
> >
> > Hi Hao.
> >
> > Thanks for your interest in the project and for considering us as 
> > the new home for this benchmarking tool! We already have YCSB as the 
> > part of stack, but clearly the more the merrier ;)
> >
> > A quick look at the source code shows that HiBench is already 
> > licensed under ASL2, so the rest of it should be just a SMOP ;) As 
> > you know, we do package all our components for both deb and rpm 
> > package managers, deployment code, as well as do have requirements 
> > for package and integration testing. I think the good starting point 
> > would be to take a look at YCSB
> >
> > bigtop-packages/src/*/ycsb/
> > bigtop-deploy/puppet/modules/ycsb/
> >
> > and/or other components in the stack. This link
> >
> > https://cwiki.apache.org/confluence/display/BIGTOP/How+to+Contribute
> >
> > should also be helpful to understand the process of contributing 
> > into the project. And you already found dev@ list, so you'll set ;)
> >
> > I wonder how others in the community see this?
> >   Cos
> >
> > On Tue, Jan 26, 2016 at 05:37AM, Cheng, Hao wrote:
> > > Dear BigTop Devs,
> > >
> > > I am from Intel Big Data Technology team, and we are the owner of 
> > > HiBench, an open source benchmark suite for Hadoop / Spark 
> > > ecosystem; as widely used, more and more trivial requirements from 
> > > HiBench users, due to the limited resources, particularly the ease 
> > > of deployment, we are exploring the possibility of getting the 
> > > code
> into BigTop.
> > >
> > > HiBench code can be found at:
> > > https://github.com/intel-hadoop/HiBench
> > >
> > > Looking forward to your reply.
> > >
> > > Regards,
> > > Hao
> > >
> >
>


RE: HiBench as part of Bigtop

2016-01-26 Thread Cheng, Hao
Hi RJ,

Currently, we are targeting packaging HiBench in Bigtop; but are there any 
other subprojects that are entirely part of Bigtop?

Thanks,
Hao

-Original Message-
From: RJ Nowling [mailto:rnowl...@gmail.com] 
Sent: Wednesday, January 27, 2016 11:04 AM
To: dev@bigtop.apache.org
Subject: Re: HiBench as part of Bigtop

Hi Hao,

Are you interested in just packaging HiBench in Bigtop or contributing the 
entire project to Bigtop?

Is there someone who would be willing to take responsibility for maintenance?

We generally like to find a maintainer for each package and add them to our 
maintainers list.  This way we know who to contact if a build fails and blocks 
a release.  If there is no maintainer (and the build has problems), we use that 
as grounds for removing the package from Bigtop.

Thanks!
RJ


On Tue, Jan 26, 2016 at 1:15 AM, Cheng, Hao <hao.ch...@intel.com> wrote:

> Thank you Cos for the reply, we will take a look at YCSB. :)
>
> Hao
>
> -Original Message-
> From: Konstantin Boudnik [mailto:c...@apache.org]
> Sent: Tuesday, January 26, 2016 2:33 PM
> To: dev@bigtop.apache.org
> Subject: Re: HiBench as part of Bigtop
>
> Hi Hao.
>
> Thanks for your interest in the project and for considering us as the 
> new home for this benchmarking tool! We already have YCSB as the part 
> of stack, but clearly the more the merrier ;)
>
> A quick look at the source code shows that HiBench is already licensed 
> under ASL2, so the rest of it should be just a SMOP ;) As you know, we 
> do package all our components for both deb and rpm package managers, 
> deployment code, as well as do have requirements for package and 
> integration testing. I think the good starting point would be to take 
> a look at YCSB
>
> bigtop-packages/src/*/ycsb/
> bigtop-deploy/puppet/modules/ycsb/
>
> and/or other components in the stack. This link
> 
> https://cwiki.apache.org/confluence/display/BIGTOP/How+to+Contribute
>
> should also be helpful to understand the process of contributing into 
> the project. And you already found dev@ list, so you'll set ;)
>
> I wonder how others in the community see this?
>   Cos
>
> On Tue, Jan 26, 2016 at 05:37AM, Cheng, Hao wrote:
> > Dear BigTop Devs,
> >
> > I am from Intel Big Data Technology team, and we are the owner of 
> > HiBench, an open source benchmark suite for Hadoop / Spark 
> > ecosystem; as widely used, more and more trivial requirements from 
> > HiBench users, due to the limited resources, particularly the ease 
> > of deployment, we are exploring the possibility of getting the code into 
> > BigTop.
> >
> > HiBench code can be found at: 
> > https://github.com/intel-hadoop/HiBench
> >
> > Looking forward to your reply.
> >
> > Regards,
> > Hao
> >
>


RE: HiBench as part of Bigtop

2016-01-25 Thread Cheng, Hao
Thank you Cos for the reply, we will take a look at YCSB. :)

Hao

-Original Message-
From: Konstantin Boudnik [mailto:c...@apache.org] 
Sent: Tuesday, January 26, 2016 2:33 PM
To: dev@bigtop.apache.org
Subject: Re: HiBench as part of Bigtop

Hi Hao.

Thanks for your interest in the project and for considering us as the new home 
for this benchmarking tool! We already have YCSB as the part of stack, but 
clearly the more the merrier ;)

A quick look at the source code shows that HiBench is already licensed under 
ASL2, so the rest of it should be just a SMOP ;) As you know, we do package all 
our components for both deb and rpm package managers, deployment code, as well 
as do have requirements for package and integration testing. I think the good 
starting point would be to take a look at YCSB

bigtop-packages/src/*/ycsb/
bigtop-deploy/puppet/modules/ycsb/

and/or other components in the stack. This link
https://cwiki.apache.org/confluence/display/BIGTOP/How+to+Contribute

should also be helpful for understanding the process of contributing to the 
project. And you already found the dev@ list, so you're all set ;)

I wonder how others in the community see this?
  Cos

On Tue, Jan 26, 2016 at 05:37AM, Cheng, Hao wrote:
> Dear BigTop Devs,
> 
> I am from Intel Big Data Technology team, and we are the owner of 
> HiBench, an open source benchmark suite for Hadoop / Spark ecosystem; 
> as widely used, more and more trivial requirements from HiBench users, 
> due to the limited resources, particularly the ease of deployment, we 
> are exploring the possibility of getting the code into BigTop.
> 
> HiBench code can be found at: https://github.com/intel-hadoop/HiBench
> 
> Looking forward to your reply.
> 
> Regards,
> Hao
> 


HiBench as part of Bigtop

2016-01-25 Thread Cheng, Hao
Dear BigTop Devs,

I am from the Intel Big Data Technology team, and we are the owners of HiBench, an 
open source benchmark suite for the Hadoop / Spark ecosystem. As it is widely used, 
we receive more and more requests from HiBench users, particularly around ease of 
deployment; given our limited resources, we are exploring the possibility of 
getting the code into Bigtop.

HiBench code can be found at: https://github.com/intel-hadoop/HiBench

Looking forward to your reply.

Regards,
Hao



[jira] [Updated] (SPARK-12610) Add Anti Join Operators

2016-01-03 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-12610:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-4226

> Add Anti Join Operators
> ---
>
> Key: SPARK-12610
> URL: https://issues.apache.org/jira/browse/SPARK-12610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Cheng Hao
>
> We need to implements the anti join operators, for supporting the NOT 
> predicates in subquery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12610) Add Anti Join Operators

2016-01-03 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-12610:
-

 Summary: Add Anti Join Operators
 Key: SPARK-12610
 URL: https://issues.apache.org/jira/browse/SPARK-12610
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao


We need to implement the anti join operators, to support the NOT 
predicates in subqueries.
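
For illustration, a sketch of the query shape this targets and the current 
workaround it would replace (table names t1/t2 and key column k are made up 
for the example):

{code}
// Target query shape (not plannable today without an anti join operator):
//   SELECT * FROM t1 WHERE t1.k NOT IN (SELECT k FROM t2)

// Current workaround (assuming k is not nullable; NOT IN adds extra null
// semantics): a left outer join plus a null filter on the right side, which is
// what a dedicated anti join operator would express directly.
val t1 = sqlContext.table("t1")
val t2 = sqlContext.table("t2")
val antiJoin = t1.join(t2, t1("k") === t2("k"), "left_outer")
  .filter(t2("k").isNull)
  .select(t1("k"))
{code}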



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
Which version are you using? Have you tried the 1.6?

From: Vadim Tkachenko [mailto:apache...@gmail.com]
Sent: Wednesday, December 30, 2015 10:17 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Problem with WINDOW functions?

When I allocate 200g to executor, it is able to make better progress,
that is I see 189 tasks executed instead of 169 previously.
But eventually it fails with the same error.

On Tue, Dec 29, 2015 at 5:58 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
Is there any improvement if you set a bigger memory for executors?

-Original Message-
From: va...@percona.com<mailto:va...@percona.com> 
[mailto:va...@percona.com<mailto:va...@percona.com>] On Behalf Of Vadim 
Tkachenko
Sent: Wednesday, December 30, 2015 9:51 AM
To: Cheng, Hao
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Problem with WINDOW functions?

Hi,

I am getting the same error with write.parquet("/path/to/file") :
 WARN HeartbeatReceiver: Removing executor 0 with no recent
heartbeats: 160714 ms exceeds timeout 12 ms
15/12/30 01:49:05 ERROR TaskSchedulerImpl: Lost executor 0 on
10.10.7.167<http://10.10.7.167>: Executor heartbeat timed out after 160714 ms


On Tue, Dec 29, 2015 at 5:35 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
> Can you try to write the result into another file instead? Let's see if there 
> any issue in the executors side .
>
> sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day
> ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> 20").sort($"day",$"rank").write.parquet("/path/to/file")
>
> -Original Message-
> From: vadimtk [mailto:apache...@gmail.com<mailto:apache...@gmail.com>]
> Sent: Wednesday, December 30, 2015 9:29 AM
> To: user@spark.apache.org<mailto:user@spark.apache.org>
> Subject: Problem with WINDOW functions?
>
> Hi,
>
> I can't successfully execute a query with WINDOW function.
>
> The statements are following:
>
> val orcFile =
> sqlContext.read.parquet("/data/flash/spark/dat14sn").filter("upper(pro
> ject)='EN'")
> orcFile.registerTempTable("d1")
>  sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day
> ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> 20").sort($"day",$"rank").collect().foreach(println)
>
> with default
> spark.driver.memory
>
> I am getting java.lang.OutOfMemoryError: Java heap space.
> The same if I set spark.driver.memory=10g.
>
> When I set spark.driver.memory=45g (the box has 256GB of RAM) the execution 
> fails with a different error:
>
> 15/12/29 23:03:19 WARN HeartbeatReceiver: Removing executor 0 with no
> recent
> heartbeats: 129324 ms exceeds timeout 12 ms
>
> And I see that GC takes a lot of time.
>
> What is a proper way to execute statements above?
>
> I see the similar problems reported
> http://stackoverflow.com/questions/32196859/org-apache-spark-shuffle-f
> etchfailedexception
> http://stackoverflow.com/questions/32544478/spark-memory-settings-for-
> count-action-in-a-big-table
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-WINDO
> W-functions-tp25833.html Sent from the Apache Spark User List mailing
> list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: 
> user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org> 
> For
> additional commands, e-mail: 
> user-h...@spark.apache.org<mailto:user-h...@spark.apache.org>
>



RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
Can you try to write the result into another file instead? Let's see if there is 
any issue on the executor side.

sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day ORDER BY 
pageviews DESC) as rank FROM d1").filter("rank <=
20").sort($"day",$"rank").write.parquet("/path/to/file")

-Original Message-
From: vadimtk [mailto:apache...@gmail.com] 
Sent: Wednesday, December 30, 2015 9:29 AM
To: user@spark.apache.org
Subject: Problem with WINDOW functions?

Hi,

I can't successfully execute a query with WINDOW function.

The statements are following:

val orcFile =
sqlContext.read.parquet("/data/flash/spark/dat14sn").filter("upper(project)='EN'")
orcFile.registerTempTable("d1")
 sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day ORDER BY 
pageviews DESC) as rank FROM d1").filter("rank <=
20").sort($"day",$"rank").collect().foreach(println)

with default
spark.driver.memory 

I am getting java.lang.OutOfMemoryError: Java heap space.
The same if I set spark.driver.memory=10g.

When I set spark.driver.memory=45g (the box has 256GB of RAM) the execution 
fails with a different error:

15/12/29 23:03:19 WARN HeartbeatReceiver: Removing executor 0 with no recent
heartbeats: 129324 ms exceeds timeout 12 ms

And I see that GC takes a lot of time.

What is a proper way to execute statements above?

I see the similar problems reported
http://stackoverflow.com/questions/32196859/org-apache-spark-shuffle-fetchfailedexception
http://stackoverflow.com/questions/32544478/spark-memory-settings-for-count-action-in-a-big-table









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-WINDOW-functions-tp25833.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
Is there any improvement if you give the executors more memory?
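
For example, a minimal sketch of what that could look like when building the 
context — the values are only placeholders to tune for your cluster, and 
spark.network.timeout is included because of the heartbeat timeouts in your log:

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "16g")    // placeholder value: raise the per-executor heap
  .set("spark.network.timeout", "600s")   // the heartbeat expiry in the log is derived from this
val sc = new org.apache.spark.SparkContext(conf)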

-Original Message-
From: va...@percona.com [mailto:va...@percona.com] On Behalf Of Vadim Tkachenko
Sent: Wednesday, December 30, 2015 9:51 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Problem with WINDOW functions?

Hi,

I am getting the same error with write.parquet("/path/to/file") :
 WARN HeartbeatReceiver: Removing executor 0 with no recent
heartbeats: 160714 ms exceeds timeout 12 ms
15/12/30 01:49:05 ERROR TaskSchedulerImpl: Lost executor 0 on
10.10.7.167: Executor heartbeat timed out after 160714 ms


On Tue, Dec 29, 2015 at 5:35 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
> Can you try to write the result into another file instead? Let's see if there 
> any issue in the executors side .
>
> sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day 
> ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> 20").sort($"day",$"rank").write.parquet("/path/to/file")
>
> -Original Message-
> From: vadimtk [mailto:apache...@gmail.com]
> Sent: Wednesday, December 30, 2015 9:29 AM
> To: user@spark.apache.org
> Subject: Problem with WINDOW functions?
>
> Hi,
>
> I can't successfully execute a query with WINDOW function.
>
> The statements are following:
>
> val orcFile =
> sqlContext.read.parquet("/data/flash/spark/dat14sn").filter("upper(pro
> ject)='EN'")
> orcFile.registerTempTable("d1")
>  sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day 
> ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> 20").sort($"day",$"rank").collect().foreach(println)
>
> with default
> spark.driver.memory
>
> I am getting java.lang.OutOfMemoryError: Java heap space.
> The same if I set spark.driver.memory=10g.
>
> When I set spark.driver.memory=45g (the box has 256GB of RAM) the execution 
> fails with a different error:
>
> 15/12/29 23:03:19 WARN HeartbeatReceiver: Removing executor 0 with no 
> recent
> heartbeats: 129324 ms exceeds timeout 12 ms
>
> And I see that GC takes a lot of time.
>
> What is a proper way to execute statements above?
>
> I see the similar problems reported
> http://stackoverflow.com/questions/32196859/org-apache-spark-shuffle-f
> etchfailedexception 
> http://stackoverflow.com/questions/32544478/spark-memory-settings-for-
> count-action-in-a-big-table
>
>
>
>
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-WINDO
> W-functions-tp25833.html Sent from the Apache Spark User List mailing 
> list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
> additional commands, e-mail: user-h...@spark.apache.org
>


RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
It’s not released yet; you probably need to compile it yourself. In the 
meantime, can you increase the partition number by setting 
"spark.sql.shuffle.partitions" to a bigger value?
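
Something like the following (the value 800 is only an example; the default is 200):

sqlContext.setConf("spark.sql.shuffle.partitions", "800")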

More details about your cluster size, partition size, YARN/standalone mode, 
executor resources, etc. would also be helpful in understanding your problem.

From: Vadim Tkachenko [mailto:apache...@gmail.com]
Sent: Wednesday, December 30, 2015 10:49 AM
To: Cheng, Hao
Subject: Re: Problem with WINDOW functions?

I use 1.5.2.

Where can I get 1.6? I do not see it on http://spark.apache.org/downloads.html

Thanks,
Vadim


On Tue, Dec 29, 2015 at 6:47 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
Which version are you using? Have you tried the 1.6?

From: Vadim Tkachenko [mailto:apache...@gmail.com<mailto:apache...@gmail.com>]
Sent: Wednesday, December 30, 2015 10:17 AM

To: Cheng, Hao
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Problem with WINDOW functions?

When I allocate 200g to executor, it is able to make better progress,
that is I see 189 tasks executed instead of 169 previously.
But eventually it fails with the same error.

On Tue, Dec 29, 2015 at 5:58 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
Is there any improvement if you set a bigger memory for executors?

-Original Message-
From: va...@percona.com<mailto:va...@percona.com> 
[mailto:va...@percona.com<mailto:va...@percona.com>] On Behalf Of Vadim 
Tkachenko
Sent: Wednesday, December 30, 2015 9:51 AM
To: Cheng, Hao
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Problem with WINDOW functions?

Hi,

I am getting the same error with write.parquet("/path/to/file") :
 WARN HeartbeatReceiver: Removing executor 0 with no recent
heartbeats: 160714 ms exceeds timeout 12 ms
15/12/30 01:49:05 ERROR TaskSchedulerImpl: Lost executor 0 on
10.10.7.167<http://10.10.7.167>: Executor heartbeat timed out after 160714 ms


On Tue, Dec 29, 2015 at 5:35 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
> Can you try to write the result into another file instead? Let's see if there 
> any issue in the executors side .
>
> sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day
> ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> 20").sort($"day",$"rank").write.parquet("/path/to/file")
>
> -Original Message-
> From: vadimtk [mailto:apache...@gmail.com<mailto:apache...@gmail.com>]
> Sent: Wednesday, December 30, 2015 9:29 AM
> To: user@spark.apache.org<mailto:user@spark.apache.org>
> Subject: Problem with WINDOW functions?
>
> Hi,
>
> I can't successfully execute a query with WINDOW function.
>
> The statements are following:
>
> val orcFile =
> sqlContext.read.parquet("/data/flash/spark/dat14sn").filter("upper(pro
> ject)='EN'")
> orcFile.registerTempTable("d1")
>  sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day
> ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> 20").sort($"day",$"rank").collect().foreach(println)
>
> with default
> spark.driver.memory
>
> I am getting java.lang.OutOfMemoryError: Java heap space.
> The same if I set spark.driver.memory=10g.
>
> When I set spark.driver.memory=45g (the box has 256GB of RAM) the execution 
> fails with a different error:
>
> 15/12/29 23:03:19 WARN HeartbeatReceiver: Removing executor 0 with no
> recent
> heartbeats: 129324 ms exceeds timeout 12 ms
>
> And I see that GC takes a lot of time.
>
> What is a proper way to execute statements above?
>
> I see the similar problems reported
> http://stackoverflow.com/questions/32196859/org-apache-spark-shuffle-f
> etchfailedexception
> http://stackoverflow.com/questions/32544478/spark-memory-settings-for-
> count-action-in-a-big-table
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-WINDO
> W-functions-tp25833.html Sent from the Apache Spark User List mailing
> list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: 
> user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org> 
> For
> additional commands, e-mail: 
> user-h...@spark.apache.org<mailto:user-h...@spark.apache.org>
>




RE: Does Spark SQL support rollup like HQL

2015-12-29 Thread Cheng, Hao
Hi, currently the simple SQL parser of SQLContext is quite weak and doesn’t 
support rollup, but you can check the code at

https://github.com/apache/spark/pull/5080/ , which aimed to add this support, 
in case you want to patch it into your own branch.

In Spark 2.0, the simple SQL parser will be replaced by the HQL parser, so this 
will no longer be a problem.
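
Until then, a sketch of the two routes that work today without patching the 
parser (assuming a hiveContext is available and a table t with columns a, b, c, 
as in your example):

import org.apache.spark.sql.functions.sum

// Route 1: DataFrame API rollup (works on a plain SQLContext as well)
val df = hiveContext.table("t")
val viaApi = df.rollup(df("a"), df("b")).agg(sum(df("c")))

// Route 2: SQL syntax through a HiveContext, whose parser accepts WITH ROLLUP
val viaSql = hiveContext.sql("SELECT a, b, sum(c) FROM t GROUP BY a, b WITH ROLLUP")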

Hao

From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID]
Sent: Wednesday, December 30, 2015 11:41 AM
To: User
Subject: Does Spark SQL support rollup like HQL

Hi guys,

As we know, hqlContext support rollup like this:

hiveContext.sql("select a, b, sum(c) from t group by a, b with rollup")

And I also knows that dataframe provides rollup function to support it:

dataframe.rollup($"a", $"b").agg(Map("c" -> "sum"))

But in my scenario, I'd better use sql syntax in SqlContext to support rollup 
it seems like what HqlContext does. Any suggestion?

Thanks.

Regards,
Yi Zhang


[jira] [Commented] (SPARK-12196) Store blocks in different speed storage devices by hierarchy way

2015-12-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072634#comment-15072634
 ] 

Cheng Hao commented on SPARK-12196:
---

Thank you, Wei Wu, for supporting this feature!

However, we're trying to avoid changing the existing configuration format, as 
it might impact user applications; besides, in YARN/Mesos this configuration 
key will not work anymore.

An updated PR will be submitted soon; welcome to join the discussion in the PR.

> Store blocks in different speed storage devices by hierarchy way
> 
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is small. HDDs have good capacity, 
> but x2-x3 lower than SSDs.
> How can we get both good?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further. We support a way to build any 
> level hierarchy store access all storage medias (NVM, SSD, HDD etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
> could be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because 
> we support both RDD cache and shuffle and no extra inter process 
> communication.
> *Usage*
> 1. Set the priority and threshold for each layer in 
> spark.storage.hierarchyStore.
> {code}
> spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
> {code}
> It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all 
> the rest form the last layer.
> 2. Configure each layer's location, user just needs put the keyword like 
> "nvm", "ssd", which are specified in step 1, into local dirs, like 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After then, restart your Spark application, it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8360) Streaming DataFrames

2015-12-02 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-8360:
-
Attachment: StreamingDataFrameProposal.pdf

This is a proposal for streaming DataFrames that we were working on; hopefully 
it is helpful for the new design.

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
> Attachments: StreamingDataFrameProposal.pdf
>
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8360) Streaming DataFrames

2015-12-02 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035335#comment-15035335
 ] 

Cheng Hao edited comment on SPARK-8360 at 12/2/15 12:14 PM:


Removed the Google Docs link, as I cannot make it accessible to everyone when 
using the corp account. In the meantime, I put up a PDF doc; hopefully it is helpful.


was (Author: chenghao):
Add some thoughts on StreamingSQL. 
https://docs.google.com/document/u/1/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/pub

Request Edit if you needed.
https://docs.google.com/document/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/edit
 

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
> Attachments: StreamingDataFrameProposal.pdf
>
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8360) Streaming DataFrames

2015-12-01 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035335#comment-15035335
 ] 

Cheng Hao edited comment on SPARK-8360 at 12/2/15 6:19 AM:
---

Add some thoughts on StreamingSQL. 
https://docs.google.com/document/u/1/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/pub

Request Edit if you needed.
https://docs.google.com/document/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/edit
 


was (Author: chenghao):
Add some thoughts on StreamingSQL. 
https://docs.google.com/document/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/edit
 

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8360) Streaming DataFrames

2015-12-01 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035335#comment-15035335
 ] 

Cheng Hao commented on SPARK-8360:
--

Add some thoughts on StreamingSQL. 
https://docs.google.com/document/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/edit
 

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions

2015-11-30 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-12064:
-

 Summary: Make the SqlParser as trait for better integrated with 
extensions
 Key: SPARK-12064
 URL: https://issues.apache.org/jira/browse/SPARK-12064
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao


`SqlParser` is now an object, which makes it hard to reuse in extensions. A proper 
implementation would be to make `SqlParser` a trait, keep all of its 
implementation unchanged, and then add another object called `SqlParser` that 
inherits from the trait.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions

2015-11-30 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao resolved SPARK-12064.
---
Resolution: Won't Fix

DBX has a plan to remove the SqlParser in 2.0.

> Make the SqlParser as trait for better integrated with extensions
> -
>
> Key: SPARK-12064
> URL: https://issues.apache.org/jira/browse/SPARK-12064
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>    Reporter: Cheng Hao
>
> `SqlParser` is now an object, which hard to reuse it in extensions, a proper 
> implementation will be make the `SqlParser` as trait, and keep all of its 
> implementation unchanged, and then add another object called `SqlParser` 
> inherits from the trait.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



RE: new datasource

2015-11-19 Thread Cheng, Hao
I think you probably need to write some code to support ES; there are 2 options 
per my understanding:

Create a new Data Source from scratch (a sketch follows below); you probably 
need to implement the interface at:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L751

Or you can reuse most of the code in ParquetRelation in the new Data Source, but 
you also need to add your own logic; see
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L285
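
For the first option, a minimal sketch of the shape such a relation could take 
(class and option names are made up, and the ElasticSearch query construction is 
only indicated by comments — this is not the actual Parquet data source):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

// Hypothetical provider; if placed in package com.example.esparquet it would be
// reachable via sqlContext.read.format("com.example.esparquet").
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new EsParquetRelation(parameters("path"))(sqlContext)
}

// The pushed-down columns and filters arrive in buildScan; this is the place
// where an ElasticSearch query could be assembled from `filters`.
class EsParquetRelation(path: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", StringType), StructField("value", LongType)))

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Translate `filters` (EqualTo, GreaterThan, ...) into an ES query here,
    // then read only the matching rows from Parquet; placeholder below.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}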

Hope it is helpful.

Hao
From: james.gre...@baesystems.com [mailto:james.gre...@baesystems.com]
Sent: Thursday, November 19, 2015 11:14 PM
To: dev@spark.apache.org
Subject: new datasource



We have written a new Spark DataSource that uses both Parquet and 
ElasticSearch.  It is based on the existing Parquet DataSource.   When I look 
at the filters being pushed down to buildScan I don’t get anything representing 
any filters based on UDFs – or for any fields generated by an explode – I had 
thought if I made it a CatalystScan I would get everything I needed.



This is fine from the Parquet point of view – but we are using ElasticSearch to 
index/filter the data we are searching and I need to be able to capture the UDF 
conditions – or have access to the Plan AST in order that I can construct a 
query for ElasticSearch.



I am thinking I might just need to patch Spark to do this – but I’d prefer not 
to if there is a way of getting around this without hacking the core code.  Any 
ideas?



Thanks



James


Please consider the environment before printing this email. This message should 
be regarded as confidential. If you have received this email in error please 
notify the sender and destroy it immediately. Statements of intent shall only 
become binding when confirmed in hard copy by an authorised signatory. The 
contents of this email may relate to dealings with other companies under the 
control of BAE Systems Applied Intelligence Limited, details of which can be 
found at http://www.baesystems.com/Businesses/index.htm.


RE: A proposal for Spark 2.0

2015-11-12 Thread Cheng, Hao
I am not sure what the best practice is for this specific problem, but it’s really 
worth thinking about in 2.0, as it is a painful issue for lots of users.

By the way, is it also an opportunity to deprecate the RDD API (or the internal API 
only)? Lots of its functionality overlaps with DataFrame or DataSet.

Hao

From: Kostas Sakellis [mailto:kos...@cloudera.com]
Sent: Friday, November 13, 2015 5:27 AM
To: Nicholas Chammas
Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
Subject: Re: A proposal for Spark 2.0

I know we want to keep breaking changes to a minimum but I'm hoping that with 
Spark 2.0 we can also look at better classpath isolation with user programs. I 
propose we build on spark.{driver|executor}.userClassPathFirst, setting it true 
by default, and not allow any spark transitive dependencies to leak into user 
code. For backwards compatibility we can have a whitelist if we want but I'd be 
good if we start requiring user apps to explicitly pull in all their 
dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas 
> wrote:

With regards to Machine learning, it would be great to move useful features 
from MLlib to ML and deprecate the former. Current structure of two separate 
machine learning packages seems to be somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX 
and switch to Dataframe. This will allow GraphX evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some things in 
2.0 without removing or replacing them immediately. That way 2.0 doesn’t have 
to wait for everything that we want to deprecate to be replaced all at once.

Nick
​

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander 
> wrote:
Parameter Server is a new feature and thus does not match the goal of 2.0 is 
“to fix things that are broken in the current API and remove certain deprecated 
APIs”. At the same time I would be happy to have that feature.

With regards to Machine learning, it would be great to move useful features 
from MLlib to ML and deprecate the former. Current structure of two separate 
machine learning packages seems to be somewhat confusing.
With regards to GraphX, it would be great to deprecate the use of RDD in GraphX 
and switch to Dataframe. This will allow GraphX evolve with Tungsten.

Best regards, Alexander

From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
Sent: Thursday, November 12, 2015 7:28 AM
To: wi...@qq.com
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Being specific to Parameter Server, I think the current agreement is that PS 
shall exist as a third-party library instead of a component of the core code 
base, isn’t?

Best,

--
Nan Zhu
http://codingcat.me


On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com 
wrote:
Who has the idea of machine learning? Spark missing some features for machine 
learning, For example, the parameter server.


在 2015年11月12日,05:32,Matei Zaharia 
> 写道:

I like the idea of popping out Tachyon to an optional component too to reduce 
the number of dependencies. In the future, it might even be useful to do this 
for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't 
think we need to block 2.0 on that because it can be added later too. Has 
anyone investigated what it would take to run on there? I imagine we don't need 
many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as 
undisruptive as possible in the model Reynold proposed. Keeping everyone 
working with the same set of releases is super important.

Matei

On Nov 11, 2015, at 4:58 AM, Sean Owen 
> wrote:

On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin 
> wrote:
to the Spark community. A major release should not be very different from a
minor release and should not be gated based on new features. The main
purpose of a major release is an opportunity to fix things that are broken
in the current API and remove certain deprecated APIs (examples follow).

Agree with this stance. Generally, a major release might also be a
time to replace some big old API or implementation with a new one, but
I don't see obvious candidates.

I wouldn't mind turning attention to 2.x sooner than later, unless
there's a fairly good reason to continue adding features in 1.x to a
1.7 release. The scope as of 1.6 is already pretty darned big.


1. Scala 2.11 as the default build. We should still 

RE: A proposal for Spark 2.0

2015-11-12 Thread Cheng, Hao
Agreed, more features/APIs/optimizations need to be added to DF/DS.

I mean, we need to think about what kind of RDD APIs we have to provide to 
developers; maybe the fundamental APIs are enough, like ShuffledRDD etc. 
But PairRDDFunctions is probably not in this category, as we can do the same thing 
easily with DF/DS, with even better performance.
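
For instance, a tiny sketch of a typical PairRDDFunctions aggregation next to its 
DataFrame equivalent (the Record type is made up; assumes a spark-shell style sc 
and sqlContext in scope):

import org.apache.spark.sql.functions.sum

case class Record(key: String, amount: Long)
val records = sc.parallelize(Seq(Record("a", 1L), Record("a", 2L), Record("b", 3L)))

// RDD style, via PairRDDFunctions
val totalsRdd = records.map(r => (r.key, r.amount)).reduceByKey(_ + _)

// Same aggregation with the DataFrame API, which goes through Catalyst/Tungsten
import sqlContext.implicits._
val totalsDf = records.toDF().groupBy("key").agg(sum("amount"))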

From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: Friday, November 13, 2015 11:23 AM
To: Stephen Boesch
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Hmmm... to me, that seems like precisely the kind of thing that argues for 
retaining the RDD API but not as the first thing presented to new Spark 
developers: "Here's how to use groupBy with DataFrames Until the optimizer 
is more fully developed, that won't always get you the best performance that 
can be obtained.  In these particular circumstances, ..., you may want to use 
the low-level RDD API while setting preservesPartitioning to true.  Like 
this"

On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch 
<java...@gmail.com<mailto:java...@gmail.com>> wrote:
My understanding is that  the RDD's presently have more support for complete 
control of partitioning which is a key consideration at scale.  While 
partitioning control is still piecemeal in  DF/DS  it would seem premature to 
make RDD's a second-tier approach to spark dev.

An example is the use of groupBy when we know that the source relation (/RDD) 
is already partitioned on the grouping expressions.  AFAIK the spark sql still 
does not allow that knowledge to be applied to the optimizer - so a full 
shuffle will be performed. However in the native RDD we can use 
preservesPartitioning=true.

2015-11-12 17:42 GMT-08:00 Mark Hamstra 
<m...@clearstorydata.com<mailto:m...@clearstorydata.com>>:
The place of the RDD API in 2.0 is also something I've been wondering about.  I 
think it may be going too far to deprecate it, but changing emphasis is 
something that we might consider.  The RDD API came well before DataFrames and 
DataSets, so programming guides, introductory how-to articles and the like 
have, to this point, also tended to emphasize RDDs -- or at least to deal with 
them early.  What I'm thinking is that with 2.0 maybe we should overhaul all 
the documentation to de-emphasize and reposition RDDs.  In this scheme, 
DataFrames and DataSets would be introduced and fully addressed before RDDs.  
They would be presented as the normal/default/standard way to do things in 
Spark.  RDDs, in contrast, would be presented later as a kind of lower-level, 
closer-to-the-metal API that can be used in atypical, more specialized contexts 
where DataFrames or DataSets don't fully fit.

On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
I am not sure what the best practice for this specific problem, but it’s really 
worth to think about it in 2.0, as it is a painful issue for lots of users.

By the way, is it also an opportunity to deprecate the RDD API (or internal API 
only?)? As lots of its functionality overlapping with DataFrame or DataSet.

Hao

From: Kostas Sakellis [mailto:kos...@cloudera.com<mailto:kos...@cloudera.com>]
Sent: Friday, November 13, 2015 5:27 AM
To: Nicholas Chammas
Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com<mailto:wi...@qq.com>; 
dev@spark.apache.org<mailto:dev@spark.apache.org>; Reynold Xin

Subject: Re: A proposal for Spark 2.0

I know we want to keep breaking changes to a minimum but I'm hoping that with 
Spark 2.0 we can also look at better classpath isolation with user programs. I 
propose we build on spark.{driver|executor}.userClassPathFirst, setting it true 
by default, and not allow any spark transitive dependencies to leak into user 
code. For backwards compatibility we can have a whitelist if we want but I'd be 
good if we start requiring user apps to explicitly pull in all their 
dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas 
<nicholas.cham...@gmail.com<mailto:nicholas.cham...@gmail.com>> wrote:

With regards to Machine learning, it would be great to move useful features 
from MLlib to ML and deprecate the former. Current structure of two separate 
machine learning packages seems to be somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX 
and switch to Dataframe. This will allow GraphX evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some things in 
2.0 without removing or replacing them immediately. That way 2.0 doesn’t have 
to wait for everything that we want to deprecate to be replaced all at once.

Nick
​

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander 
<alexander.ula...@hpe.com<mailto:alexander.ula...@hpe.com>> wrote:
Parameter Server is a new feature and thus does not match 

[jira] [Commented] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type

2015-11-11 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001423#comment-15001423
 ] 

Cheng Hao commented on SPARK-10865:
---

1.5.2 is released; I am not sure whether this is part of it now or not.

> [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type
> ---
>
> Key: SPARK-10865
> URL: https://issues.apache.org/jira/browse/SPARK-10865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> As per ceil/ceiling definition,it should get BIGINT return value
> -ceil(DOUBLE a), ceiling(DOUBLE a)
> -Returns the minimum BIGINT value that is equal to or greater than a.
> But in current Spark implementation, it got wrong value type.
> e.g., 
> select ceil(2642.12) from udf_test_web_sales limit 1;
> 2643.0
> In hive implementation, it got return value type like below:
> hive> select ceil(2642.12) from udf_test_web_sales limit 1;
> OK
> 2643



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type

2015-11-11 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001422#comment-15001422
 ] 

Cheng Hao commented on SPARK-10865:
---

We actually follow Hive's behavior here, and I also tested it in MySQL; it works 
in the same way.

> [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type
> ---
>
> Key: SPARK-10865
> URL: https://issues.apache.org/jira/browse/SPARK-10865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> As per ceil/ceiling definition,it should get BIGINT return value
> -ceil(DOUBLE a), ceiling(DOUBLE a)
> -Returns the minimum BIGINT value that is equal to or greater than a.
> But in current Spark implementation, it got wrong value type.
> e.g., 
> select ceil(2642.12) from udf_test_web_sales limit 1;
> 2643.0
> In hive implementation, it got return value type like below:
> hive> select ceil(2642.12) from udf_test_web_sales limit 1;
> OK
> 2643



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



RE: Sort Merge Join from the filesystem

2015-11-09 Thread Cheng, Hao
Yes, we definitely need to think about how to handle this case; it is probably even 
more common than the case where both tables are sorted/partitioned. Can you jump to 
the JIRA and leave a comment there?
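
As a sketch of the scenario Alex describes below — a freshly generated dataset B 
being lined up with an on-disk dataset A — the key step would look roughly like 
this (the key type and partition count are illustrative and must match how A was 
written):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(128)          // must match dataset A's layout on disk

// Dataset B, generated during the run and keyed the same way as A
val b = sc.parallelize(Seq(("k1", 1L), ("k2", 2L)))

// One shuffle that both partitions and sorts B so it aligns with A's partitions
val bPrepared = b.repartitionAndSortWithinPartitions(partitioner)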

From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com]
Sent: Tuesday, November 10, 2015 3:03 AM
To: Cheng, Hao
Cc: Reynold Xin; dev@spark.apache.org
Subject: Re: Sort Merge Join from the filesystem

Thanks for creating that ticket.

Another thing I was thinking of, is doing this type of join between dataset A 
which is already partitioned/sorted on disk and dataset B, which gets generated 
during the run of the application.

Dataset B would need something like repartitionAndSortWithinPartitions to be 
performed on it, using the same partitioner that was used with dataset A. Then 
dataset B could be joined with dataset A without needing to write it to disk 
first (unless it's too big to fit in memory, then it would need to be 
[partially] spilled).

On Wed, Nov 4, 2015 at 7:51 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
Yes, we probably need more changes to the data source API if we want to 
implement it in a generic way.
BTW, I created the JIRA by copying most of the words from Alex. ☺

https://issues.apache.org/jira/browse/SPARK-11512


From: Reynold Xin [mailto:r...@databricks.com<mailto:r...@databricks.com>]
Sent: Thursday, November 5, 2015 1:36 AM
To: Alex Nastetsky
Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: Sort Merge Join from the filesystem

It's not supported yet, and not sure if there is a ticket for it. I don't think 
there is anything fundamentally hard here either.


On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky 
<alex.nastet...@vervemobile.com<mailto:alex.nastet...@vervemobile.com>> wrote:
(this is kind of a cross-post from the user list)

Does Spark support doing a sort merge join on two datasets on the file system 
that have already been partitioned the same with the same number of partitions 
and sorted within each partition, without needing to repartition/sort them 
again?

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)

If this is not supported in Spark, is a ticket already open for it? Does the 
Spark architecture present unique difficulties to having this feature?

It is very useful to have this ability, as you can prepare dataset A to be 
joined with dataset B before B even exists, by pre-processing A with a 
partition/sort.

Thanks.




RE: dataframe slow down with tungsten turn on

2015-11-05 Thread Cheng, Hao
What’s the size of the raw data and the result data? Are there any other 
changes, like HDFS or Spark configuration, your own code, etc., besides the Spark 
binary? Can you monitor the IO/CPU state while executing the final stage? It would 
also be great if you could paste the call stack if you observe high CPU utilization.

And can you try not to cache anything and repeat the same steps? Just to be sure 
it’s not caused by the memory side of things.

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Friday, November 6, 2015 12:18 AM
To: dev@spark.apache.org
Subject: Fwd: dataframe slow down with tungsten turn on


-- Forwarded message --
From: gen tang <gen.tan...@gmail.com<mailto:gen.tan...@gmail.com>>
Date: Fri, Nov 6, 2015 at 12:14 AM
Subject: Re: dataframe slow down with tungsten turn on
To: "Cheng, Hao" <hao.ch...@intel.com<mailto:hao.ch...@intel.com>>

Hi,

My application is as follows:
1. create dataframe from hive table
2. transform dataframe to rdd of json and do some aggregations on json (in 
fact, I use pyspark, so it is rdd of dict)
3. retransform rdd of json to dataframe and cache it (triggered by count)
4. join several dataframe which is created by the above steps.
5. save final dataframe as json.(by dataframe write api)

There are a lot of stages, other stage is quite the same under two version of 
spark. However, the final step (save as json) is 1 min vs. 2 hour. In my 
opinion, I think it is writing to hdfs cause the slowness of final stage. 
However, I don't know why...

In fact, I make a mistake about the version of spark that I used. The spark 
which runs faster is build on source code of spark 1.4.1. The spark which runs 
slower is build on source code of spark 1.5.2, 2 days ago.

Any idea? Thanks a lot.

Cheers
Gen


On Thu, Nov 5, 2015 at 1:01 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
BTW, 1 min V.S. 2 Hours, seems quite weird, can you provide more information on 
the ETL work?

From: Cheng, Hao [mailto:hao.ch...@intel.com<mailto:hao.ch...@intel.com>]
Sent: Thursday, November 5, 2015 12:56 PM
To: gen tang; dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: RE: dataframe slow down with tungsten turn on

1.5 has critical performance / bug issues, you’d better try 1.5.1 or 1.5.2rc 
version.

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Thursday, November 5, 2015 12:43 PM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Fwd: dataframe slow down with tungsten turn on

Hi,

In fact, I tested the same code with spark 1.5 with tungsten turning off. The 
result is quite the same as tungsten turning on.
It seems that it is not the problem of tungsten, it is simply that spark 1.5 is 
slower than spark 1.4.

Is there any idea about why it happens?
Thanks a lot in advance

Cheers
Gen


-- Forwarded message --
From: gen tang <gen.tan...@gmail.com<mailto:gen.tan...@gmail.com>>
Date: Wed, Nov 4, 2015 at 3:54 PM
Subject: dataframe slow down with tungsten turn on
To: "u...@spark.apache.org<mailto:u...@spark.apache.org>" 
<u...@spark.apache.org<mailto:u...@spark.apache.org>>
Hi sparkers,

I am using dataframe to do some large ETL jobs.
More precisely, I create dataframe from HIVE table and do some operations. And 
then I save it as json.

When I used spark-1.4.1, the whole process is quite fast, about 1 mins. 
However, when I use the same code with spark-1.5.1(with tungsten turn on), it 
takes a about 2 hours to finish the same job.

I checked the detail of tasks, almost all the time is consumed by computation.
Any idea about why this happens?

Thanks a lot in advance for your help.

Cheers
Gen






RE: Rule Engine for Spark

2015-11-04 Thread Cheng, Hao
Or try Streaming SQL, which is a simple layer on top of Spark Streaming. ☺

https://github.com/Intel-bigdata/spark-streamingsql


From: Cassa L [mailto:lcas...@gmail.com]
Sent: Thursday, November 5, 2015 8:09 AM
To: Adrian Tanase
Cc: Stefano Baghino; user
Subject: Re: Rule Engine for Spark

Thanks for the reply. How about Drools? Does it work with Spark?

LCassa

On Wed, Nov 4, 2015 at 3:02 AM, Adrian Tanase 
> wrote:
Another way to do it is to extract your filters as SQL code and load it in a 
transform – which allows you to change the filters at runtime.

Inside the transform you could apply the filters by going RDD -> DF -> SQL -> 
RDD.

Lastly, depending on how complex your filters are, you could skip SQL and 
create your own mini-DSL that runs inside transform. I’d definitely start here 
if the filter predicates are simple enough…

-adrian

From: Stefano Baghino
Date: Wednesday, November 4, 2015 at 10:15 AM
To: Cassa L
Cc: user
Subject: Re: Rule Engine for Spark

Hi LCassa,
unfortunately I don't have actual experience on this matter, however for a 
similar use case I have briefly evaluated 
Decision (then called literally Streaming 
CEP Engine) and it looked interesting. I hope it may help.

On Wed, Nov 4, 2015 at 1:42 AM, Cassa L 
> wrote:
Hi,
 Has anyone used rule engine with spark streaming? I have a case where data is 
streaming from Kafka and I need to apply some rules on it (instead of hard 
coding in a code).
Thanks,
LCassa



--
BR,
Stefano Baghino
Software Engineer @ Radicalbit



RE: Sort Merge Join from the filesystem

2015-11-04 Thread Cheng, Hao
Yes, we probably need more changes to the data source API if we want to 
implement it in a generic way.
BTW, I created the JIRA by copying most of the words from Alex. ☺

https://issues.apache.org/jira/browse/SPARK-11512


From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, November 5, 2015 1:36 AM
To: Alex Nastetsky
Cc: dev@spark.apache.org
Subject: Re: Sort Merge Join from the filesystem

It's not supported yet, and not sure if there is a ticket for it. I don't think 
there is anything fundamentally hard here either.


On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky 
> wrote:
(this is kind of a cross-post from the user list)

Does Spark support doing a sort merge join on two datasets on the file system 
that have already been partitioned the same with the same number of partitions 
and sorted within each partition, without needing to repartition/sort them 
again?

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)

If this is not supported in Spark, is a ticket already open for it? Does the 
Spark architecture present unique difficulties to having this feature?

It is very useful to have this ability, as you can prepare dataset A to be 
joined with dataset B before B even exists, by pre-processing A with a 
partition/sort.

Thanks.



[jira] [Created] (SPARK-11512) Bucket Join

2015-11-04 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-11512:
-

 Summary: Bucket Join
 Key: SPARK-11512
 URL: https://issues.apache.org/jira/browse/SPARK-11512
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao


Sort merge join on two datasets on the file system that have already been 
partitioned the same way, with the same number of partitions, and sorted within 
each partition, so that we don't need to sort them again when joining on the 
sorted/partitioned keys.

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11512) Bucket Join

2015-11-04 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990867#comment-14990867
 ] 

Cheng Hao commented on SPARK-11512:
---

Oh, yes, but SPARK-5292 is only about supporting Hive buckets; to do this in a 
generic way, we need to add bucket support to the Data Source API. Anyway, I 
will add a link to that JIRA issue.

> Bucket Join
> ---
>
> Key: SPARK-11512
> URL: https://issues.apache.org/jira/browse/SPARK-11512
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Cheng Hao
>
> Sort merge join on two datasets on the file system that have already been 
> partitioned the same with the same number of partitions and sorted within 
> each partition, and we don't need to sort it again while join with the 
> sorted/partitioned keys
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11512) Bucket Join

2015-11-04 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990868#comment-14990868
 ] 

Cheng Hao commented on SPARK-11512:
---

We need to support the "bucket" concept in the Data Source API.

> Bucket Join
> ---
>
> Key: SPARK-11512
> URL: https://issues.apache.org/jira/browse/SPARK-11512
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> Sort merge join on two datasets on the file system that have already been 
> partitioned the same with the same number of partitions and sorted within 
> each partition, and we don't need to sort it again while join with the 
> sorted/partitioned keys
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



RE: dataframe slow down with tungsten turn on

2015-11-04 Thread Cheng, Hao
BTW, 1 minute vs. 2 hours seems quite weird; can you provide more information
on the ETL work?

From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Thursday, November 5, 2015 12:56 PM
To: gen tang; dev@spark.apache.org
Subject: RE: dataframe slow down with tungsten turn on

Spark 1.5.0 has critical performance/bug issues; you'd better try the 1.5.1 or
1.5.2 RC version.

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Thursday, November 5, 2015 12:43 PM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Fwd: dataframe slow down with tungsten turn on

Hi,

In fact, I tested the same code on Spark 1.5 with Tungsten turned off. The
result is about the same as with Tungsten turned on.
It seems this is not a problem with Tungsten; Spark 1.5 is simply slower than
Spark 1.4 for this job.

Any idea why this happens?
Thanks a lot in advance

Cheers
Gen


-- Forwarded message --
From: gen tang <gen.tan...@gmail.com<mailto:gen.tan...@gmail.com>>
Date: Wed, Nov 4, 2015 at 3:54 PM
Subject: dataframe slow down with tungsten turn on
To: "u...@spark.apache.org<mailto:u...@spark.apache.org>" 
<u...@spark.apache.org<mailto:u...@spark.apache.org>>
Hi sparkers,

I am using DataFrames to do some large ETL jobs.
More precisely, I create a DataFrame from a Hive table, apply some operations,
and then save the result as JSON.
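
For context, the job has roughly this shape (a minimal sketch, assuming an
existing HiveContext named sqlContext; the table name, filter, and output path
are placeholders):

import sqlContext.implicits._

val df = sqlContext.table("source_hive_table")      // read from Hive
val transformed = df
  .filter($"event_date" === "2015-11-01")           // stand-in for the real operations
  .select("id", "payload")
transformed.write.json("/tmp/etl_output")           // save as JSON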

When I used Spark 1.4.1, the whole process was quite fast, about 1 minute.
However, when I run the same code with Spark 1.5.1 (with Tungsten turned on),
it takes about 2 hours to finish the same job.

I checked the task details; almost all of the time is spent in computation.
Any idea about why this happens?

Thanks a lot in advance for your help.

Cheers
Gen




RE: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Cheng, Hao
Probably 2 reasons:

1.  HadoopFsRelation was introduced in 1.4, but it seems CsvRelation was
created based on 1.3.

2.  HadoopFsRelation introduces the concept of Partition, which is probably
not necessary for LibSVMRelation.

But I think it will be easy to change them to extend HadoopFsRelation.

Hao

From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Thursday, November 5, 2015 10:31 AM
To: dev@spark.apache.org
Subject: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?


Not sure of the reason; it seems LibSVMRelation and CsvRelation could extend
HadoopFsRelation and leverage its features. Is there any other consideration
behind that?


--
Best Regards

Jeff Zhang


RE: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Cheng, Hao
I think you're right; we do leave room for developers to make mistakes while
implementing a new data source.

Here we assume that a new relation MUST NOT extend more than one of the traits
CatalystScan, TableScan, PrunedScan, PrunedFilteredScan, etc.; otherwise it
will cause the problem you described. Probably we can add an additional
checking/reporting rule for that misuse.


From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Thursday, November 5, 2015 1:55 PM
To: Cheng, Hao
Cc: dev@spark.apache.org
Subject: Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

Thanks Hao. I have already made it extend HadoopFsRelation and it works. I will
create a JIRA for that.

Besides that, I noticed that in DataSourceStrategy, Spark builds the physical
plan by pattern matching on the trait of the BaseRelation (e.g. CatalystScan,
TableScan, HadoopFsRelation). That means the order of the cases matters. I
think it is risky, because one BaseRelation shouldn't extend two or more of
these traits, yet there is nothing that restricts it from doing so. Maybe these
traits need to be cleaned up and reorganized, otherwise users may hit some
weird issues when developing a new data source.
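
To illustrate the concern, here is a simplified, self-contained Scala sketch
(toy traits, not the actual DataSourceStrategy code) showing how the match
order silently decides which strategy is used when a relation mixes in more
than one scan trait:

// Toy stand-ins for the real scan traits.
trait TableScan
trait PrunedScan

// A relation that (incorrectly) mixes in both.
class MyRelation extends TableScan with PrunedScan

// The order of the cases decides the outcome; the second case is unreachable
// for MyRelation, and nothing flags the double inheritance.
def plan(relation: AnyRef): String = relation match {
  case _: PrunedScan => "planned via PrunedScan"
  case _: TableScan  => "planned via TableScan"
  case _             => "no matching strategy"
}

println(plan(new MyRelation))   // prints "planned via PrunedScan"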



On Thu, Nov 5, 2015 at 1:16 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
Probably 2 reasons:

1.  HadoopFsRelation was introduced since 1.4, but seems CsvRelation was 
created based on 1.3

2.  HadoopFsRelation introduces the concept of Partition, which probably 
not necessary for LibSVMRelation.

But I think it will be easy to change as extending from HadoopFsRelation.

Hao

From: Jeff Zhang [mailto:zjf...@gmail.com<mailto:zjf...@gmail.com>]
Sent: Thursday, November 5, 2015 10:31 AM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?


Not sure the reason,  it seems LibSVMRelation and CsvRelation can extends 
HadoopFsRelation and leverage the features from HadoopFsRelation.  Any other 
consideration for that ?


--
Best Regards

Jeff Zhang



--
Best Regards

Jeff Zhang


RE: Sort Merge Join

2015-11-02 Thread Cheng, Hao
No as far as I can tell, @Michael @YinHuai @Reynold , any comments on this 
optimization?

From: Jonathan Coveney [mailto:jcove...@gmail.com]
Sent: Tuesday, November 3, 2015 4:17 AM
To: Alex Nastetsky
Cc: Cheng, Hao; user
Subject: Re: Sort Merge Join

Additionally, I'm curious whether there are any JIRAs around making DataFrames
support ordering better. There are a lot of operations that can be optimized if
you know that you have a total ordering on your data... are there any plans, or
at least JIRAs, around having the Catalyst optimizer handle this case?

2015-11-02 9:39 GMT-05:00 Alex Nastetsky 
<alex.nastet...@vervemobile.com<mailto:alex.nastet...@vervemobile.com>>:
Thanks for the response.

Taking the file system based data source as “UnknownPartitioning”, will be a 
simple and SAFE way for JOIN, as it’s hard to guarantee the records from 
different data sets with the identical join keys will be loaded by the same 
node/task , since lots of factors need to be considered, like task pool size, 
cluster size, source format, storage, data locality etc.,.
I’ll agree it’s worth to optimize it for performance concerns, and actually in 
Hive, it is called bucket join. I am not sure will that happens soon in Spark 
SQL.

Yes, this is supported in

  *   Hive with bucket join
  *   Pig with USING 
"merge"<https://pig.apache.org/docs/r0.15.0/perf.html#merge-joins>
  *   MR with CompositeInputFormat
But I guess it's not supported in Spark?

On Mon, Nov 2, 2015 at 12:32 AM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For 
example, in the code below, the two datasets have different number of 
partitions, but it still does a SortMerge join after a "hashpartitioning".

[Hao:] A distributed JOIN operation (either HashBased or SortBased Join) 
requires the records with the identical join keys MUST BE shuffled to the same 
“reducer” node / task, hashpartitioning is just a strategy to tell spark 
shuffle service how to achieve that, in theory, we even can use the 
`RangePartitioning` instead (but it’s less efficient, that’s why we don’t 
choose it for JOIN). So conceptually the JOIN operator doesn’t care so much 
about the shuffle strategy so much if it satisfies the demand on data 
distribution.

2) If both datasets have already been previously partitioned/sorted the same 
and stored on the file system (e.g. in a previous job), is there a way to tell 
Spark this so that it won't want to do a "hashpartitioning" on them? It looks 
like Spark just considers datasets that have been just read from the the file 
system to have UnknownPartitioning. In the example below, I try to join a 
dataframe to itself, and it still wants to hash repartition.

[Hao:] Take this as example:

EXPLAIN SELECT a.value, b.value, c.value FROM src a JOIN src b ON a.key=b.key 
JOIN src c ON b.key=c.key

== Physical Plan ==
TungstenProject [value#20,value#22,value#24]
SortMergeJoin [key#21], [key#23]
  TungstenSort [key#21 ASC], false, 0
   TungstenProject [key#21,value#22,value#20]
SortMergeJoin [key#19], [key#21]
 TungstenSort [key#19 ASC], false, 0
  TungstenExchange hashpartitioning(key#19,200)
   ConvertToUnsafe
HiveTableScan [key#19,value#20], (MetastoreRelation default, src, 
Some(a))
 TungstenSort [key#21 ASC], false, 0
  TungstenExchange hashpartitioning(key#21,200)
   ConvertToUnsafe
HiveTableScan [key#21,value#22], (MetastoreRelation default, src, 
Some(b))
  TungstenSort [key#23 ASC], false, 0
   TungstenExchange hashpartitioning(key#23,200)
ConvertToUnsafe
 HiveTableScan [key#23,value#24], (MetastoreRelation default, src, Some(c))

There is no hashpartitioning anymore for the RESULT of “FROM src a JOIN src b 
ON a.key=b.key”, as we didn’t change the data distribution after it, so we can 
join another table “JOIN src c ON b.key=c.key” directly, which only require the 
table “c” for repartitioning on “key”.

Taking the file system based data source as “UnknownPartitioning”, will be a 
simple and SAFE way for JOIN, as it’s hard to guarantee the records from 
different data sets with the identical join keys will be loaded by the same 
node/task , since lots of factors need to be considered, like task pool size, 
cluster size, source format, storage, data locality etc.,.
I’ll agree it’s worth to optimize it for performance concerns, and actually in 
Hive, it is called bucket join. I am not sure will that happens soon in Spark 
SQL.

Hao

From: Alex Nastetsky 
[mailto:alex.nastet...@vervemobile.com<mailto:alex.nastet...@vervemobile.com>]
Sent: Monday, November 2, 2015 11:29 AM
To: user
Subject: Sort Merge Join

Hi,

I'm trying to understand SortMergeJoin (SPARK-2213).

1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For 
example, in the code below, the two datasets have different number of 
partitions, but it sti

RE: Sort Merge Join

2015-11-01 Thread Cheng, Hao
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For 
example, in the code below, the two datasets have different number of 
partitions, but it still does a SortMerge join after a "hashpartitioning".

[Hao:] A distributed JOIN operation (either hash-based or sort-based) requires
that records with identical join keys MUST BE shuffled to the same "reducer"
node/task; hashpartitioning is just a strategy that tells the Spark shuffle
service how to achieve that. In theory, we could even use `RangePartitioning`
instead (but it's less efficient, which is why we don't choose it for JOIN).
So conceptually the JOIN operator doesn't care much about the shuffle strategy,
as long as it satisfies the requirement on data distribution.
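
A toy sketch of that point (assuming an existing SparkContext named sc): either
partitioner satisfies the join's only real requirement, namely that equal keys
land in the same partition.

import org.apache.spark.{HashPartitioner, RangePartitioner}

val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))
val right = sc.parallelize(Seq(1 -> "x", 3 -> "y"))

// Hash-based co-partitioning.
val hashed = left.partitionBy(new HashPartitioner(4))
  .join(right.partitionBy(new HashPartitioner(4)))

// Range-based co-partitioning works too, just less efficiently in general.
val ranger = new RangePartitioner(4, left)
val ranged = left.partitionBy(ranger).join(right.partitionBy(ranger))

println(hashed.collect().toSeq)
println(ranged.collect().toSeq)   // same joined pairs either way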

2) If both datasets have already been previously partitioned/sorted the same 
and stored on the file system (e.g. in a previous job), is there a way to tell 
Spark this so that it won't want to do a "hashpartitioning" on them? It looks 
like Spark just considers datasets that have been just read from the the file 
system to have UnknownPartitioning. In the example below, I try to join a 
dataframe to itself, and it still wants to hash repartition.

[Hao:] Take this as example:

EXPLAIN SELECT a.value, b.value, c.value FROM src a JOIN src b ON a.key=b.key 
JOIN src c ON b.key=c.key

== Physical Plan ==
TungstenProject [value#20,value#22,value#24]
SortMergeJoin [key#21], [key#23]
  TungstenSort [key#21 ASC], false, 0
   TungstenProject [key#21,value#22,value#20]
SortMergeJoin [key#19], [key#21]
 TungstenSort [key#19 ASC], false, 0
  TungstenExchange hashpartitioning(key#19,200)
   ConvertToUnsafe
HiveTableScan [key#19,value#20], (MetastoreRelation default, src, 
Some(a))
 TungstenSort [key#21 ASC], false, 0
  TungstenExchange hashpartitioning(key#21,200)
   ConvertToUnsafe
HiveTableScan [key#21,value#22], (MetastoreRelation default, src, 
Some(b))
  TungstenSort [key#23 ASC], false, 0
   TungstenExchange hashpartitioning(key#23,200)
ConvertToUnsafe
 HiveTableScan [key#23,value#24], (MetastoreRelation default, src, Some(c))

There is no hashpartitioning anymore for the RESULT of "FROM src a JOIN src b
ON a.key=b.key", since we don't change the data distribution after it, so we
can join another table ("JOIN src c ON b.key=c.key") directly; only table "c"
needs to be repartitioned on "key".
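
You can check this yourself (a minimal sketch, assuming a HiveContext named
sqlContext and the same src table as above): print the plan and verify that no
extra hashpartitioning appears between the two joins.

val plan = sqlContext.sql(
  "EXPLAIN SELECT a.value, b.value, c.value FROM src a " +
  "JOIN src b ON a.key = b.key JOIN src c ON b.key = c.key")
plan.collect().foreach(row => println(row.getString(0)))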

Treating file-system-based data sources as having "UnknownPartitioning" is a
simple and SAFE choice for JOIN, as it's hard to guarantee that records from
different data sets with identical join keys will be loaded by the same
node/task; lots of factors need to be considered, like task pool size, cluster
size, source format, storage, data locality, etc.
I agree it's worth optimizing for performance reasons, and in Hive this is
actually called a bucket join. I am not sure whether that will happen soon in
Spark SQL.

Hao

From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com]
Sent: Monday, November 2, 2015 11:29 AM
To: user
Subject: Sort Merge Join

Hi,

I'm trying to understand SortMergeJoin (SPARK-2213).

1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For 
example, in the code below, the two datasets have different number of 
partitions, but it still does a SortMerge join after a "hashpartitioning".

CODE:
   val sparkConf = new SparkConf()
  .setAppName("SortMergeJoinTest")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.eventLog.enabled", "true")
  .set("spark.sql.planner.sortMergeJoin","true")

sparkConf.setMaster("local-cluster[3,1,1024]")

val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val inputpath = input.gz.parquet

val df1 = sqlContext.read.parquet(inputpath).repartition(3)
val df2 = sqlContext.read.parquet(inputpath).repartition(5)
val result = df1.join(df2.withColumnRenamed("foo","foo2"), $"foo" === 
$"foo2")
result.explain()

OUTPUT:
== Physical Plan ==
SortMergeJoin [foo#0], [foo2#8]
TungstenSort [foo#0 ASC], false, 0
  TungstenExchange hashpartitioning(foo#0)
  ConvertToUnsafe
Repartition 3, true
Scan 
ParquetRelation[file:input.gz.parquet][foo#0,bar#1L,somefield#2,anotherfield#3]
TungstenSort [foo2#8 ASC], false, 0
  TungstenExchange hashpartitioning(foo2#8)
  TungstenProject [foo#4 AS foo2#8,bar#5L,somefield#6,anotherfield#7]
Repartition 5, true
Scan 
ParquetRelation[file:input.gz.parquet][foo#4,bar#5L,somefield#6,anotherfield#7]

2) If both datasets have already been previously partitioned/sorted the same 
and stored on the file system (e.g. in a previous job), is there a way to tell 
Spark this so that it won't want to do a "hashpartitioning" on them? It looks 
like Spark just considers datasets that have been just read from the the file 

[jira] [Commented] (SPARK-10371) Optimize sequential projections

2015-10-29 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981650#comment-14981650
 ] 

Cheng Hao commented on SPARK-10371:
---

Eliminating the common sub expression within the projection?

> Optimize sequential projections
> ---
>
> Key: SPARK-10371
> URL: https://issues.apache.org/jira/browse/SPARK-10371
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>
> In ML pipelines, each transformer/estimator appends new columns to the input 
> DataFrame. For example, it might produce DataFrames like the following 
> columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), 
> and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c 
> and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used.
> It would be nice to detect this pattern and re-use intermediate values.
> {code}
> val input = sqlContext.range(10)
> val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 
> 2)
> output.explain(true)
> == Parsed Logical Plan ==
> 'Project [*,('x * 2) AS y#254]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Analyzed Logical Plan ==
> id: bigint, x: bigint, y: bigint
> Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Optimized Logical Plan ==
> Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
>  LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Physical Plan ==
> TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS 
> y#254L]
>  Scan PhysicalRDD[id#252L]
> Code Generation: true
> input: org.apache.spark.sql.DataFrame = [id: bigint]
> output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



RE: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-28 Thread Cheng, Hao
Hi Jerry, I've filed a bug in JIRA, along with a fix:

https://issues.apache.org/jira/browse/SPARK-11364

It would be greatly appreciated if you could verify the PR against your case.

Thanks,
Hao

From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Wednesday, October 28, 2015 8:51 AM
To: Jerry Lam; Marcelo Vanzin
Cc: user@spark.apache.org
Subject: RE: [Spark-SQL]: Unable to propagate hadoop configuration after 
SparkContext is initialized

After a quick glance, this seems to be a bug in Spark SQL. Do you mind creating
a JIRA for it? Then I can start fixing it.

Thanks,
Hao

From: Jerry Lam [mailto:chiling...@gmail.com]
Sent: Wednesday, October 28, 2015 3:13 AM
To: Marcelo Vanzin
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: [Spark-SQL]: Unable to propagate hadoop configuration after 
SparkContext is initialized

Hi Marcelo,

I tried setting the properties before instantiating spark context via 
SparkConf. It works fine.
Originally, the code I have read hadoop configurations from hdfs-site.xml which 
works perfectly fine as well.
Therefore, can I conclude that sparkContext.hadoopConfiguration.set("key", 
"value") does not propagate through all SQL jobs within the same SparkContext? 
I haven't try with Spark Core so I cannot tell.

Is there a workaround given it seems to be broken? I need to do this 
programmatically after the SparkContext is instantiated not before...

Best Regards,

Jerry

On Tue, Oct 27, 2015 at 2:30 PM, Marcelo Vanzin 
<van...@cloudera.com<mailto:van...@cloudera.com>> wrote:
If setting the values in SparkConf works, there's probably some bug in
the SQL code; e.g. creating a new Configuration object instead of
using the one in SparkContext. But I'm not really familiar with that
code.

On Tue, Oct 27, 2015 at 11:22 AM, Jerry Lam 
<chiling...@gmail.com<mailto:chiling...@gmail.com>> wrote:
> Hi Marcelo,
>
> Thanks for the advice. I understand that we could set the configurations
> before creating SparkContext. My question is
> SparkContext.hadoopConfiguration.set("key","value") doesn't seem to
> propagate to all subsequent SQLContext jobs. Note that I mentioned I can
> load the parquet file but I cannot perform a count on the parquet file
> because of the AmazonClientException. It means that the credential is used
> during the loading of the parquet but not when we are processing the parquet
> file. How this can happen?
>
> Best Regards,
>
> Jerry
>
>
> On Tue, Oct 27, 2015 at 2:05 PM, Marcelo Vanzin 
> <van...@cloudera.com<mailto:van...@cloudera.com>> wrote:
>>
>> On Tue, Oct 27, 2015 at 10:43 AM, Jerry Lam 
>> <chiling...@gmail.com<mailto:chiling...@gmail.com>> wrote:
>> > Anyone experiences issues in setting hadoop configurations after
>> > SparkContext is initialized? I'm using Spark 1.5.1.
>> >
>> > I'm trying to use s3a which requires access and secret key set into
>> > hadoop
>> > configuration. I tried to set the properties in the hadoop configuration
>> > from sparktcontext.
>> >
>> > sc.hadoopConfiguration.set("fs.s3a.access.key", AWSAccessKeyId)
>> > sc.hadoopConfiguration.set("fs.s3a.secret.key", AWSSecretKey)
>>
>> Try setting "spark.hadoop.fs.s3a.access.key" and
>> "spark.hadoop.fs.s3a.secret.key" in your SparkConf before creating the
>> SparkContext.
>>
>> --
>> Marcelo
>
>

--
Marcelo



[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14979646#comment-14979646
 ] 

Cheng Hao commented on SPARK-11330:
---

[~saif.a.ellafi] I've checked with 1.5.0 and confirmed it can be reproduced;
however, it does not exist in the latest master branch. I am still digging into
when and how it was fixed.

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip, bug_reproduce_50k.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used KRYO serializer when stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14979699#comment-14979699
 ] 

Cheng Hao edited comment on SPARK-11330 at 10/29/15 2:48 AM:
-

OK, seems it's solved in https://issues.apache.org/jira/browse/SPARK-10859, it 
should not be the problem in the 1.5.2 or 1.6.0 any more.


was (Author: chenghao):
OK, seems it's solved in https://issues.apache.org/jira/browse/SPARK-11330, it 
should not be the problem in the 1.5.2 or 1.6.0 any more.

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip, bug_reproduce_50k.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used KRYO serializer when stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14979699#comment-14979699
 ] 

Cheng Hao commented on SPARK-11330:
---

OK, seems it's solved in https://issues.apache.org/jira/browse/SPARK-11330, it 
should not be the problem in the 1.5.2 or 1.6.0 any more.

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip, bug_reproduce_50k.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used KRYO serializer when stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



RE: SparkSQL on hive error

2015-10-27 Thread Cheng, Hao
Hi Anand, can you paste the table creation statement? I'd like to reproduce it
locally first. BTW, which version are you using?

Hao

From: Anand Nalya [mailto:anand.na...@gmail.com]
Sent: Tuesday, October 27, 2015 11:35 PM
To: spark users
Subject: SparkSQL on hive error

Hi,
I have a partitioned table in Hive (Avro) that I can query fine from the Hive
CLI. When using Spark SQL, I'm able to query some of the partitions, but I get
an exception on others.
The query is:
sqlContext.sql("select * from myTable where source='http' and date = 
20150812").take(5).foreach(println)
The exception is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
5, node1): java.lang.IllegalArgumentException: Error: type expected at the 
position 0 of 
'BIGINT:INT:INT:INT:INT:string:INT:string:string:string:string:string:string:string:string:string:string:string:string:string:string:INT:INT:string:BIGINT:string:string:BIGINT:BIGINT:string:string:string:string:string:FLOAT:FLOAT:string:string:string:string:BIGINT:BIGINT:string:string:string:string:string:string:BIGINT:string:string'
 but 'BIGINT' is found.
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:762)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:105)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:191)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:188)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Any pointers on what might be wrong here?

Regards,
Anand



RE: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Cheng, Hao
After a quick glance, this seems to be a bug in Spark SQL. Do you mind creating
a JIRA for it? Then I can start fixing it.

Thanks,
Hao

From: Jerry Lam [mailto:chiling...@gmail.com]
Sent: Wednesday, October 28, 2015 3:13 AM
To: Marcelo Vanzin
Cc: user@spark.apache.org
Subject: Re: [Spark-SQL]: Unable to propagate hadoop configuration after 
SparkContext is initialized

Hi Marcelo,

I tried setting the properties before instantiating spark context via 
SparkConf. It works fine.
Originally, the code I have read hadoop configurations from hdfs-site.xml which 
works perfectly fine as well.
Therefore, can I conclude that sparkContext.hadoopConfiguration.set("key", 
"value") does not propagate through all SQL jobs within the same SparkContext? 
I haven't try with Spark Core so I cannot tell.

Is there a workaround given it seems to be broken? I need to do this 
programmatically after the SparkContext is instantiated not before...

Best Regards,

Jerry

On Tue, Oct 27, 2015 at 2:30 PM, Marcelo Vanzin wrote:
If setting the values in SparkConf works, there's probably some bug in
the SQL code; e.g. creating a new Configuration object instead of
using the one in SparkContext. But I'm not really familiar with that
code.

On Tue, Oct 27, 2015 at 11:22 AM, Jerry Lam wrote:
> Hi Marcelo,
>
> Thanks for the advice. I understand that we could set the configurations
> before creating SparkContext. My question is
> SparkContext.hadoopConfiguration.set("key","value") doesn't seem to
> propagate to all subsequent SQLContext jobs. Note that I mentioned I can
> load the parquet file but I cannot perform a count on the parquet file
> because of the AmazonClientException. It means that the credential is used
> during the loading of the parquet but not when we are processing the parquet
> file. How this can happen?
>
> Best Regards,
>
> Jerry
>
>
> On Tue, Oct 27, 2015 at 2:05 PM, Marcelo Vanzin wrote:
>>
>> On Tue, Oct 27, 2015 at 10:43 AM, Jerry Lam wrote:
>> > Anyone experiences issues in setting hadoop configurations after
>> > SparkContext is initialized? I'm using Spark 1.5.1.
>> >
>> > I'm trying to use s3a which requires access and secret key set into
>> > hadoop
>> > configuration. I tried to set the properties in the hadoop configuration
>> > from sparktcontext.
>> >
>> > sc.hadoopConfiguration.set("fs.s3a.access.key", AWSAccessKeyId)
>> > sc.hadoopConfiguration.set("fs.s3a.secret.key", AWSSecretKey)
>>
>> Try setting "spark.hadoop.fs.s3a.access.key" and
>> "spark.hadoop.fs.s3a.secret.key" in your SparkConf before creating the
>> SparkContext.
>>
>> --
>> Marcelo
>
>


--
Marcelo
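
For reference, a minimal sketch of the SparkConf-based workaround discussed
above (the credentials are read from environment variables here purely as
placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("s3a-read")
  // The spark.hadoop. prefix copies these into the Hadoop Configuration.
  .set("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
  .set("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
val sc = new SparkContext(conf)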



[jira] [Comment Edited] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-27 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977600#comment-14977600
 ] 

Cheng Hao edited comment on SPARK-11330 at 10/28/15 2:28 AM:
-

Hi, [~saif.a.ellafi], I've tried the code like below:
{code}
case class Spark11330(account_id: Int, product: String, vint: String,
  band: String, age: Int, mb: String, mm: String,
  balance: Float, balancec: Float)

test("SPARK-11330: Filter operation on StringType after groupBy PERSISTED 
brings no results") {
withTempPath { f =>
  // generate the more data.
  val d = Seq(
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
1000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
2000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200809", 
2000.0f, 2000.0f),
Spark11330(2, "product2", "2007-01-01", "band3", 29, "mb1", "200809", 
2010.0f, 3000.0f))

  val data = List.tabulate[Seq[Spark11330]](10) { i => d }.flatten

  // save as parquet file in local disk
  sqlContext.sparkContext.parallelize(data, 4)

.toDF().write.format("parquet").mode(SaveMode.Overwrite).save(f.getAbsolutePath)

  // reproduce
  val df = sqlContext.read.parquet(f.getAbsolutePath)

  val f1 = df.groupBy("vint").count().persist().filter("vint = 
'2007-01-01'").first
  val f2 = df.groupBy("vint").count().filter("vint = '2007-01-01'").first

  assert(f1 == f2)

  val res = df
.groupBy("product", "band", "age", "vint", "mb", "mm")
.agg(
  count($"account_id").as("N"),
  sum($"balance").as("balance_eom"),
  sum($"balancec").as("balance_eoc")).persist()

  val c1 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  res.unpersist()
  val c2 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  assert(c1.sameElements(c2))
}
  }
{code}

Seems everything works fine, I am not sure if I missed something, can you try 
to reproduce the issue based on my code?


was (Author: chenghao):
Hi, [~saif.a.ellafi], I've tried the code like below:
{code}
case class Spark11330(account_id: Int, product: String, vint: String,
  band: String, age: Int, mb: String, mm: String,
  balance: Float, balancec: Float)

test("SPARK-11330: Filter operation on StringType after groupBy PERSISTED 
brings no results") {
withTempPath { f =>
  val d = Seq(
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
1000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
2000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200809", 
2000.0f, 2000.0f),
Spark11330(2, "product2", "2007-01-01", "band3", 29, "mb1", "200809", 
2010.0f, 3000.0f))

  val data = List.tabulate[Seq[Spark11330]](10) { i => d }.flatten

  sqlContext.sparkContext.parallelize(data, 4)

.toDF().write.format("parquet").mode(SaveMode.Overwrite).save(f.getAbsolutePath)

  val df = sqlContext.read.parquet(f.getAbsolutePath)

  val f1 = df.groupBy("vint").count().persist().filter("vint = 
'2007-01-01'").first
  val f2 = df.groupBy("vint").count().filter("vint = '2007-01-01'").first

  assert(f1 == f2)

  val res = df
.groupBy("product", "band", "age", "vint", "mb", "mm")
.agg(
  count($"account_id").as("N"),
  sum($"balance").as("balance_eom"),
  sum($"balancec").as("balance_eoc")).persist()

  val c1 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  res.unpersist()
  val c2 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  assert(

[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-27 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977600#comment-14977600
 ] 

Cheng Hao commented on SPARK-11330:
---

Hi, [~saif.a.ellafi], I've tried the code like below:
{code}
case class Spark11330(account_id: Int, product: String, vint: String,
  band: String, age: Int, mb: String, mm: String,
  balance: Float, balancec: Float)

test("SPARK-11330: Filter operation on StringType after groupBy PERSISTED 
brings no results") {
withTempPath { f =>
  val d = Seq(
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
1000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
2000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200809", 
2000.0f, 2000.0f),
Spark11330(2, "product2", "2007-01-01", "band3", 29, "mb1", "200809", 
2010.0f, 3000.0f))

  val data = List.tabulate[Seq[Spark11330]](10) { i => d }.flatten

  sqlContext.sparkContext.parallelize(data, 4)

.toDF().write.format("parquet").mode(SaveMode.Overwrite).save(f.getAbsolutePath)

  val df = sqlContext.read.parquet(f.getAbsolutePath)

  val f1 = df.groupBy("vint").count().persist().filter("vint = 
'2007-01-01'").first
  val f2 = df.groupBy("vint").count().filter("vint = '2007-01-01'").first

  assert(f1 == f2)

  val res = df
.groupBy("product", "band", "age", "vint", "mb", "mm")
.agg(
  count($"account_id").as("N"),
  sum($"balance").as("balance_eom"),
  sum($"balancec").as("balance_eoc")).persist()

  val c1 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  res.unpersist()
  val c2 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  assert(c1.sameElements(c2))
}
  }
{code}

Everything seems to work fine; I am not sure if I missed something. Can you try
to reproduce the issue based on my code?

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used KRYO serializer when stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that d

RE: HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Cheng, Hao
I am not sure if we really want to support that with HiveContext, but a
workaround is to use the Spark package at https://github.com/databricks/spark-csv
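
A minimal sketch of that workaround (assuming the com.databricks:spark-csv
package is on the classpath and a SQLContext named sqlContext; the path is a
placeholder):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // the package skips the header row itself
  .option("inferSchema", "true")
  .load("/data/my_table.csv")
df.show()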


From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Tuesday, October 27, 2015 10:54 AM
To: Daniel Haviv; user
Subject: RE: HiveContext ignores ("skip.header.line.count"="1")

Please open a JIRA?



Date: Mon, 26 Oct 2015 15:32:42 +0200
Subject: HiveContext ignores ("skip.header.line.count"="1")
From: daniel.ha...@veracity-group.com
To: user@spark.apache.org
Hi,
I have a csv table in Hive which is configured to skip the header row using 
TBLPROPERTIES("skip.header.line.count"="1").
When querying from Hive the header row is not included in the data, but when 
running the same query via HiveContext I get the header row.

I made sure that HiveContext sees the skip.header.line.count setting by running 
"show create table"

Any ideas?

Thank you.
Daniel


[jira] [Updated] (SPARK-9735) Auto infer partition schema of HadoopFsRelation should should respected the user specified one

2015-10-20 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-9735:
-
Description: 
This code is copied from the hadoopFsRelationSuite.scala

{code}
partitionedTestDF = (for {
i <- 1 to 3
p2 <- Seq("foo", "bar")
  } yield (i, s"val_$i", 1, p2)).toDF("a", "b", "p1", "p2")

withTempPath { file =>
  val input = partitionedTestDF.select('a, 'b, 
'p1.cast(StringType).as('ps), 'p2)

  input
.write
.format(dataSourceName)
.mode(SaveMode.Overwrite)
.partitionBy("ps", "p2")
.saveAsTable("t")

  input
.write
.format(dataSourceName)
.mode(SaveMode.Append)
.partitionBy("ps", "p2")
.saveAsTable("t")

  val realData = input.collect()
  withTempTable("t") {
checkAnswer(sqlContext.table("t"), realData ++ realData)
  }
}

java.lang.ClassCastException: java.lang.Integer cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
07:44:01.344 ERROR org.apache.spark.executor.Executor: Exception in task 14.0 
in stage 3.0 (TID 206)
java.lang.ClassCastException: java.lang.Integer cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at

[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-10-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958524#comment-14958524
 ] 

Cheng Hao commented on SPARK-4226:
--

[~nadenf] Actually I am working on it right now, and the first PR is ready. It
would be greatly appreciated if you could try
https://github.com/apache/spark/pull/9055 in your local testing; let me know if
you find any problems or bugs.

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = HiveContext(hc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near, future release (1.3, maybe?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11076) Decimal Support for Ceil/Floor

2015-10-12 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-11076:
-

 Summary: Decimal Support for Ceil/Floor
 Key: SPARK-11076
 URL: https://issues.apache.org/jira/browse/SPARK-11076
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao


Currently, ceil & floor don't support decimals, but Hive does.
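
For example, a query of roughly this shape should work once the support is
added (a minimal sketch; the literals are arbitrary):

{code}
// Assuming a SQLContext/HiveContext named sqlContext.
sqlContext.sql(
  "SELECT ceil(cast(1.5 as decimal(10, 2))), floor(cast(-1.5 as decimal(10, 2)))"
).show()
{code}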



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



RE: Hive with apache spark

2015-10-11 Thread Cheng, Hao
One option is to read the data via JDBC; however, that's probably the worst
option, as you would likely need some hacky work to enable parallel reading in
Spark SQL.
Another option is to copy the hive-site.xml of your Hive server to
$SPARK_HOME/conf; then Spark SQL will see everything that the Hive server does,
and you can load the Hive tables as needed.
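
A minimal sketch of the second option (the table name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// With hive-site.xml copied into $SPARK_HOME/conf, the HiveContext talks to
// the same metastore that the Hive server uses.
val sc = new SparkContext(new SparkConf().setAppName("read-hive-table"))
val hiveContext = new HiveContext(sc)
hiveContext.sql("SELECT * FROM my_db.my_table LIMIT 10").show()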


-Original Message-
From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com] 
Sent: Monday, October 12, 2015 1:43 AM
To: user@spark.apache.org
Subject: Hive with apache spark

Hi

how can we read data from an external Hive server? The Hive server is running
and I want to read the data remotely using Spark. Is there any example?


thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hive-with-apache-spark-tp25020.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Saprk 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Cheng, Hao
A join B join C === (A join B) join C
Semantically they are equivalent, right?
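
A minimal sketch (assuming a SQLContext named sqlContext; the data and column
names are toy placeholders):

import sqlContext.implicits._

val dfA = Seq((1, "a")).toDF("key", "va")
val dfB = Seq((1, "b")).toDF("key", "vb")
val dfC = Seq((1, "c")).toDF("key", "vc")

// Chain two binary joins to join the three DataFrames.
val joined = dfA.join(dfB, "key").join(dfC, "key")
joined.show()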

From: Richard Eggert [mailto:richard.egg...@gmail.com]
Sent: Monday, October 12, 2015 5:12 AM
To: Subhajit Purkayastha
Cc: User
Subject: Re: Saprk 1.5 - How to join 3 RDDs in a SQL DF?


It's the same as joining 2. Join two together, and then join the third one to 
the result of that.
On Oct 11, 2015 2:57 PM, "Subhajit Purkayastha" wrote:
Can I join 3 different RDDs together in a Spark SQL DF? I can find examples for 
2 RDDs but not 3.

Thanks



RE: Join Order Optimization

2015-10-11 Thread Cheng, Hao
Spark SQL supports very basic join reordering optimization, based on the raw 
table data size; this was added a couple of major releases back.

And the “EXPLAIN EXTENDED query” command is a very informative tool to verify 
whether the optimization is taking effect.
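
For example (a minimal sketch, assuming hypothetical tables a, b and c 
registered in the sqlContext):

sqlContext.sql(
  "EXPLAIN EXTENDED SELECT * FROM a JOIN b ON a.k = b.k JOIN c ON b.k = c.k"
).collect().foreach(println)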

From: Raajay [mailto:raaja...@gmail.com]
Sent: Sunday, October 11, 2015 9:22 AM
To: user@spark.apache.org
Subject: Join Order Optimization

Hello,
Does Spark-SQL support join order optimization as of the 1.5.1 release ? From 
the release notes, I did not see support for this feature, but figured will ask 
the users-list to be sure.
Thanks
Raajay


RE: Saprk 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Cheng, Hao
Thank you Ted, that’s very informative; from the DB optimization point of view, 
cost-based join re-ordering and multi-way joins do provide better performance.

But from the API design point of view, two arguments (relations) for JOIN in the 
DF API are probably enough for the multi-table join cases, as we can always 
use nested two-way joins to represent a multi-way join.
For example: A join B join C join D (multi-way join) =>
((A join B) join C) join D
(A join (B join C)) join D
(A join B) join (C join D), etc.
That also means we leave the optimization work to Spark SQL, not to users, 
and we believe Spark SQL can do most of the dirty work for us.

However, sometimes people do write a more optimal SQL query (e.g. with better join 
ordering) than the Spark SQL optimizer produces; in that case we’d better resort to 
the API SqlContext.sql(“”).

Cheers
Hao

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Monday, October 12, 2015 8:37 AM
To: Cheng, Hao
Cc: Richard Eggert; Subhajit Purkayastha; User
Subject: Re: Saprk 1.5 - How to join 3 RDDs in a SQL DF?

Some weekend reading:
http://stackoverflow.com/questions/20022196/are-left-outer-joins-associative

Cheers

On Sun, Oct 11, 2015 at 5:32 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
A join B join C === (A join B) join C
Semantically they are equivalent, right?

From: Richard Eggert 
[mailto:richard.egg...@gmail.com<mailto:richard.egg...@gmail.com>]
Sent: Monday, October 12, 2015 5:12 AM
To: Subhajit Purkayastha
Cc: User
Subject: Re: Saprk 1.5 - How to join 3 RDDs in a SQL DF?


It's the same as joining 2. Join two together, and then join the third one to 
the result of that.
On Oct 11, 2015 2:57 PM, "Subhajit Purkayastha" 
<spurk...@p3si.net<mailto:spurk...@p3si.net>> wrote:
Can I join 3 different RDDs together in a Spark SQL DF? I can find examples for 
2 RDDs but not 3.

Thanks




RE: Join Order Optimization

2015-10-11 Thread Cheng, Hao
Probably you have to read the source code; I am not sure if there are any .ppt 
files or slides.

Hao

From: VJ Anand [mailto:vjan...@sankia.com]
Sent: Monday, October 12, 2015 11:43 AM
To: Cheng, Hao
Cc: Raajay; user@spark.apache.org
Subject: Re: Join Order Optimization

Hi - Is there a design document for those operations that have been implemented 
in 1.4.0? If so, where can I find it?
-VJ

On Sun, Oct 11, 2015 at 7:27 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
Yes, I think SPARK-2211 is the right place to follow the CBO work, but that 
probably will not happen right away.

The JIRA issue that introduced the statistics info can be found at:
https://issues.apache.org/jira/browse/SPARK-2393

Hao

From: Raajay [mailto:raaja...@gmail.com<mailto:raaja...@gmail.com>]
Sent: Monday, October 12, 2015 10:17 AM
To: Cheng, Hao
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Join Order Optimization

Hi Cheng,
Could you point me to the JIRA that introduced this change ?

Also, is this SPARK-2211 the right issue to follow for cost-based optimization?
Thanks
Raajay


On Sun, Oct 11, 2015 at 7:57 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
Spark SQL supports very basic join reordering optimization, based on the raw 
table data size; this was added a couple of major releases back.

And the “EXPLAIN EXTENDED query” command is a very informative tool to verify 
whether the optimization is taking effect.

From: Raajay [mailto:raaja...@gmail.com<mailto:raaja...@gmail.com>]
Sent: Sunday, October 11, 2015 9:22 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Join Order Optimization

Hello,
Does Spark-SQL support join order optimization as of the 1.5.1 release ? From 
the release notes, I did not see support for this feature, but figured will ask 
the users-list to be sure.
Thanks
Raajay




--
VJ Anand
Founder
Sankia
vjan...@sankia.com<mailto:vjan...@sankia.com>
925-640-1340
www.sankia.com<http://www.sankia.com>

Confidentiality Notice: This e-mail message, including any attachments, is for 
the sole use of the intended recipient(s) and may contain confidential and 
privileged information. Any unauthorized review, use, disclosure or 
distribution is prohibited. If you are not the intended recipient, please 
contact the sender by reply e-mail and destroy all copies of the original 
message


RE: Join Order Optimization

2015-10-11 Thread Cheng, Hao
Yes, I think SPARK-2211 is the right place to follow the CBO work, but that 
probably will not happen right away.

The JIRA issue that introduced the statistics info can be found at:
https://issues.apache.org/jira/browse/SPARK-2393

Hao

From: Raajay [mailto:raaja...@gmail.com]
Sent: Monday, October 12, 2015 10:17 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Join Order Optimization

Hi Cheng,
Could you point me to the JIRA that introduced this change ?

Also, is this SPARK-2211 the right issue to follow for cost-based optimization?
Thanks
Raajay


On Sun, Oct 11, 2015 at 7:57 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
Spark SQL supports very basic join reordering optimization, based on the raw 
table data size; this was added a couple of major releases back.

And the “EXPLAIN EXTENDED query” command is a very informative tool to verify 
whether the optimization is taking effect.

From: Raajay [mailto:raaja...@gmail.com<mailto:raaja...@gmail.com>]
Sent: Sunday, October 11, 2015 9:22 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Join Order Optimization

Hello,
Does Spark-SQL support join order optimization as of the 1.5.1 release ? From 
the release notes, I did not see support for this feature, but figured will ask 
the users-list to be sure.
Thanks
Raajay



RE: Insert via HiveContext is slow

2015-10-09 Thread Cheng, Hao
I think the DF API performs the same as the SQL API for multi-inserts, if you 
don't use a cached table.

Hao

From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com]
Sent: Friday, October 9, 2015 3:09 PM
To: Cheng, Hao
Cc: user
Subject: Re: Insert via HiveContext is slow

Thanks Hao.
It seems like one issue.
The other issue to me seems to be the renaming of files at the end of the insert.
Would DF.save perform the task better?

Thanks,
Daniel

On Fri, Oct 9, 2015 at 3:35 AM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
I think that's a known performance issue (compared to Hive) of Spark SQL in 
multi-inserts.
A workaround is to create a temporary cached table for the projection first, and then 
do the multiple inserts based on the cached table.

We are actually working on a POC for some similar cases; hopefully it comes 
out soon.

Hao

From: Daniel Haviv 
[mailto:daniel.ha...@veracity-group.com<mailto:daniel.ha...@veracity-group.com>]
Sent: Friday, October 9, 2015 3:08 AM
To: user
Subject: Re: Insert via HiveContext is slow

Forgot to mention that my insert is a multi table insert :
sqlContext2.sql("""from avro_events
   lateral view explode(usChnlList) usParamLine as usParamLine
   lateral view explode(dsChnlList) dsParamLine as dsParamLine
   insert into table UpStreamParam partition(day_ts, cmtsid)
   select cmtstimestamp,datats,macaddress,
usParamLine['chnlidx'] chnlidx,
usParamLine['modulation'] modulation,
usParamLine['severity'] severity,
usParamLine['rxpower'] rxpower,
usParamLine['sigqnoise'] sigqnoise,
usParamLine['noisedeviation'] noisedeviation,
usParamLine['prefecber'] prefecber,
usParamLine['postfecber'] postfecber,
usParamLine['txpower'] txpower,
usParamLine['txpowerdrop'] txpowerdrop,
usParamLine['nmter'] nmter,
usParamLine['premtter'] premtter,
usParamLine['postmtter'] postmtter,
usParamLine['unerroreds'] unerroreds,
usParamLine['corrected'] corrected,
usParamLine['uncorrectables'] uncorrectables,
from_unixtime(cast(datats/1000 as bigint),'MMdd') 
day_ts,
cmtsid
   insert into table DwnStreamParam partition(day_ts, cmtsid)
   select  cmtstimestamp,datats,macaddress,
dsParamLine['chnlidx'] chnlidx,
dsParamLine['modulation'] modulation,
dsParamLine['severity'] severity,
dsParamLine['rxpower'] rxpower,
dsParamLine['sigqnoise'] sigqnoise,
dsParamLine['noisedeviation'] noisedeviation,
dsParamLine['prefecber'] prefecber,
dsParamLine['postfecber'] postfecber,
dsParamLine['sigqrxmer'] sigqrxmer,
dsParamLine['sigqmicroreflection'] sigqmicroreflection,
dsParamLine['unerroreds'] unerroreds,
dsParamLine['corrected'] corrected,
dsParamLine['uncorrectables'] uncorrectables,
from_unixtime(cast(datats/1000 as bigint),'MMdd') 
day_ts,
cmtsid

""")



On Thu, Oct 8, 2015 at 9:51 PM, Daniel Haviv 
<daniel.ha...@veracity-group.com<mailto:daniel.ha...@veracity-group.com>> wrote:
Hi,
I'm inserting into a partitioned ORC table using an insert sql statement passed 
via HiveContext.
The performance I'm getting is pretty bad and I was wondering if there are ways 
to speed things up.
Would saving the DF like this 
df.write().mode(SaveMode.Append).partitionBy("date").saveAsTable("Tablename") 
be faster ?


Thank you.
Daniel




[jira] [Closed] (SPARK-11041) Add (NOT) IN / EXISTS support for predicates

2015-10-09 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao closed SPARK-11041.
-
Resolution: Duplicate

> Add (NOT) IN / EXISTS support for predicates
> 
>
> Key: SPARK-11041
> URL: https://issues.apache.org/jira/browse/SPARK-11041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>    Reporter: Cheng Hao
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11041) Add (NOT) IN / EXISTS support for predicates

2015-10-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-11041:
-

 Summary: Add (NOT) IN / EXISTS support for predicates
 Key: SPARK-11041
 URL: https://issues.apache.org/jira/browse/SPARK-11041
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



RE: Insert via HiveContext is slow

2015-10-08 Thread Cheng, Hao
I think that's a known performance issue (compared to Hive) of Spark SQL in 
multi-inserts.
A workaround is to create a temporary cached table for the projection first, and then 
do the multiple inserts based on the cached table, along the lines of the sketch below.
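
A minimal sketch of that workaround, reusing the table/column names from the 
query quoted below (the exact projection is up to you):

val projected = sqlContext.sql(
  """SELECT cmtstimestamp, datats, macaddress, usParamLine, dsParamLine
     FROM avro_events
     LATERAL VIEW explode(usChnlList) u AS usParamLine
     LATERAL VIEW explode(dsChnlList) d AS dsParamLine""")
projected.registerTempTable("exploded_events")
sqlContext.cacheTable("exploded_events")
// ...then run the two INSERT INTO ... SELECT statements against
// exploded_events instead of repeating the explode over avro_events.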

We are actually working on a POC for some similar cases; hopefully it comes 
out soon.

Hao

From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com]
Sent: Friday, October 9, 2015 3:08 AM
To: user
Subject: Re: Insert via HiveContext is slow

Forgot to mention that my insert is a multi table insert :
sqlContext2.sql("""from avro_events
   lateral view explode(usChnlList) usParamLine as usParamLine
   lateral view explode(dsChnlList) dsParamLine as dsParamLine
   insert into table UpStreamParam partition(day_ts, cmtsid)
   select cmtstimestamp,datats,macaddress,
usParamLine['chnlidx'] chnlidx,
usParamLine['modulation'] modulation,
usParamLine['severity'] severity,
usParamLine['rxpower'] rxpower,
usParamLine['sigqnoise'] sigqnoise,
usParamLine['noisedeviation'] noisedeviation,
usParamLine['prefecber'] prefecber,
usParamLine['postfecber'] postfecber,
usParamLine['txpower'] txpower,
usParamLine['txpowerdrop'] txpowerdrop,
usParamLine['nmter'] nmter,
usParamLine['premtter'] premtter,
usParamLine['postmtter'] postmtter,
usParamLine['unerroreds'] unerroreds,
usParamLine['corrected'] corrected,
usParamLine['uncorrectables'] uncorrectables,
from_unixtime(cast(datats/1000 as bigint),'MMdd') 
day_ts,
cmtsid
   insert into table DwnStreamParam partition(day_ts, cmtsid)
   select  cmtstimestamp,datats,macaddress,
dsParamLine['chnlidx'] chnlidx,
dsParamLine['modulation'] modulation,
dsParamLine['severity'] severity,
dsParamLine['rxpower'] rxpower,
dsParamLine['sigqnoise'] sigqnoise,
dsParamLine['noisedeviation'] noisedeviation,
dsParamLine['prefecber'] prefecber,
dsParamLine['postfecber'] postfecber,
dsParamLine['sigqrxmer'] sigqrxmer,
dsParamLine['sigqmicroreflection'] sigqmicroreflection,
dsParamLine['unerroreds'] unerroreds,
dsParamLine['corrected'] corrected,
dsParamLine['uncorrectables'] uncorrectables,
from_unixtime(cast(datats/1000 as bigint),'MMdd') 
day_ts,
cmtsid

""")



On Thu, Oct 8, 2015 at 9:51 PM, Daniel Haviv 
> wrote:
Hi,
I'm inserting into a partitioned ORC table using an insert sql statement passed 
via HiveContext.
The performance I'm getting is pretty bad and I was wondering if there are ways 
to speed things up.
Would saving the DF like this 
df.write().mode(SaveMode.Append).partitionBy("date").saveAsTable("Tablename") 
be faster ?


Thank you.
Daniel



[jira] [Updated] (SPARK-10992) Partial Aggregation Support for Hive UDAF

2015-10-07 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-10992:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-4366

> Partial Aggregation Support for Hive UDAF
> -
>
> Key: SPARK-10992
> URL: https://issues.apache.org/jira/browse/SPARK-10992
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Cheng Hao
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10831) Spark SQL Configuration missing in the doc

2015-09-25 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10831:
-

 Summary: Spark SQL Configuration missing in the doc
 Key: SPARK-10831
 URL: https://issues.apache.org/jira/browse/SPARK-10831
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Reporter: Cheng Hao


E.g.
spark.sql.codegen
spark.sql.planner.sortMergeJoin
spark.sql.dialect
spark.sql.caseSensitive
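
For reference, these keys are set/read at runtime like any other Spark SQL conf 
(a small sketch, assuming an existing sqlContext):
{code}
sqlContext.setConf("spark.sql.codegen", "true")
sqlContext.setConf("spark.sql.caseSensitive", "false")
println(sqlContext.getConf("spark.sql.dialect", "sql"))
{code}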



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-09-24 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10829:
-

 Summary: Scan DataSource with predicate expression combine 
partition key and attributes doesn't work
 Key: SPARK-10829
 URL: https://issues.apache.org/jira/browse/SPARK-10829
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker


To reproduce that with the code:
{code}
withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
  withTempPath { dir =>
val path = s"${dir.getCanonicalPath}/part=1"
(1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)

// If the "part = 1" filter gets pushed down, this query will throw an 
exception since
// "part" is not a valid column in the actual Parquet file
checkAnswer(
  sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 1)"),
  (2 to 3).map(i => Row(i, i.toString, 1)))
  }
}
{code}
We expect the result as:
{code}
2, 1
3, 1
{code}
But we got:
{code}
1, 1
2, 1
3, 1
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-23 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904778#comment-14904778
 ] 

Cheng Hao commented on SPARK-10733:
---

[~jameszhouyi] Can you please apply the patch 
https://github.com/chenghao-intel/spark/commit/91af33397100802d6ba577a3f423bb47d5a761ea
 and try your workload? And be sure to set the log level to `INFO`.

[~andrewor14] [~yhuai] One possibility is that Sort-Merge-Join eats up all of the 
memory, as Sort-Merge-Join will not free its memory until we finish iterating 
over all of the join results; however, partial aggregation actually consumes the 
iterator of the join results, which means there may be no memory left at all for 
the aggregation.

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



RE: Performance Spark SQL vs Dataframe API faster

2015-09-22 Thread Cheng, Hao
Yes, it should be the same: they are just different frontends over the same 
optimization / execution pipeline.
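
A minimal sketch, assuming a hypothetical temp table "people": both forms go 
through the same Catalyst optimizer, which you can verify by comparing their 
explain() output.

val viaSql = sqlContext.sql("SELECT name FROM people WHERE age > 21")
val viaDf  = sqlContext.table("people").filter("age > 21").select("name")
viaSql.explain(true)
viaDf.explain(true)   // the optimized / physical plans should match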

-Original Message-
From: sanderg [mailto:s.gee...@wimionline.be] 
Sent: Tuesday, September 22, 2015 10:06 PM
To: user@spark.apache.org
Subject: Performance Spark SQL vs Dataframe API faster

Is there a difference in performance between writing a spark job using only SQL 
statements and writing it using the dataframe api or does it translate to the 
same thing under the hood?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Performance-Spark-SQL-vs-Dataframe-API-faster-tp24768.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-17 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14802912#comment-14802912
 ] 

Cheng Hao commented on SPARK-10474:
---

The root reason for this failure is 
`TungstenAggregationIterator.switchToSortBasedAggregation`: the hash aggregation 
has already eaten up the memory, so we cannot allocate memory when switching to 
the sort-based aggregation, not even at spill time.

I posted a workaround PR for discussion.

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')

[jira] [Comment Edited] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-17 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14802912#comment-14802912
 ] 

Cheng Hao edited comment on SPARK-10474 at 9/17/15 1:48 PM:


The root reason for this failure is the trigger condition for switching from 
hash-based aggregation to sort-based aggregation in the `TungstenAggregationIterator`. 
The current logic is: if no more memory can be allocated, switch to sort-based 
aggregation; however, since there is no memory left, the data spill will also fail 
in UnsafeExternalSorter.initializeForWriting.

I posted a workaround PR for discussion.


was (Author: chenghao):
The root reason for this failure is 
`TungstenAggregationIterator.switchToSortBasedAggregation`: the hash aggregation 
has already eaten up the memory, so we cannot allocate memory when switching to 
the sort-based aggregation, not even at spill time.

I posted a workaround PR for discussion.

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 

[jira] [Commented] (SPARK-10606) Cube/Rollup/GrpSet doesn't create the correct plan when group by is on something other than an AttributeReference

2015-09-16 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791499#comment-14791499
 ] 

Cheng Hao commented on SPARK-10606:
---

[~rhbutani] Which version are you using? Actually, I've fixed this bug in 
SPARK-8972, and the fix should be included in 1.5. Can you try it with 1.5?

> Cube/Rollup/GrpSet doesn't create the correct plan when group by is on 
> something other than an AttributeReference
> -
>
> Key: SPARK-10606
> URL: https://issues.apache.org/jira/browse/SPARK-10606
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Harish Butani
>Priority: Critical
>
> Consider the following table: t(a : String, b : String) and the query
> {code}
> select a, concat(b, '1'), count(*)
> from t
> group by a, concat(b, '1') with cube
> {code}
> The projections in the Expand operator are not setup correctly. The expand 
> logic in Analyzer:expand is comparing grouping expressions against 
> child.output. So {{concat(b, '1')}} is never mapped to a null Literal.  
> A simple fix is to add a Rule to introduce a Projection below the 
> Cube/Rollup/GrpSet operator that additionally projects the   
> groupingExpressions that are missing in the child.
> Marking this as Critical, because you get wrong results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



RE: RE: spark sql hook

2015-09-16 Thread Cheng, Hao
Probably a workable solution is to create your own SQLContext by extending the 
HiveContext class, override the `analyzer`, and add your own rule to do the 
rewriting; a rough sketch is below.
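
A rough, untested sketch of that approach against Spark 1.5-era internals (these 
are private APIs whose signatures change between versions, e.g. UnresolvedRelation 
took a Seq[String] identifier in 1.5.x), using your test -> production example. 
The subclass has to live under the org.apache.spark.sql package tree because 
`analyzer` is protected[sql]:

package org.apache.spark.sql.hive

import org.apache.spark.SparkContext
import org.apache.spark.sql.catalyst.analysis.{Analyzer, UnresolvedRelation}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

class RewritingHiveContext(sc: SparkContext) extends HiveContext(sc) {

  // Redirect any table referenced as test.<name> to production.<name>.
  object RedirectTestDb extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
      case r @ UnresolvedRelation(Seq("test", table), _) =>
        r.copy(tableIdentifier = Seq("production", table))
    }
  }

  override protected[sql] lazy val analyzer: Analyzer =
    new Analyzer(catalog, functionRegistry, conf) {
      // NOTE: a real implementation should also keep HiveContext's own
      // extendedResolutionRules (ParquetConversions, CreateTables, ...).
      override val extendedResolutionRules = Seq(RedirectTestDb)
    }
}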

From: r7raul1...@163.com [mailto:r7raul1...@163.com]
Sent: Thursday, September 17, 2015 11:08 AM
To: Cheng, Hao; user
Subject: Re: RE: spark sql hook

Example:
select * from test.table chang to  select * from production.table


r7raul1...@163.com<mailto:r7raul1...@163.com>

From: Cheng, Hao<mailto:hao.ch...@intel.com>
Date: 2015-09-17 11:05
To: r7raul1...@163.com<mailto:r7raul1...@163.com>; 
user<mailto:user@spark.apache.org>
Subject: RE: spark sql hook
Catalyst TreeNode is a very fundamental API; I'm not sure what kind of hook you need. 
A concrete example would be more helpful for understanding your requirement.

Hao

From: r7raul1...@163.com<mailto:r7raul1...@163.com> [mailto:r7raul1...@163.com]
Sent: Thursday, September 17, 2015 10:54 AM
To: user
Subject: spark sql hook


I want to modify some sql treenode before execute. I cau do this by hive hook 
in hive. Does spark support such hook? Any advise?

r7raul1...@163.com<mailto:r7raul1...@163.com>


RE: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Cheng, Hao
We actually hit a similar problem in a real case, see 
https://issues.apache.org/jira/browse/SPARK-10474

After checking the source code, the external sort memory management strategy 
seems to be the root cause of the issue.

Currently, we allocate an initial 4MB (page size) buffer at the beginning of the 
sort, and while processing each input record we can run into the cycle of 
spill => de-allocate buffer => try to allocate a buffer of twice the size. I know 
this strategy is quite flexible in some cases. However, in a data-skew case, say 
2 tasks with a large number of records running on a single executor, the 
ever-growing buffer strategy will eventually eat up the pre-set off-heap memory 
threshold, and then an exception is thrown like the one we’ve seen.

I mean, can we just take a simple memory management strategy for the external 
sorter, like the following (a rough sketch in code follows the steps)?
Step 1) Allocate a fixed-size buffer for the current task (maybe: 
MAX_MEMORY_THRESHOLD / (2 * PARALLEL_TASKS_PER_EXECUTOR))
Step 2) for (record in the input) { if (!hasMemoryForRecord(record)) 
spill(buffer); insert(record); }
Step 3) Deallocate(buffer)
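
A rough, self-contained Scala sketch of the same idea (all names hypothetical, 
not actual Spark internals): one fixed-size buffer per task, spill whenever the 
next record does not fit, then reuse the buffer.

import scala.collection.mutable.ArrayBuffer

class FixedBuffer(capacityBytes: Long) {
  private val records = ArrayBuffer[Array[Byte]]()
  private var used = 0L
  def hasRoomFor(r: Array[Byte]): Boolean = used + r.length <= capacityBytes
  def insert(r: Array[Byte]): Unit = { records += r; used += r.length }
  def spill(): Unit = { /* write `records` to disk here */ records.clear(); used = 0L }
}

def sortWithFixedBuffer(input: Iterator[Array[Byte]], capacityBytes: Long): Unit = {
  val buffer = new FixedBuffer(capacityBytes)   // Step 1: fixed allocation per task
  for (record <- input) {                       // Step 2: spill when full, then insert
    if (!buffer.hasRoomFor(record)) buffer.spill()
    buffer.insert(record)
  }
  buffer.spill()                                // Step 3: final flush / release
}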

Probably we’d better move the discussion to the JIRA.
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, September 17, 2015 12:28 AM
To: Pete Robbins
Cc: Dev
Subject: Re: Unable to acquire memory errors in HiveCompatibilitySuite

SparkEnv for the driver was created in SparkContext. The default parallelism 
field is set to the number of slots (max number of active tasks). Maybe we can 
just use the default parallelism to compute that in local mode.

On Wednesday, September 16, 2015, Pete Robbins 
> wrote:
so forcing the ShuffleMemoryManager to assume 32 cores and therefore calculate 
a pagesize of 1MB passes the tests.
How can we determine the correct value to use in getPageSize rather than 
Runtime.getRuntime.availableProcessors()?

On 16 September 2015 at 10:17, Pete Robbins 
> 
wrote:
I see what you are saying. Full stack trace:

java.io.IOException: Unable to acquire 4194304 bytes of memory
  at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
  at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:349)
  at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:478)
  at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:138)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:489)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
  at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 

RE: spark sql hook

2015-09-16 Thread Cheng, Hao
Catalyst TreeNode is a very fundamental API; I'm not sure what kind of hook you need. 
A concrete example would be more helpful for understanding your requirement.

Hao

From: r7raul1...@163.com [mailto:r7raul1...@163.com]
Sent: Thursday, September 17, 2015 10:54 AM
To: user
Subject: spark sql hook


I want to modify some sql treenode before execute. I cau do this by hive hook 
in hive. Does spark support such hook? Any advise?

r7raul1...@163.com


[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746642#comment-14746642
 ] 

Cheng Hao commented on SPARK-4226:
--

Thank you [~brooks], you're right! I meant it would make the implementation more 
complicated, e.g. having to resolve and split the conjunctions of the condition; 
that's also what I was trying to avoid in my PR by using the anti-join. 

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = HiveContext(hc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near-future release (1.3, maybe?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744969#comment-14744969
 ] 

Cheng Hao commented on SPARK-10474:
---

The root cause of the exception is that the executor doesn't have enough memory for 
the external sorting (UnsafeXXXSorter).
The memory available for the sorting is MAX_JVM_HEAP * spark.shuffle.memoryFraction 
* spark.shuffle.safetyFraction.

So a workaround is to give the JVM more memory, or to raise the Spark conf keys 
"spark.shuffle.memoryFraction" (0.2 by default) and 
"spark.shuffle.safetyFraction" (0.8 by default).


> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) 

[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744966#comment-14744966
 ] 

Cheng Hao commented on SPARK-10466:
---

[~naliazheli] This is an unrelated issue; you'd better subscribe to the Spark 
mailing list and ask the question there in English. 
See http://spark.apache.org/community.html

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Cheng Hao
>    Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if we have data spill, it will cause assert exception, 
> the follow code can reproduce that
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitione

[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745008#comment-14745008
 ] 

Cheng Hao commented on SPARK-10474:
---

But given the current implementation, we'd better not throw an exception if 
the requested (off-heap) memory cannot be acquired; maybe we should use fixed memory 
allocations for both the data page and the pointer array. What do you think?

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.

[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744967#comment-14744967
 ] 

Cheng Hao commented on SPARK-10466:
---

[~naliazheli] This is an unrelated issue; you'd better subscribe to the Spark 
mailing list and ask the question there in English. 
See http://spark.apache.org/community.html

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Cheng Hao
>    Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if we have data spill, it will cause assert exception, 
> the follow code can reproduce that
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitione

[jira] [Issue Comment Deleted] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-10466:
--
Comment: was deleted

(was: [~naliazheli] This is an unrelated issue; you'd be better off subscribing 
to the Spark mailing list and asking the question there in English. 
See http://spark.apache.org/community.html)

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
>                 Key: SPARK-10466
>                 URL: https://issues.apache.org/jira/browse/SPARK-10466
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Cheng Hao
>            Assignee: Cheng Hao
>            Priority: Blocker
>             Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if we have a data spill it will cause an assertion 
> failure; the following code can reproduce it:
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitione

[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745467#comment-14745467
 ] 

Cheng Hao commented on SPARK-4226:
--

[~marmbrus] [~yhuai] After investigating a little bit, I think using an 
anti-join is much more efficient than rewriting NOT IN / NOT EXISTS as a left 
outer join followed by null filtering: the anti-join can return a negative 
result as soon as it finds the first match in the second relation, whereas the 
left outer join has to produce every matching pair and then filter them out 
(a sketch of the anti-join form follows at the end of this comment).

Besides, for the NOT EXISTS clause the implementation seems more complicated 
without the anti-join. For example:
{code}
mysql> select * from d1;
+------+------+
| a    | b    |
+------+------+
|    2 |    2 |
|    8 |   10 |
+------+------+
2 rows in set (0.00 sec)

mysql> select * from d2;
+------+------+
| a    | b    |
+------+------+
|    1 |    1 |
|    8 | NULL |
|    0 |    0 |
+------+------+
3 rows in set (0.00 sec)

mysql> select * from d1 where not exists (select b from d2 where d1.a=d2.a);
+------+------+
| a    | b    |
+------+------+
|    2 |    2 |
+------+------+
1 row in set (0.00 sec)

// If we rewrite the above query as a left outer join, the filter condition
// cannot simply test the column projected by the subquery.
mysql> select d1.a, d1.b from d1 left join d2 on d1.a=d2.a where d2.b is null;
+------+------+
| a    | b    |
+------+------+
|    8 |   10 |
|    2 |    2 |
+------+------+
2 rows in set (0.00 sec)
// this gives a different result from NOT EXISTS.
{code}
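
On this particular data, the left-outer rewrite only agrees with NOT EXISTS if 
the null filter is applied to the join key (which is NULL exactly when no d2 
row matched) rather than to the subquery's projected column. A minimal sketch, 
for illustration only, assuming the same d1/d2 tables as above:
{code:sql}
-- Filter on the join key d2.a instead of d2.b; d2.b can be NULL even for a
-- matched row such as (8, NULL), while d2.a is NULL only when nothing matched.
SELECT d1.a, d1.b
FROM d1 LEFT OUTER JOIN d2 ON d1.a = d2.a
WHERE d2.a IS NULL;
-- returns only (2, 2), matching the NOT EXISTS result on this data
{code}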

If you feel that makes sense, I can reopen my PR and rebase it.
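
As a rough illustration of the anti-join formulation, later Spark SQL versions 
expose it directly as a LEFT ANTI JOIN; a minimal sketch, assuming Spark 2.0+ 
syntax and the same d1/d2 tables:
{code:sql}
-- Anti-join form of the NOT EXISTS query: a d1 row is emitted only when no
-- d2 row with an equal key exists, and probing can stop at the first match.
SELECT d1.a, d1.b
FROM d1 LEFT ANTI JOIN d2 ON d1.a = d2.a;
-- returns (2, 2) on the sample data above
{code}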

> SparkSQL - Add support for subqueries in predicates
> ---
>
>                 Key: SPARK-4226
>                 URL: https://issues.apache.org/jira/browse/SPARK-4226
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0
>         Environment: Spark 1.2 snapshot
>            Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = new HiveContext(sc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at
