[jira] [Commented] (SPARK-18878) Fix/investigate the more identified test failures in Java/Scala on Windows

2016-12-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750709#comment-15750709
 ] 

Hyukjin Kwon commented on SPARK-18878:
--

cc [~srowen] Currently, not all of the test failures could be identified due to 
the time limit (one hour) in AppVeyor. The limit was increased for my account 
after I asked manually - https://github.com/appveyor/ci/issues/517 - but it 
seems it cannot be raised much further (it is now up to one hour and 30 
minutes).

I will describe the errors in each child task after testing them separately 
where required.

> Fix/investigate the more identified test failures in Java/Scala on Windows
> --
>
> Key: SPARK-18878
> URL: https://issues.apache.org/jira/browse/SPARK-18878
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Hyukjin Kwon
>
> It seems many tests fail on Windows. Some failures are related only to the 
> tests themselves, whereas others are related to the functionality itself, 
> which causes actual failures for some APIs on Windows.
> The tests were hanging due to the issues in SPARK-17591 and SPARK-18785, and 
> now we can apparently proceed much further (it seems we might reach the end).
> The tests were run via AppVeyor - 
> https://ci.appveyor.com/project/spark-test/spark/build/259-spark-test-windows






[jira] [Created] (SPARK-18878) Fix/investigate the more identified test failures in Java/Scala on Windows

2016-12-14 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18878:


 Summary: Fix/investigate the more identified test failures in 
Java/Scala on Windows
 Key: SPARK-18878
 URL: https://issues.apache.org/jira/browse/SPARK-18878
 Project: Spark
  Issue Type: Test
  Components: Tests
Reporter: Hyukjin Kwon


It seems many tests fail on Windows. Some failures are related only to the 
tests themselves, whereas others are related to the functionality itself, which 
causes actual failures for some APIs on Windows.

The tests were hanging due to the issues in SPARK-17591 and SPARK-18785, and 
now we can apparently proceed much further (it seems we might reach the end).

The tests were run via AppVeyor - 
https://ci.appveyor.com/project/spark-test/spark/build/259-spark-test-windows







[jira] [Assigned] (SPARK-18669) Update Apache docs regarding watermarking in Structured Streaming

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18669:


Assignee: Tathagata Das  (was: Apache Spark)

> Update Apache docs regarding watermarking in Structured Streaming
> --
>
> Key: SPARK-18669
> URL: https://issues.apache.org/jira/browse/SPARK-18669
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>







[jira] [Assigned] (SPARK-18669) Update Apache docs regarding watermarking in Structured Streaming

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18669:


Assignee: Apache Spark  (was: Tathagata Das)

> Update Apache docs regarding watermarking in Structured Streaming
> --
>
> Key: SPARK-18669
> URL: https://issues.apache.org/jira/browse/SPARK-18669
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-18669) Update Apache docs regarding watermarking in Structured Streaming

2016-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750690#comment-15750690
 ] 

Apache Spark commented on SPARK-18669:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16294

> Update Apache docs regarding watermarking in Structured Streaming
> --
>
> Key: SPARK-18669
> URL: https://issues.apache.org/jira/browse/SPARK-18669
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>







[jira] [Assigned] (SPARK-17119) Add configuration property to allow the history server to delete .inprogress files

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17119:


Assignee: Apache Spark

> Add configuration property to allow the history server to delete .inprogress 
> files
> --
>
> Key: SPARK-17119
> URL: https://issues.apache.org/jira/browse/SPARK-17119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Bjorn Jonsson
>Assignee: Apache Spark
>Priority: Minor
>  Labels: historyserver
>
> The History Server (HS) currently considers only completed applications when 
> deleting event logs from spark.history.fs.logDirectory (since SPARK-6879). 
> This means that over time, .inprogress files (from failed jobs, jobs where 
> the SparkContext is not closed, spark-shell exits, etc.) can accumulate and 
> impact the HS.
> Instead of having to delete these files manually, users could perhaps have the 
> option of telling the HS to delete all files where (now - 
> attempt.lastUpdated) > spark.history.fs.cleaner.maxAge, or just delete 
> .inprogress files whose lastUpdated is older than 7d (a sketch of this 
> condition follows the quoted description).
> https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L467
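For illustration only, here is a REPL-style Scala sketch of the cleanup condition 
proposed above; the case class and value names are hypothetical and this is not 
the FsHistoryProvider code.

{code}
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for a history-server attempt record.
case class AttemptInfo(logPath: String, lastUpdated: Long, completed: Boolean)

// Proposed condition: an attempt becomes eligible for deletion once it has not
// been updated within maxAge, whether or not its log still ends in .inprogress.
def isExpired(a: AttemptInfo, now: Long, maxAgeMs: Long): Boolean =
  now - a.lastUpdated > maxAgeMs

val maxAgeMs = TimeUnit.DAYS.toMillis(7)   // e.g. spark.history.fs.cleaner.maxAge = 7d
val now = System.currentTimeMillis()

val attempts = Seq(
  AttemptInfo("app-1.inprogress", now - TimeUnit.DAYS.toMillis(10), completed = false),
  AttemptInfo("app-2", now - TimeUnit.DAYS.toMillis(1), completed = true)
)

// Under the proposal, app-1.inprogress would be deleted; app-2 would be kept.
attempts.filter(a => isExpired(a, now, maxAgeMs)).foreach(a => println(s"would delete ${a.logPath}"))
{code}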






[jira] [Commented] (SPARK-17119) Add configuration property to allow the history server to delete .inprogress files

2016-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750683#comment-15750683
 ] 

Apache Spark commented on SPARK-17119:
--

User 'cnZach' has created a pull request for this issue:
https://github.com/apache/spark/pull/16293

> Add configuration property to allow the history server to delete .inprogress 
> files
> --
>
> Key: SPARK-17119
> URL: https://issues.apache.org/jira/browse/SPARK-17119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Bjorn Jonsson
>Priority: Minor
>  Labels: historyserver
>
> The History Server (HS) currently considers only completed applications when 
> deleting event logs from spark.history.fs.logDirectory (since SPARK-6879). 
> This means that over time, .inprogress files (from failed jobs, jobs where 
> the SparkContext is not closed, spark-shell exits, etc.) can accumulate and 
> impact the HS.
> Instead of having to delete these files manually, users could perhaps have the 
> option of telling the HS to delete all files where (now - 
> attempt.lastUpdated) > spark.history.fs.cleaner.maxAge, or just delete 
> .inprogress files whose lastUpdated is older than 7d?
> https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L467






[jira] [Assigned] (SPARK-17119) Add configuration property to allow the history server to delete .inprogress files

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17119:


Assignee: (was: Apache Spark)

> Add configuration property to allow the history server to delete .inprogress 
> files
> --
>
> Key: SPARK-17119
> URL: https://issues.apache.org/jira/browse/SPARK-17119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Bjorn Jonsson
>Priority: Minor
>  Labels: historyserver
>
> The History Server (HS) currently considers only completed applications when 
> deleting event logs from spark.history.fs.logDirectory (since SPARK-6879). 
> This means that over time, .inprogress files (from failed jobs, jobs where 
> the SparkContext is not closed, spark-shell exits, etc.) can accumulate and 
> impact the HS.
> Instead of having to delete these files manually, users could perhaps have the 
> option of telling the HS to delete all files where (now - 
> attempt.lastUpdated) > spark.history.fs.cleaner.maxAge, or just delete 
> .inprogress files whose lastUpdated is older than 7d?
> https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L467






[jira] [Created] (SPARK-18877) Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-14 Thread Navya Krishnappa (JIRA)
Navya Krishnappa created SPARK-18877:


 Summary: Unable to read given csv data. Exception: 
java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
exceeds max precision 20
 Key: SPARK-18877
 URL: https://issues.apache.org/jira/browse/SPARK-18877
 Project: Spark
  Issue Type: Bug
Reporter: Navya Krishnappa


When reading the CSV data mentioned below, the following exception is thrown 
even though the maximum decimal precision is 38 (a minimal reading sketch 
follows the sample values): 
java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
exceeds max precision 20


Decimal
2323366225312000
2433573971400
23233662253000
23233662253
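For context, a minimal spark-shell style sketch of reading such a file with an 
explicit DecimalType(38, 0) schema instead of relying on schema inference; it 
assumes a running SparkSession named spark, the column name "Decimal" from the 
sample above, and a hypothetical file path, and it is not the reporter's code.

{code}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

// One column named "Decimal", wide enough for the sample values above.
val schema = StructType(Seq(StructField("Decimal", DecimalType(38, 0))))

// "decimal.csv" is a hypothetical path containing the sample column.
val df = spark.read
  .option("header", "true")
  .schema(schema)          // explicit schema, bypassing inference
  .csv("decimal.csv")

df.printSchema()
df.show()
{code}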








[jira] [Commented] (SPARK-18455) General support for correlated subquery processing

2016-12-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750638#comment-15750638
 ] 

Reynold Xin commented on SPARK-18455:
-

Thanks for sharing the doc. This is a really well-written survey of Spark's 
subquery support.

Do you have documentation on how you plan to do the de-correlation and the rest 
of the work, i.e., for PR2 and PR4?


> General support for correlated subquery processing
> --
>
> Key: SPARK-18455
> URL: https://issues.apache.org/jira/browse/SPARK-18455
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Nattavut Sutyanyong
> Attachments: SPARK-18455-scoping-doc.pdf
>
>
> Subquery support was introduced in Spark 2.0. The initial implementation 
> covers the most common subquery use cases: the ones used in TPC queries, for 
> instance.
> Spark currently supports the following subqueries:
> * Uncorrelated Scalar Subqueries. All cases are supported.
> * Correlated Scalar Subqueries. We only allow subqueries that are aggregated 
> and use equality predicates.
> * Predicate Subqueries. IN or EXISTS types of queries. We allow most 
> predicates, except when they are pulled from under an Aggregate or Window 
> operator; in that case we only support equality predicates.
> However, this does not cover the full range of possible subqueries. This, in 
> part, has to do with the fact that we currently rewrite all correlated 
> subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join.
> We currently lack support for the following use cases:
> * The use of predicate subqueries in a projection.
> * The use of non-equality predicates below Aggregate and/or Window operators.
> * The use of non-Aggregate subqueries for correlated scalar subqueries.
> This JIRA aims to lift these current limitations in subquery processing (see 
> the illustrative queries below the quoted description).
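To illustrate the categories listed above, here are minimal spark-shell style 
examples; they assume a running SparkSession named spark and hypothetical tables 
and columns (customers, orders, c_custkey, o_custkey, o_id, o_total), and are 
not taken from the scoping document.

{code}
// Uncorrelated scalar subquery: supported in all cases.
spark.sql("""
  SELECT o_id, (SELECT max(o_total) FROM orders) AS max_total
  FROM orders""")

// Correlated scalar subquery: must be aggregated and correlated via an equality predicate.
spark.sql("""
  SELECT c_custkey,
         (SELECT sum(o_total) FROM orders o WHERE o.o_custkey = c.c_custkey) AS spent
  FROM customers c""")

// Predicate (EXISTS) subquery in a WHERE clause.
spark.sql("""
  SELECT c_custkey
  FROM customers c
  WHERE EXISTS (SELECT 1 FROM orders o WHERE o.o_custkey = c.c_custkey)""")
{code}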






[jira] [Commented] (SPARK-18862) Split SparkR mllib.R into multiple files

2016-12-14 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750614#comment-15750614
 ] 

Felix Cheung commented on SPARK-18862:
--

ah :)
Would we end up having mllib-gmm.R, mllib-als.R and so on, though? I do worry 
about having 20-30 mllib-* files.

> Split SparkR mllib.R into multiple files
> 
>
> Key: SPARK-18862
> URL: https://issues.apache.org/jira/browse/SPARK-18862
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> SparkR mllib.R is getting bigger as we add more ML wrappers, so I'd like to 
> split it into multiple files to make it easier to maintain:
> * mllibClassification.R
> * mllibRegression.R
> * mllibClustering.R
> * mllibFeature.R
> or:
> * mllib/classification.R
> * mllib/regression.R
> * mllib/clustering.R
> * mllib/features.R
> By R convention, the first way is preferred, and I'm not sure whether R 
> supports the second layout (I will check later). Please let me know your 
> preference. I think the start of a new release cycle is a good opportunity to 
> do this, since it will involve fewer conflicts. If this proposal is 
> approved, I can work on it.
> cc [~felixcheung] [~josephkb] [~mengxr] 






[jira] [Created] (SPARK-18876) An error occurred while trying to connect to the Java server

2016-12-14 Thread pranludi (JIRA)
pranludi created SPARK-18876:


 Summary: An error occurred while trying to connect to the Java 
server
 Key: SPARK-18876
 URL: https://issues.apache.org/jira/browse/SPARK-18876
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.2, 2.0.1
 Environment: Python 2.7.12
Reporter: pranludi


I am trying to create a Spark context object with the following commands in 
PySpark:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
server (127.0.0.1:35918)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 963, 
in start
self.socket.connect((self.address, self.port))
  File "/usr/local/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/gamedev/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
 line 419, in coalesce
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 
1131, in __call__
answer = self.gateway_client.send_command(command)
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 881, 
in send_command
connection = self._get_connection()
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 829, 
in _get_connection
connection = self._create_connection()
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 835, 
in _create_connection
connection.start()
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 970, 
in start
raise Py4JNetworkError(msg, e)
py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
the Java server (127.0.0.1:35918)

-
I tried Spark versions 2.0.0, 2.0.1, and 2.0.2: there is no problem with 2.0.0, 
but the error occurs with 2.0.1 and 2.0.2.

python code
--
.
df = spark.read.json('hdfs://big_big_400.json')

json_log = []
for log in df.collect():
    jj = {}
    try:
        for f in log.__fields__:
            if f == 'I_LogDes':
                if log[f] is not None:
                    log_des_json = json.loads(log[f])
                    for jf in log_des_json:
                        json_key = add_2(jf)
                        if json_key in jj:
                            json_key = '%s_2' % json_key
                        jj[json_key] = typeIntStr(log_des_json[jf])
            else:
                jj[remove_i(f)] = typeIntStr(log[f])
        json_log.append(jj)
    except:
        print log

# !!! the error occurs here
df = spark.read.json(spark.sparkContext.parallelize(json_log))







[jira] [Resolved] (SPARK-18849) Vignettes final checks for Spark 2.1

2016-12-14 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-18849.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16286
[https://github.com/apache/spark/pull/16286]

> Vignettes final checks for Spark 2.1
> 
>
> Key: SPARK-18849
> URL: https://issues.apache.org/jira/browse/SPARK-18849
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Xiangrui Meng
>Assignee: Felix Cheung
> Fix For: 2.1.0
>
>
> Make a final pass over the vignettes and ensure the content is consistent.
> * remove "since version" because it is not that useful for vignettes
> * re-order/group the list of ML algorithms so there is a logical ordering
> * check for warnings or errors in the output messages
> * anything else that seems out of place






[jira] [Updated] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file

2016-12-14 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-18875:
--
Assignee: Dongjoon Hyun

> Fix R API doc generation by adding `DESCRIPTION` file
> -
>
> Key: SPARK-18875
> URL: https://issues.apache.org/jira/browse/SPARK-18875
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SparkR
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.3, 2.1.0
>
>
> Since 1.4.0, the R API documentation index page has had a broken link to the 
> `DESCRIPTION file`. This issue aims to fix that.
> * Official Latest Website: 
> http://spark.apache.org/docs/latest/api/R/index.html
> * Apache Spark 2.1.0-rc2: 
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html






[jira] [Resolved] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file

2016-12-14 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-18875.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.3

Issue resolved by pull request 16292
[https://github.com/apache/spark/pull/16292]

> Fix R API doc generation by adding `DESCRIPTION` file
> -
>
> Key: SPARK-18875
> URL: https://issues.apache.org/jira/browse/SPARK-18875
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SparkR
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.3, 2.1.0
>
>
> Since 1.4.0, the R API documentation index page has had a broken link to the 
> `DESCRIPTION file`. This issue aims to fix that.
> * Official Latest Website: 
> http://spark.apache.org/docs/latest/api/R/index.html
> * Apache Spark 2.1.0-rc2: 
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html






[jira] [Commented] (SPARK-18862) Split SparkR mllib.R into multiple files

2016-12-14 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750435#comment-15750435
 ] 

Yanbo Liang commented on SPARK-18862:
-

Great! I found that other R packages organize source files in a flat structure, 
so I was a bit worried that R could not support subdirectories. Thanks for your 
reference, it's very helpful.
As for the naming, I think {{ml}} is not an official name; we still use {{mllib}} 
publicly, see [here|https://github.com/apache/spark/pull/16241/files]. I think 
grouping by algorithm family is very reasonable, so I would like to use the 
names {{mllib-glm.R, mllib-gbt.R, mllib-randomForest.R, etc.}}. What do you 
think? Thanks.

> Split SparkR mllib.R into multiple files
> 
>
> Key: SPARK-18862
> URL: https://issues.apache.org/jira/browse/SPARK-18862
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> SparkR mllib.R is getting bigger as we add more ML wrappers, so I'd like to 
> split it into multiple files to make it easier to maintain:
> * mllibClassification.R
> * mllibRegression.R
> * mllibClustering.R
> * mllibFeature.R
> or:
> * mllib/classification.R
> * mllib/regression.R
> * mllib/clustering.R
> * mllib/features.R
> By R convention, the first way is preferred, and I'm not sure whether R 
> supports the second layout (I will check later). Please let me know your 
> preference. I think the start of a new release cycle is a good opportunity to 
> do this, since it will involve fewer conflicts. If this proposal is 
> approved, I can work on it.
> cc [~felixcheung] [~josephkb] [~mengxr] 






[jira] [Resolved] (SPARK-18869) Add TreeNode.p that returns BaseType

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18869.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.3

> Add TreeNode.p that returns BaseType
> 
>
> Key: SPARK-18869
> URL: https://issues.apache.org/jira/browse/SPARK-18869
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.3, 2.1.0
>
>
> After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] 
> rather than a more specific type. It would be easier for interactive 
> debugging to introduce a function that returns the BaseType.






[jira] [Updated] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17822:

Fix Version/s: 2.0.3

> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>Assignee: Xiangrui Meng
> Fix For: 2.0.3, 2.1.0
>
> Attachments: screenshot-1.png
>
>
> JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we 
> observed that JVM objects that are no longer used are still trapped in this 
> map, which prevents those objects from being GCed. 
> It seems it makes sense to use weak references (like persistentRdds in 
> SparkContext). 






[jira] [Updated] (SPARK-18793) SparkR vignette update: random forest

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18793:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> SparkR vignette update: random forest
> -
>
> Key: SPARK-18793
> URL: https://issues.apache.org/jira/browse/SPARK-18793
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
> Fix For: 2.1.0
>
>
> Update vignettes to cover randomForest






[jira] [Updated] (SPARK-18865) SparkR vignettes MLP and LDA updates

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18865:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> SparkR vignettes MLP and LDA updates
> 
>
> Key: SPARK-18865
> URL: https://issues.apache.org/jira/browse/SPARK-18865
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.0
>
>
> spark.mlp doesn't provide an example. spark.lda and spark.mlp have repeated 
> content. 
> The spark.lda documentation is missing default values for some parameters.






[jira] [Updated] (SPARK-18751) Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18751:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext
> 
>
> Key: SPARK-18751
> URL: https://issues.apache.org/jira/browse/SPARK-18751
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.0
>
>
> When SparkContext.stop is called in Utils.tryOrStopSparkContext (the 
> following three places), it will cause a deadlock because the stop method 
> needs to wait for the thread running stop to exit.
> - ContextCleaner.keepCleaning
> - LiveListenerBus.listenerThread.run
> - TaskSchedulerImpl.start






[jira] [Updated] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18840:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> HDFSCredentialProvider throws exception in non-HDFS security environment
> 
>
> Key: SPARK-18840
> URL: https://issues.apache.org/jira/browse/SPARK-18840
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently in {{HDFSCredentialProvider}}, the code assumes an HDFS delegation 
> token exists. This is fine for an HDFS environment, but in some cloud 
> environments like Azure, HDFS is not required, so it will throw an exception:
> {code}
> java.util.NoSuchElementException: head of empty list
> at scala.collection.immutable.Nil$.head(List.scala:337)
> at scala.collection.immutable.Nil$.head(List.scala:334)
> at 
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627)
> {code}
> We should also consider this situation.
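A generic REPL-style sketch of the defensive pattern implied above, using 
headOption so that an empty token list yields None instead of throwing; the 
value names are hypothetical and this is not Spark's actual code.

{code}
// No HDFS delegation tokens available, as in a non-HDFS cloud environment.
val renewalIntervals: List[Long] = Nil

// .head would throw java.util.NoSuchElementException here; .headOption does not.
renewalIntervals.headOption match {
  case Some(interval) => println(s"token renewal interval: $interval ms")
  case None           => println("no HDFS delegation token found; skipping renewal setup")
}
{code}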






[jira] [Updated] (SPARK-18835) Do not expose shaded types in JavaTypeInference API

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18835:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> Do not expose shaded types in JavaTypeInference API
> ---
>
> Key: SPARK-18835
> URL: https://issues.apache.org/jira/browse/SPARK-18835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, {{inferDataType(TypeToken)}} is called from a different maven 
> module, and because we shade Guava, that sometimes leads to errors (e.g. when 
> running tests using maven):
> {noformat}
> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.084 sec  
> <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
> at 
> test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
> Results :
> Tests in error: 
>   JavaUDFSuite.udf3Test:107 » NoSuchMethod 
> org.apache.spark.sql.catalyst.JavaTyp...
> {noformat}
> Instead, we shouldn't expose Guava types in these APIs.






[jira] [Updated] (SPARK-18349) Update R API documentation on ml model summary

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18349:

Fix Version/s: (was: 2.1.1)
   2.1.0

> Update R API documentation on ml model summary
> --
>
> Key: SPARK-18349
> URL: https://issues.apache.org/jira/browse/SPARK-18349
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Miao Wang
> Fix For: 2.1.0
>
>
> It has been discovered that there is a fair bit of inconsistency in the 
> documentation of the summary functions, e.g.
> {code}
> #' @return \code{summary} returns a summary object of the fitted model, a 
> list of components
> #' including formula, number of features, list of features, feature 
> importances, number of
> #' trees, and tree weights
> setMethod("summary", signature(object = "GBTRegressionModel")
> {code}
> For instance, what should be listed for the return value? Should it be a name 
> or a phrase, or should it be a list of items? And should there be a longer 
> description of what they mean, or a reference link to the Scala doc?
> We will need to review this for all model summary implementations in mllib.R.






[jira] [Updated] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17822:

Fix Version/s: (was: 2.1.1)
   (was: 2.0.3)
   (was: 2.2.0)
   2.1.0

> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>Assignee: Xiangrui Meng
> Fix For: 2.1.0
>
> Attachments: screenshot-1.png
>
>
> JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we 
> observed that JVM objects that are no longer used are still trapped in this 
> map, which prevents those objects from being GCed. 
> It seems it makes sense to use weak references (like persistentRdds in 
> SparkContext). 






[jira] [Updated] (SPARK-18681) Throw Filtering is supported only on partition keys of type string exception

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18681:

Fix Version/s: (was: 2.1.1)
   2.1.0

> Throw Filtering is supported only on partition keys of type string exception
> 
>
> Key: SPARK-18681
> URL: https://issues.apache.org/jira/browse/SPARK-18681
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
> Fix For: 2.1.0
>
>
> Cloudera put 
> {{/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml}}
>  as the configuration file for the Hive Metastore Server, where 
> {{hive.metastore.try.direct.sql=false}}. But Spark reads the gateway 
> configuration file and gets the default value 
> {{hive.metastore.try.direct.sql=true}}. We should use the {{getMetaConf}} or 
> {{getMSC.getConfigValue}} method to obtain the original configuration from 
> the Hive Metastore Server.
> {noformat}
> spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT);
> Time taken: 0.221 seconds
> spark-sql> select * from test where part=1 limit 10;
> 16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
> test where part=1 limit 10]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133)
>   at 
> 

[jira] [Updated] (SPARK-18797) Update spark.logit in sparkr-vignettes

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18797:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> Update spark.logit in sparkr-vignettes
> --
>
> Key: SPARK-18797
> URL: https://issues.apache.org/jira/browse/SPARK-18797
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.0
>
>
> spark.logit is added in 2.1. We need to update spark-vignettes to reflect the 
> changes. This is part of SparkR QA work.






[jira] [Updated] (SPARK-18812) Clarify "Spark ML"

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18812:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> Clarify "Spark ML"
> --
>
> Key: SPARK-18812
> URL: https://issues.apache.org/jira/browse/SPARK-18812
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 2.1.0
>
>
> It is useful to add an FAQ entry to explain "Spark ML" and reduce confusion.






[jira] [Updated] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18816:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Assignee: Alex Bozarth
>Priority: Blocker
> Fix For: 2.1.0
>
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).






[jira] [Updated] (SPARK-18811) Stream Source resolution should happen in StreamExecution thread, not main thread

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18811:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> Stream Source resolution should happen in StreamExecution thread, not main 
> thread
> -
>
> Key: SPARK-18811
> URL: https://issues.apache.org/jira/browse/SPARK-18811
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.0
>
>
> When you start a stream and we need to resolve the source of the stream, for 
> example to resolve partition columns, this could take a long time. This long 
> execution time should not block the main thread on which `query.start()` was 
> called. It should happen in the stream execution thread, possibly before 
> starting any triggers.






[jira] [Updated] (SPARK-18760) Provide consistent format output for all file formats

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18760:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> Provide consistent format output for all file formats
> -
>
> Key: SPARK-18760
> URL: https://issues.apache.org/jira/browse/SPARK-18760
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> We currently rely on FileFormat implementations to override toString in order 
> to get a proper explain output. It'd be better to just depend on shortName 
> for those.
> Before:
> {noformat}
> scala> spark.read.text("test.text").explain()
> == Physical Plan ==
> *FileScan text [value#15] Batched: false, Format: 
> org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: 
> InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}
> After:
> {noformat}
> scala> spark.read.text("test.text").explain()
> == Physical Plan ==
> *FileScan text [value#15] Batched: false, Format: text, Location: 
> InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}






[jira] [Updated] (SPARK-18325) SparkR 2.1 QA: Check for new R APIs requiring example code

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18325:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> SparkR 2.1 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-18325
> URL: https://issues.apache.org/jira/browse/SPARK-18325
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
> Fix For: 2.1.0
>
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").






[jira] [Updated] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18590:

Fix Version/s: (was: 2.1.1)
   2.1.0

> R - Include package vignettes and help pages, build source package in Spark 
> distribution
> 
>
> Key: SPARK-18590
> URL: https://issues.apache.org/jira/browse/SPARK-18590
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.0
>
>
> We should include in Spark distribution the built source package for SparkR. 
> This will enable help and vignettes when the package is used. Also this 
> source package is what we would release to CRAN.






[jira] [Updated] (SPARK-18815) NPE when collecting column stats for string/binary column having only null values

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18815:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> NPE when collecting column stats for string/binary column having only null 
> values
> -
>
> Key: SPARK-18815
> URL: https://issues.apache.org/jira/browse/SPARK-18815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.1.0
>
>







[jira] [Updated] (SPARK-18794) SparkR vignette update: gbt

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18794:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> SparkR vignette update: gbt
> ---
>
> Key: SPARK-18794
> URL: https://issues.apache.org/jira/browse/SPARK-18794
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
> Fix For: 2.1.0
>
>
> Update vignettes to cover gradient boosted trees






[jira] [Updated] (SPARK-18795) SparkR vignette update: ksTest

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18795:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> SparkR vignette update: ksTest
> --
>
> Key: SPARK-18795
> URL: https://issues.apache.org/jira/browse/SPARK-18795
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 2.1.0
>
>
> Update vignettes to cover ksTest






[jira] [Updated] (SPARK-18807) Should suppress output print for calls to JVM methods with void return values

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18807:

Fix Version/s: (was: 2.1.1)
   2.1.0

> Should suppress output print for calls to JVM methods with void return values
> -
>
> Key: SPARK-18807
> URL: https://issues.apache.org/jira/browse/SPARK-18807
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.1.0
>
>
> Several SparkR APIs that call into JVM methods with void return values get 
> their results printed out, especially when running in a REPL or IDE.
> example:
> > setLogLevel("WARN")
> NULL
> We should fix this to make the result clearer.






[jira] [Updated] (SPARK-18628) Update handle invalid documentation string

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18628:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> Update handle invalid documentation string
> --
>
> Key: SPARK-18628
> URL: https://issues.apache.org/jira/browse/SPARK-18628
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: Krishna Kalyan
>Priority: Trivial
>  Labels: starter
> Fix For: 2.1.0
>
>
> The handleInvalid parameter documentation string currently doesn't have 
> quotes around the options. After SPARK-18366 is in, it would be good to 
> update both the Scala param and Python param to have quotes around the 
> options, making it easier for users to read.






[jira] [Updated] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18810:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> SparkR install.spark does not work for RCs, snapshots
> -
>
> Key: SPARK-18810
> URL: https://issues.apache.org/jira/browse/SPARK-18810
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
> Fix For: 2.1.0
>
>
> We publish source archives of the SparkR package now in RCs and in nightly 
> snapshot builds. One of the problems that still remains is that 
> `install.spark` does not work for these as it looks for the final Spark 
> version to be present in the apache download mirrors.






[jira] [Updated] (SPARK-18774) Ignore non-existing files when ignoreCorruptFiles is enabled

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18774:

Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> Ignore non-existing files when ignoreCorruptFiles is enabled
> 
>
> Key: SPARK-18774
> URL: https://issues.apache.org/jira/browse/SPARK-18774
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.0
>
>







[jira] [Updated] (SPARK-18790) Keep a general offset history of stream batches

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18790:

Fix Version/s: (was: 2.1.1)
   2.1.0

> Keep a general offset history of stream batches
> ---
>
> Key: SPARK-18790
> URL: https://issues.apache.org/jira/browse/SPARK-18790
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tyson Condie
>Assignee: Tyson Condie
> Fix For: 2.0.3, 2.1.0
>
>
> Instead of only keeping the minimum number of offsets around, we should keep 
> enough information to allow us to roll back n batches and reexecute the 
> stream starting from a given point. In particular, we should create a config 
> in SQLConf, spark.sql.streaming.retainedBatches that defaults to 100 and 
> ensure that we keep enough log files in the following places to roll back the 
> specified number of batches:
> * the offsets that are present in each batch
> * versions of the state store
> * the file lists stored for the FileStreamSource
> * the metadata log stored by the FileStreamSink






[jira] [Updated] (SPARK-18745) java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18745:

Fix Version/s: (was: 2.1.1)
   2.1.0

> java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)
> -
>
> Key: SPARK-18745
> URL: https://issues.apache.org/jira/browse/SPARK-18745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: JESSE CHEN
>Assignee: Kazuaki Ishizaki
>Priority: Critical
> Fix For: 2.0.3, 2.1.0
>
>
> Running query 68 with decreased executor memory (using 12GB executors instead 
> of 24GB) on a 100TB Parquet database, using the Spark master dated 11/04, gave 
> an IndexOutOfBoundsException.
> The query is as follows:
> {noformat}
> [select  c_last_name
>,c_first_name
>,ca_city
>,bought_city
>,ss_ticket_number
>,extended_price
>,extended_tax
>,list_price
>  from (select ss_ticket_number
>  ,ss_customer_sk
>  ,ca_city bought_city
>  ,sum(ss_ext_sales_price) extended_price 
>  ,sum(ss_ext_list_price) list_price
>  ,sum(ss_ext_tax) extended_tax 
>from store_sales
>,date_dim
>,store
>,household_demographics
>,customer_address 
>where store_sales.ss_sold_date_sk = date_dim.d_date_sk
>  and store_sales.ss_store_sk = store.s_store_sk  
> and store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
> and store_sales.ss_addr_sk = customer_address.ca_address_sk
> and date_dim.d_dom between 1 and 2 
> and (household_demographics.hd_dep_count = 8 or
>  household_demographics.hd_vehicle_count= -1)
> and date_dim.d_year in (2000,2000+1,2000+2)
> and store.s_city in ('Plainview','Rogers')
>group by ss_ticket_number
>,ss_customer_sk
>,ss_addr_sk,ca_city) dn
>   ,customer
>   ,customer_address current_addr
>  where ss_customer_sk = c_customer_sk
>and customer.c_current_addr_sk = current_addr.ca_address_sk
>and current_addr.ca_city <> bought_city
>  order by c_last_name
>  ,ss_ticket_number
>   limit 100]
> {noformat}
> Spark output that showed the exception:
> {noformat}
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at 
> org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
>   at 
> org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecuteBroadcast(Exchange.scala:61)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68)
>   at 
> 

[jira] [Updated] (SPARK-16589) Chained cartesian produces incorrect number of records

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16589:

Fix Version/s: (was: 2.1.1)
   2.1.0

> Chained cartesian produces incorrect number of records
> --
>
> Key: SPARK-16589
> URL: https://issues.apache.org/jira/browse/SPARK-16589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Andrew Ray
>  Labels: correctness
> Fix For: 2.0.3, 2.1.0
>
>
> Chaining cartesian calls in PySpark results in fewer records than expected. 
> It can be reproduced as follows:
> {code}
> rdd = sc.parallelize(range(10), 1)
> rdd.cartesian(rdd).cartesian(rdd).count()
> ## 355
> rdd.cartesian(rdd).cartesian(rdd).distinct().count()
> ## 251
> {code}
> It looks like it is related to serialization. If we reserialize after the 
> initial cartesian:
> {code}
> rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), 
> 1)).cartesian(rdd).count()
> ## 1000
> {code}
> or insert identity map:
> {code}
> rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count()
> ## 1000
> {code}
> it yields correct results.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18843) Fix timeout in awaitResultInForkJoinSafely

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18843:

Fix Version/s: (was: 2.1.1)
   2.1.0

> Fix timeout in awaitResultInForkJoinSafely
> --
>
> Key: SPARK-18843
> URL: https://issues.apache.org/jira/browse/SPARK-18843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.3, 2.1.0
>
>
> Master has the fix in https://github.com/apache/spark/pull/16230. However, 
> since we won't merge that PR into the 2.0 and 2.1 branches because it's too 
> risky, we should at least fix the timeout value for 2.0 and 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18856) Newly created catalog table assumed to have 0 rows and 0 bytes

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18856.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.1.0

> Newly created catalog table assumed to have 0 rows and 0 bytes
> --
>
> Key: SPARK-18856
> URL: https://issues.apache.org/jira/browse/SPARK-18856
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.1.0
>
>
> {code}
> scala> spark.range(100).selectExpr("id % 10 p", 
> "id").write.partitionBy("p").format("json").saveAsTable("testjson")
> scala> spark.table("testjson").queryExecution.optimizedPlan.statistics
> res6: org.apache.spark.sql.catalyst.plans.logical.Statistics = 
> Statistics(sizeInBytes=0, isBroadcastable=false)
> {code}
> It shouldn't be 0. The issue is that in DataSource.scala, we do:
> {code}
> val fileCatalog = if 
> (sparkSession.sqlContext.conf.manageFilesourcePartitions &&
> catalogTable.isDefined && 
> catalogTable.get.tracksPartitionsInCatalog) {
>   new CatalogFileIndex(
> sparkSession,
> catalogTable.get,
> catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(0L))
> } else {
>   new InMemoryFileIndex(sparkSession, globbedPaths, options, 
> Some(partitionSchema))
> }
> {code}
> We shouldn't use 0L as the fallback.
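
A hedged, self-contained sketch of the direction being suggested (this is not the code that was merged): an absent statistic should fall back to a conservative default such as spark.sql.defaultSizeInBytes rather than 0.

{code}
// Illustrative only: fall back to a conservative estimate, not 0L, when the
// catalog table carries no statistics.
def tableSizeForPlanning(statsSizeInBytes: Option[BigInt], defaultSizeInBytes: Long): Long =
  statsSizeInBytes.map(_.toLong).getOrElse(defaultSizeInBytes) // instead of .getOrElse(0L)

// An unanalyzed table gets the conservative default (spark.sql.defaultSizeInBytes
// is Long.MaxValue out of the box), so it is never treated as trivially small.
println(tableSizeForPlanning(None, Long.MaxValue))                  // 9223372036854775807
println(tableSizeForPlanning(Some(BigInt(123456)), Long.MaxValue))  // 123456
{code}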



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18875:


Assignee: Apache Spark

> Fix R API doc generation by adding `DESCRIPTION` file
> -
>
> Key: SPARK-18875
> URL: https://issues.apache.org/jira/browse/SPARK-18875
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SparkR
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Since 1.4.0, the R API documentation index page has had a broken link to the 
> `DESCRIPTION` file. This issue aims to fix that.
> * Official Latest Website: 
> http://spark.apache.org/docs/latest/api/R/index.html
> * Apache Spark 2.1.0-rc2: 
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18875:


Assignee: (was: Apache Spark)

> Fix R API doc generation by adding `DESCRIPTION` file
> -
>
> Key: SPARK-18875
> URL: https://issues.apache.org/jira/browse/SPARK-18875
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SparkR
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since 1.4.0, the R API documentation index page has had a broken link to the 
> `DESCRIPTION` file. This issue aims to fix that.
> * Official Latest Website: 
> http://spark.apache.org/docs/latest/api/R/index.html
> * Apache Spark 2.1.0-rc2: 
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file

2016-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750404#comment-15750404
 ] 

Apache Spark commented on SPARK-18875:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16292

> Fix R API doc generation by adding `DESCRIPTION` file
> -
>
> Key: SPARK-18875
> URL: https://issues.apache.org/jira/browse/SPARK-18875
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SparkR
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since 1.4.0, the R API documentation index page has had a broken link to the 
> `DESCRIPTION` file. This issue aims to fix that.
> * Official Latest Website: 
> http://spark.apache.org/docs/latest/api/R/index.html
> * Apache Spark 2.1.0-rc2: 
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file

2016-12-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18875:
--
Description: 
Since 1.4.0, R API document index page has a broken link on `DESCRIPTION file`. 
This issue aims to fix that.

* Official Latest Website: http://spark.apache.org/docs/latest/api/R/index.html
* Apache Spark 2.1.0-rc2: 
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html


  was:
Currently, R API document index page has a broken link on `DESCRIPTION file`. 
This issue aims to fix that.

* Official Latest Website: http://spark.apache.org/docs/latest/api/R/index.html
* Apache Spark 2.1.0-rc2: 
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html



> Fix R API doc generation by adding `DESCRIPTION` file
> -
>
> Key: SPARK-18875
> URL: https://issues.apache.org/jira/browse/SPARK-18875
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SparkR
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since 1.4.0, the R API documentation index page has had a broken link to the 
> `DESCRIPTION` file. This issue aims to fix that.
> * Official Latest Website: 
> http://spark.apache.org/docs/latest/api/R/index.html
> * Apache Spark 2.1.0-rc2: 
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file

2016-12-14 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-18875:
-

 Summary: Fix R API doc generation by adding `DESCRIPTION` file
 Key: SPARK-18875
 URL: https://issues.apache.org/jira/browse/SPARK-18875
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SparkR
Affects Versions: 2.0.2, 1.6.3
Reporter: Dongjoon Hyun
Priority: Minor


Currently, the R API documentation index page has a broken link to the 
`DESCRIPTION` file. This issue aims to fix that.

* Official Latest Website: http://spark.apache.org/docs/latest/api/R/index.html
* Apache Spark 2.1.0-rc2: 
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-14 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750281#comment-15750281
 ] 

Liang-Chi Hsieh commented on SPARK-18281:
-

[~mwdus...@us.ibm.com] BTW, I updated the fix; if you have time to test 
it again, that would be great. Thank you.

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>
> I run the example straight out of the api docs for toLocalIterator and it 
> gives a time out exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize 500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-14 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750246#comment-15750246
 ] 

Liang-Chi Hsieh commented on SPARK-18281:
-

[~mwdus...@us.ibm.com] Thanks for this test case! It is useful to me. However, I 
needed to increase the number of partitions to 1000 to reproduce this issue.

The additional partitions increase the time it takes to materialize the RDD 
elements and so cause the timeout.

I don't think we can put a timeout on the socket read operation, as we currently 
do, because the RDD materialization time is unpredictable. I will keep the 
connection timeout untouched but unset the timeout for socket reads. 

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>
> I run the example straight out of the api docs for toLocalIterator and it 
> gives a time out exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize 500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2016-12-14 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750217#comment-15750217
 ] 

Sital Kedia commented on SPARK-18838:
-

[~zsxwing] - It's not only the ExecutorAllocationManager; other critical 
listeners like HeartbeatReceiver also depend on it. In addition, there 
might be some latency-sensitive user-added listeners. Making the event 
processing faster through multi-threading will fix all these issues. I have an 
initial version of the PR for this; I would appreciate it if you could take a 
look and give feedback on the overall design. 

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on `ListenerBus` events, and these delays are causing job failures. For example, 
> a significant delay in receiving `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to remove an executor that is not idle. The event 
> processor in `ListenerBus` is a single thread that loops through all the 
> listeners for each event and processes each event synchronously: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
> The single-threaded processor often becomes the bottleneck for large jobs. 
> In addition, if one of the listeners is very slow, all the listeners pay the 
> price of the delay incurred by the slow listener.
> To solve the above problems, we plan to have a per-listener single-threaded 
> executor service and a separate event queue. That way we are not bottlenecked 
> by the single-threaded event processor, and critical listeners will not be 
> penalized by slow listeners. The downside of this approach is that a separate 
> event queue per listener will increase the driver memory footprint. 
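
A rough, illustrative-only Scala sketch of the per-listener queue idea described above; `Event` and `Listener` are stand-ins, not Spark's real SparkListener types, and the actual PR will differ.

{code}
import java.util.concurrent.{Executors, LinkedBlockingQueue}

trait Event
trait Listener { def onEvent(e: Event): Unit }

// Each listener gets its own unbounded queue and its own single-threaded executor,
// so a slow listener can no longer delay the others (at the cost of extra memory).
class PerListenerQueue(listener: Listener) {
  private val queue = new LinkedBlockingQueue[Event]()
  private val executor = Executors.newSingleThreadExecutor()

  executor.submit(new Runnable {
    override def run(): Unit = {
      try {
        while (true) listener.onEvent(queue.take()) // blocks until an event arrives
      } catch {
        case _: InterruptedException => // stop() was called; exit the loop
      }
    }
  })

  def post(event: Event): Unit = queue.put(event)
  def stop(): Unit = executor.shutdownNow()
}

class MultiQueueBus(listeners: Seq[Listener]) {
  private val queues = listeners.map(new PerListenerQueue(_))
  def post(event: Event): Unit = queues.foreach(_.post(event)) // fan out, no shared bottleneck
  def stop(): Unit = queues.foreach(_.stop())
}
{code}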



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18838) High latency of event processing for large jobs

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18838:


Assignee: (was: Apache Spark)

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on `ListenerBus` events, and these delays are causing job failures. For example, 
> a significant delay in receiving `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to remove an executor that is not idle. The event 
> processor in `ListenerBus` is a single thread that loops through all the 
> listeners for each event and processes each event synchronously: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
> The single-threaded processor often becomes the bottleneck for large jobs. 
> In addition, if one of the listeners is very slow, all the listeners pay the 
> price of the delay incurred by the slow listener.
> To solve the above problems, we plan to have a per-listener single-threaded 
> executor service and a separate event queue. That way we are not bottlenecked 
> by the single-threaded event processor, and critical listeners will not be 
> penalized by slow listeners. The downside of this approach is that a separate 
> event queue per listener will increase the driver memory footprint. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2016-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750209#comment-15750209
 ] 

Apache Spark commented on SPARK-18838:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/16291

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on `ListenerBus` events, and these delays are causing job failures. For example, 
> a significant delay in receiving `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to remove an executor that is not idle. The event 
> processor in `ListenerBus` is a single thread that loops through all the 
> listeners for each event and processes each event synchronously: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
> The single-threaded processor often becomes the bottleneck for large jobs. 
> In addition, if one of the listeners is very slow, all the listeners pay the 
> price of the delay incurred by the slow listener.
> To solve the above problems, we plan to have a per-listener single-threaded 
> executor service and a separate event queue. That way we are not bottlenecked 
> by the single-threaded event processor, and critical listeners will not be 
> penalized by slow listeners. The downside of this approach is that a separate 
> event queue per listener will increase the driver memory footprint. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18838) High latency of event processing for large jobs

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18838:


Assignee: Apache Spark

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Apache Spark
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on `ListenerBus` events, and these delays are causing job failures. For example, 
> a significant delay in receiving `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to remove an executor that is not idle. The event 
> processor in `ListenerBus` is a single thread that loops through all the 
> listeners for each event and processes each event synchronously: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
> The single-threaded processor often becomes the bottleneck for large jobs. 
> In addition, if one of the listeners is very slow, all the listeners pay the 
> price of the delay incurred by the slow listener.
> To solve the above problems, we plan to have a per-listener single-threaded 
> executor service and a separate event queue. That way we are not bottlenecked 
> by the single-threaded event processor, and critical listeners will not be 
> penalized by slow listeners. The downside of this approach is that a separate 
> event queue per listener will increase the driver memory footprint. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase

2016-12-14 Thread Nattavut Sutyanyong (JIRA)
Nattavut Sutyanyong created SPARK-18874:
---

 Summary: First phase: Deferring the correlated predicate pull up 
to Optimizer phase
 Key: SPARK-18874
 URL: https://issues.apache.org/jira/browse/SPARK-18874
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Nattavut Sutyanyong


This JIRA implements the first phase of SPARK-18455 by deferring the correlated 
predicate pull-up from the Analyzer to the Optimizer. The goal is to preserve the 
current subquery functionality of Spark 2.0 (if a query works, it continues to 
work after this JIRA; if it does not, it still won't). The performance of subquery 
processing is expected to be on par with Spark 2.0.

The representation of the LogicalPlan after the Analyzer will be different after 
this JIRA in that it will preserve the original positions of correlated predicates 
in a subquery. This new representation is preparation work for the second phase, 
which extends correlated subquery support to cases Spark 2.0 does not handle, 
such as deep correlation and outer references in the SELECT clause.
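
For context, a hypothetical query of the shape the second phase targets (the tables below are made up): the outer reference t1.a sits inside the subquery's SELECT clause, which Spark 2.0 rejects.

{code}
import spark.implicits._

// Made-up data, just to give the query something to bind to.
Seq((1, 10), (2, 20)).toDF("a", "c").createOrReplaceTempView("t1")
Seq((5, 10), (9, 20)).toDF("b", "c").createOrReplaceTempView("t2")

// Outer reference t1.a inside the subquery's SELECT clause -- a shape the first
// phase merely preserves in the plan and the second phase aims to support.
spark.sql("""
  SELECT t1.a,
         (SELECT MAX(t2.b) + t1.a FROM t2 WHERE t2.c = t1.c) AS adjusted_max
  FROM t1
""").show()
{code}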



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18873) New test cases for scalar subquery

2016-12-14 Thread Nattavut Sutyanyong (JIRA)
Nattavut Sutyanyong created SPARK-18873:
---

 Summary: New test cases for scalar subquery
 Key: SPARK-18873
 URL: https://issues.apache.org/jira/browse/SPARK-18873
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Reporter: Nattavut Sutyanyong


This JIRA is for submitting a PR for new test cases on scalar subquery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18872) New test cases for EXISTS subquery

2016-12-14 Thread Nattavut Sutyanyong (JIRA)
Nattavut Sutyanyong created SPARK-18872:
---

 Summary: New test cases for EXISTS subquery
 Key: SPARK-18872
 URL: https://issues.apache.org/jira/browse/SPARK-18872
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Reporter: Nattavut Sutyanyong


This JIRA is for submitting a PR for new EXISTS/NOT EXISTS subquery test cases. 
It follows the same idea as the IN subquery test cases, which start from simple 
patterns and then build more complex constructs on both the parent and subquery 
sides. This batch of test cases is mostly, if not entirely, positive test cases 
that do not hit syntax errors or unsupported functionality. We make an effort to 
have test cases that return rows in the result set so that they can indirectly 
detect incorrect-result problems.
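
A hypothetical example of the kind of positive EXISTS case described above (the tables, columns and data are made up), built so that rows come back and a wrong result would be visible:

{code}
import spark.implicits._

// Made-up parent/child tables for illustration.
Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v").createOrReplaceTempView("parent")
Seq((1, 10), (3, 30)).toDF("id", "amt").createOrReplaceTempView("child")

// A simple correlated EXISTS with an extra local predicate in the subquery.
spark.sql("""
  SELECT p.id, p.v
  FROM parent p
  WHERE EXISTS (SELECT 1 FROM child c WHERE c.id = p.id AND c.amt > 0)
""").show()
// Expected: the rows for id = 1 and id = 3
{code}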



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-14 Thread Rishi Kamaleswaran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750163#comment-15750163
 ] 

Rishi Kamaleswaran commented on SPARK-18699:


Thanks for the reply! Unfortunately, neither of those options works in my case.


> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Jakub Nowacki
>
> If a CSV file is read and the schema contains any type other than String, an 
> exception is thrown when a string value in the CSV is malformed; e.g. if a 
> timestamp does not match the defined format, an exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException
>   at java.sql.Date.valueOf(Date.java:143)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at scala.util.Try.getOrElse(Try.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding, the PERMISSIVE and DROPMALFORMED modes should just null 
> the value or drop the line, but instead they kill the job.
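
For reference, a hedged Scala sketch of the behaviour the reporter expects; the file, schema and data below are made up, and on the affected versions the malformed row fails the job instead of being dropped or nulled.

{code}
import java.nio.file.Files
import org.apache.spark.sql.types._

// Hypothetical input with one well-formed and one malformed timestamp.
val path = Files.createTempFile("csv-mode-demo", ".csv")
Files.write(path, java.util.Arrays.asList(
  "id,created",
  "1,2016-12-14 10:00:00",
  "2,not-a-timestamp"))

val schema = new StructType()
  .add("id", IntegerType)
  .add("created", TimestampType)

// Expectation: DROPMALFORMED drops the bad line, PERMISSIVE nulls the bad field.
// This ticket reports that the malformed value instead throws at execution time.
spark.read
  .schema(schema)
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv(path.toString)
  .show()
{code}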



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-14 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750158#comment-15750158
 ] 

Liang-Chi Hsieh commented on SPARK-18281:
-

Hi [~holdenk], what did you mean by "we immediately do a foreach on the Scala 
iterator which is somewhat strange"?

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>
> I run the example straight out of the api docs for toLocalIterator and it 
> gives a time out exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize 500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18871) New test cases for IN subquery

2016-12-14 Thread Nattavut Sutyanyong (JIRA)
Nattavut Sutyanyong created SPARK-18871:
---

 Summary: New test cases for IN subquery
 Key: SPARK-18871
 URL: https://issues.apache.org/jira/browse/SPARK-18871
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Reporter: Nattavut Sutyanyong


This JIRA is open for submitting a PR with new test cases for IN/NOT IN 
subqueries. We plan to put approximately 100+ test cases under 
`SQLQueryTestSuite`. The test cases range from IN/NOT IN subqueries with a simple 
SELECT in both the parent and the subquery to subqueries with more complex 
constructs on both sides (joins, aggregates, etc.). The test data include null 
values and duplicate values. 
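
A hypothetical illustration (tables and data are made up) of why null values are called out above: a single NULL in the subquery result changes what NOT IN returns.

{code}
import spark.implicits._

Seq(1, 2, 3).toDF("a").createOrReplaceTempView("l")
Seq(Some(1), None).toDF("b").createOrReplaceTempView("r")  // note the NULL in column b

// Under standard SQL three-valued logic this returns no rows at all, because
// `a NOT IN (1, NULL)` is never true -- exactly the kind of subtle semantics
// these test cases are meant to pin down.
spark.sql("SELECT a FROM l WHERE a NOT IN (SELECT b FROM r)").show()
{code}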



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18455) General support for correlated subquery processing

2016-12-14 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750145#comment-15750145
 ] 

Nattavut Sutyanyong commented on SPARK-18455:
-

Quantified predicates are not planned for this work; they would be future work. 
Equality predicates with [ANY | ALL] could be transformed into other currently 
supported forms, but inequality predicates make the transformation more complex. 
Null values may not be a main hurdle, as the comparison operators (=, >, >=, <, <=, 
!=) are null-tolerant operators.
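
For illustration only (the tables below are made up), this is the kind of rewrite referred to above for an ALL predicate; it leans on the currently supported correlated, aggregated scalar subquery form and glosses over the empty-subquery and NULL corner cases that make the general transformation hard.

{code}
import spark.implicits._

// Made-up tables just for illustration.
Seq((1, 100), (2, 100)).toDF("a", "c").createOrReplaceTempView("t1")
Seq((0, 100)).toDF("b", "c").createOrReplaceTempView("t2")

// Quantified form, out of scope for this work (shown as a comment only):
//   SELECT * FROM t1 WHERE t1.a > ALL (SELECT t2.b FROM t2 WHERE t2.c = t1.c)

// A sketch of a rewrite into a supported correlated, aggregated scalar subquery
// with an equality correlation predicate.
spark.sql("""
  SELECT * FROM t1
  WHERE t1.a > (SELECT MAX(t2.b) FROM t2 WHERE t2.c = t1.c)
""").show()
{code}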

> General support for correlated subquery processing
> --
>
> Key: SPARK-18455
> URL: https://issues.apache.org/jira/browse/SPARK-18455
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Nattavut Sutyanyong
> Attachments: SPARK-18455-scoping-doc.pdf
>
>
> Subquery support has been introduced in Spark 2.0. The initial implementation 
> covers the most common subquery use case: the ones used in TPC queries for 
> instance.
> Spark currently supports the following subqueries:
> * Uncorrelated Scalar Subqueries. All cases are supported.
> * Correlated Scalar Subqueries. We only allow subqueries that are aggregated 
> and use equality predicates.
> * Predicate Subqueries. IN or Exists type of queries. We allow most 
> predicates, except when they are pulled from under an Aggregate or Window 
> operator. In that case we only support equality predicates.
> However this does not cover the full range of possible subqueries. This, in 
> part, has to do with the fact that we currently rewrite all correlated 
> subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join.
> We currently lack support for the following use cases:
> * The use of predicate subqueries in a projection.
> * The use of non-equality predicates below Aggregates and/or Window operators.
> * The use of non-Aggregate subqueries for correlated scalar subqueries.
> This JIRA aims to lift these current limitations in subquery processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18455) General support for correlated subquery processing

2016-12-14 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750128#comment-15750128
 ] 

Nattavut Sutyanyong edited comment on SPARK-18455 at 12/15/16 2:17 AM:
---

I have attached a scoping document of this work to the record.


was (Author: nsyca):
Scoping document

> General support for correlated subquery processing
> --
>
> Key: SPARK-18455
> URL: https://issues.apache.org/jira/browse/SPARK-18455
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Nattavut Sutyanyong
> Attachments: SPARK-18455-scoping-doc.pdf
>
>
> Subquery support has been introduced in Spark 2.0. The initial implementation 
> covers the most common subquery use case: the ones used in TPC queries for 
> instance.
> Spark currently supports the following subqueries:
> * Uncorrelated Scalar Subqueries. All cases are supported.
> * Correlated Scalar Subqueries. We only allow subqueries that are aggregated 
> and use equality predicates.
> * Predicate Subqueries. IN or Exists type of queries. We allow most 
> predicates, except when they are pulled from under an Aggregate or Window 
> operator. In that case we only support equality predicates.
> However this does not cover the full range of possible subqueries. This, in 
> part, has to do with the fact that we currently rewrite all correlated 
> subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join.
> We currently lack support for the following use cases:
> * The use of predicate subqueries in a projection.
> * The use of non-equality predicates below Aggregates and/or Window operators.
> * The use of non-Aggregate subqueries for correlated scalar subqueries.
> This JIRA aims to lift these current limitations in subquery processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18455) General support for correlated subquery processing

2016-12-14 Thread Nattavut Sutyanyong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nattavut Sutyanyong updated SPARK-18455:

Attachment: SPARK-18455-scoping-doc.pdf

Scoping document

> General support for correlated subquery processing
> --
>
> Key: SPARK-18455
> URL: https://issues.apache.org/jira/browse/SPARK-18455
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Nattavut Sutyanyong
> Attachments: SPARK-18455-scoping-doc.pdf
>
>
> Subquery support has been introduced in Spark 2.0. The initial implementation 
> covers the most common subquery use case: the ones used in TPC queries for 
> instance.
> Spark currently supports the following subqueries:
> * Uncorrelated Scalar Subqueries. All cases are supported.
> * Correlated Scalar Subqueries. We only allow subqueries that are aggregated 
> and use equality predicates.
> * Predicate Subqueries. IN or Exists type of queries. We allow most 
> predicates, except when they are pulled from under an Aggregate or Window 
> operator. In that case we only support equality predicates.
> However this does not cover the full range of possible subqueries. This, in 
> part, has to do with the fact that we currently rewrite all correlated 
> subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join.
> We currently lack support for the following use cases:
> * The use of predicate subqueries in a projection.
> * The use of non-equality predicates below Aggregates and/or Window operators.
> * The use of non-Aggregate subqueries for correlated scalar subqueries.
> This JIRA aims to lift these current limitations in subquery processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18861) Spark-SQL inconsistent behavior with "struct" expressions

2016-12-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-18861.
--
Resolution: Not A Problem

I see. This actually seems to be behaviour documented in 
https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1196-L1200

The cases other than the first one seem to work as expected in the 
current master:

{code}
scala> Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c").createOrReplaceTempView("t1")

scala> sql("SELECT case when a>b then struct(cast(a as int), cast(b as int)) else struct(cast(c as int), cast(c as int)) end from t1").show()
+-+
|CASE WHEN (a > b) THEN named_struct(col1, CAST(a AS INT), col2, CAST(b AS INT)) ELSE named_struct(col1, CAST(c AS INT), col2, CAST(c AS INT)) END|
+-+
|[3,3]|
|[4,4]|
+-+
{code}

I am resolving this as {{Not A Problem}} since the issue seems obsolete to me.
Please reopen it if anyone feels this is an inappropriate action.

> Spark-SQL inconsistent behavior with "struct" expressions
> -
>
> Key: SPARK-18861
> URL: https://issues.apache.org/jira/browse/SPARK-18861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ohad Raviv
>
> We are getting strangely inconsistent behavior with expressions involving 
> "struct". Let's start with this simple table:
> {quote}
> Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c").createOrReplaceTempView("t1")
> sql("desc t1").show()
> {quote}
> Then we get this DF:
> {quote}
> |col_name|data_type|comment|
> |   a|  int|   |
> |   b|  int|   |
> |   c|  int|   |
> {quote}
> Now, although we can clearly see that all the fields are of type int, when we 
> run:
> {quote}
> sql("SELECT case when a>b then struct(a,b) else struct(c,c) end from t1")
> {quote}
> we get this error:
> {quote}
> org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (t1.`a` > 
> t1.`b`) THEN struct(t1.`a`, t1.`b`) ELSE struct(t1.`c`, t1.`c`) END' due to 
> data type mismatch: THEN and ELSE expressions should all be same type or 
> coercible to a common type; line 1 pos 7
> {quote}
> if we try this:
> {quote}
> sql("SELECT case when a>b then struct(cast(a as int), cast(b as int)) else 
> struct(cast(c as int), cast(c as int)) end from t1")
> {quote}
> we get another exception:
> {quote}
> requirement failed: Unresolved attributes found when constructing 
> LocalRelation.
> java.lang.IllegalArgumentException: requirement failed: Unresolved attributes 
> found when constructing LocalRelation.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LocalRelation.(LocalRelation.scala:49)
> {quote}
> However, these do work:
> {quote}
> sql("SELECT case when a>b then struct(cast(a as double), cast(b as double)) 
> else struct(cast(c as double), cast(c as double)) end from t1")
> sql("SELECT case when a>b then struct(cast(a as string), cast(b as string)) 
> else struct(cast(c as string), cast(c as string)) end from t1")
> {quote}
> any ideas?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750076#comment-15750076
 ] 

Apache Spark commented on SPARK-18817:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/16290

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.
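
For reference, the directory in question is governed by spark.sql.warehouse.dir; a hedged Scala sketch of pointing it at a throwaway temporary directory (the SparkR change would do the equivalent via R's tempdir()):

{code}
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Put the warehouse under a temp directory instead of the current working directory.
val tmpWarehouse = Files.createTempDirectory("spark-warehouse-").toString

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("warehouse-in-tempdir")
  .config("spark.sql.warehouse.dir", tmpWarehouse)
  .getOrCreate()

println(spark.conf.get("spark.sql.warehouse.dir"))
{code}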



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18870) Distinct aggregates give incorrect answers on streaming dataframes

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18870:


Assignee: Tathagata Das  (was: Apache Spark)

> Distinct aggregates give incorrect answers on streaming dataframes
> --
>
> Key: SPARK-18870
> URL: https://issues.apache.org/jira/browse/SPARK-18870
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> The unsupported operations checking doesn't check whether an AggregationExpression 
> has isDistinct=true, so `streamingDf.groupBy().agg(countDistinct("key"))` gives 
> incorrect results.
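
A hedged sketch of what the missing check amounts to; the names below are stand-ins built on Catalyst's internal APIs, not the real UnsupportedOperationChecker code.

{code}
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Illustrative only: walk the streaming plan's expressions and refuse any
// aggregate marked distinct, instead of silently producing wrong answers.
def assertNoDistinctAggregates(plan: LogicalPlan): Unit = {
  if (plan.isStreaming) {
    plan.foreach { node =>
      node.expressions.foreach { expr =>
        expr.foreach {
          case a: AggregateExpression if a.isDistinct =>
            throw new UnsupportedOperationException(
              "Distinct aggregations are not supported on streaming DataFrames/Datasets")
          case _ =>
        }
      }
    }
  }
}
{code}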



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18870) Distinct aggregates give incorrect answers on streaming dataframes

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18870:


Assignee: Apache Spark  (was: Tathagata Das)

> Distinct aggregates give incorrect answers on streaming dataframes
> --
>
> Key: SPARK-18870
> URL: https://issues.apache.org/jira/browse/SPARK-18870
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Blocker
>
> The unsupported operations checking doesn't check whether an AggregationExpression 
> has isDistinct=true, so `streamingDf.groupBy().agg(countDistinct("key"))` gives 
> incorrect results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18870) Distinct aggregates give incorrect answers on streaming dataframes

2016-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750061#comment-15750061
 ] 

Apache Spark commented on SPARK-18870:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16289

> Distinct aggregates give incorrect answers on streaming dataframes
> --
>
> Key: SPARK-18870
> URL: https://issues.apache.org/jira/browse/SPARK-18870
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> The unsupported operations checking doesn't check whether an AggregationExpression 
> has isDistinct=true, so `streamingDf.groupBy().agg(countDistinct("key"))` gives 
> incorrect results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18783) ML StringIndexer does not work with nested fields

2016-12-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-18783.
-
Resolution: Won't Fix

> ML StringIndexer does not work with nested fields
> -
>
> Key: SPARK-18783
> URL: https://issues.apache.org/jira/browse/SPARK-18783
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: manuel garrido
>
> Using StringIndexer.transform with a nested field (from parsing json data) 
> results in the output dataframe not having the new column.
> {code}
> sample = [
>  {'city': u'',
>   'device': {u'make': u'HTC',
>u'os': u'Android'}
>  },
>  {'city': u'Bangalore',
>   'device': {u'make': u'Xiaomi',
>u'os': u'Android'}
>  },
>  {'city': u'Overpelt',
>   'device': {u'make': u'Samsung',
>u'os': u'Android'}
>  }
> ]
> sample_df = sc.parallelize(sample).toDF()
> # First we use a StringIndexer with a non nested field
> city_indexer = StringIndexer(inputCol="city", outputCol="cityIndex", 
> handleInvalid="skip")
> city_indexed = city_indexer.fit(sample_df).transform(sample_df)
> print([i.asDict() for i in city_indexed.collect()])
> >>>[{'device': {u'make': u'HTC', u'os': u'Android'}, 'city': u'', 
> >>>'cityIndex': 0.0}, {'device': {u'make': u'Xiaomi', u'os': u'Android'}, 
> >>>'city': u'Bangalore', 'cityIndex': 2.0}, {'device': {u'make': u'Samsung', 
> >>>u'os': u'Android'}, 'city': u'Overpelt', 'cityIndex': 1.0}]
> # Now we try with a nested field
> os_indexer = StringIndexer(inputCol="device.os", outputCol="osIndex", 
> handleInvalid="skip")
> os_indexed = os_indexer.fit(sample_df).transform(sample_df)
> print([i.asDict() for i in os_indexed.collect()])
> >>>[{'device': {u'make': u'HTC', u'os': u'Android'}, 'city': u''}, {'device': 
> >>>{u'make': u'Xiaomi', u'os': u'Android'}, 'city': u'Bangalore'}, {'device': 
> >>>{u'make': u'Samsung', u'os': u'Android'}, 'city': u'Overpelt'}]  #===> we 
> >>>see the field osIndex is not showing up
> #If we rename the same field device.os as a flat field it works as expected
> os_indexer = StringIndexer(inputCol="device_os", outputCol="osIndex", 
> handleInvalid="skip")
> os_indexed = os_indexer.fit(
> sample_df.withColumn('device_os', col('device.os'))
> ).transform(
> sample_df.withColumn('device_os', col('device.os'))
> )
> print([i.asDict() for i in os_indexed.collect()])
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18783) ML StringIndexer does not work with nested fields

2016-12-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750057#comment-15750057
 ] 

Joseph K. Bradley commented on SPARK-18783:
---

I'd separate this into 2 issues:
1. nested fields (new feature)
2. silent failure during transform

For issue #1: I doubt we'll support nested fields soon, though it would be neat 
to have in the future.  One related issue is multi-column support: [SPARK-8418].

For issue #2:
This is because of a hack we did to allow PipelineModel.transform() to work 
without a label column.  During fitting, the StringIndexerModel would index the 
label.  But during prediction/transform, there would not be a label.  It's 
documented here: 
[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexerModel]

We ran into this issue here: [SPARK-8051].  Long term, we should think about 
adding a Param to PipelineStage to turn the stage on/off during fit/transform.  
That's a pretty awkward API, though, so we'll have to discuss it.

I'm going to close this since I don't think we'll add nesting in the near 
future, but we can continue the conversation as needed.  Thanks!

> ML StringIndexer does not work with nested fields
> -
>
> Key: SPARK-18783
> URL: https://issues.apache.org/jira/browse/SPARK-18783
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: manuel garrido
>
> Using StringIndexer.transform with a nested field (from parsing json data) 
> results in the output dataframe not having the new column.
> {code}
> sample = [
>  {'city': u'',
>   'device': {u'make': u'HTC',
>u'os': u'Android'}
>  },
>  {'city': u'Bangalore',
>   'device': {u'make': u'Xiaomi',
>u'os': u'Android'}
>  },
>  {'city': u'Overpelt',
>   'device': {u'make': u'Samsung',
>u'os': u'Android'}
>  }
> ]
> sample_df = sc.parallelize(sample).toDF()
> # First we use a StringIndexer with a non nested field
> city_indexer = StringIndexer(inputCol="city", outputCol="cityIndex", 
> handleInvalid="skip")
> city_indexed = city_indexer.fit(sample_df).transform(sample_df)
> print([i.asDict() for i in city_indexed.collect()])
> >>>[{'device': {u'make': u'HTC', u'os': u'Android'}, 'city': u'', 
> >>>'cityIndex': 0.0}, {'device': {u'make': u'Xiaomi', u'os': u'Android'}, 
> >>>'city': u'Bangalore', 'cityIndex': 2.0}, {'device': {u'make': u'Samsung', 
> >>>u'os': u'Android'}, 'city': u'Overpelt', 'cityIndex': 1.0}]
> # Now we try with a nested field
> os_indexer = StringIndexer(inputCol="device.os", outputCol="osIndex", 
> handleInvalid="skip")
> os_indexed = os_indexer.fit(sample_df).transform(sample_df)
> print([i.asDict() for i in os_indexed.collect()])
> >>>[{'device': {u'make': u'HTC', u'os': u'Android'}, 'city': u''}, {'device': 
> >>>{u'make': u'Xiaomi', u'os': u'Android'}, 'city': u'Bangalore'}, {'device': 
> >>>{u'make': u'Samsung', u'os': u'Android'}, 'city': u'Overpelt'}]  #===> we 
> >>>see the field osIndex is not showing up
> #If we rename the same field device.os as a flat field it works as expected
> os_indexer = StringIndexer(inputCol="device_os", outputCol="osIndex", 
> handleInvalid="skip")
> os_indexed = os_indexer.fit(
> sample_df.withColumn('device_os', col('device.os'))
> ).transform(
> sample_df.withColumn('device_os', col('device.os'))
> )
> print([i.asDict() for i in os_indexed.collect()])
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750050#comment-15750050
 ] 

Hyukjin Kwon commented on SPARK-18699:
--

BTW, maybe you could try to set {{nullValue}} to {{" "}} or set 
{{ignoreLeadingWhiteSpace}} and {{ignoreTrailingWhiteSpace}} to {{true}} for 
now.
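
A minimal sketch of that workaround on the read path; the file path and schema below 
are made-up placeholders:

{code}
// Illustrative sketch only: trim surrounding whitespace and treat a single
// space as null so that such cells are not parsed as malformed values.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-whitespace-sketch").getOrCreate()

val schema = new StructType()
  .add("id", LongType)
  .add("ts", TimestampType)

val df = spark.read
  .schema(schema)
  .option("nullValue", " ")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .csv("/path/to/data.csv")
{code}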

> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Jakub Nowacki
>
> If a CSV file is read with a schema that contains any type other than String, 
> an exception is thrown when the corresponding string value in the CSV is 
> malformed; e.g. if a timestamp does not match the expected format:
> {code}
> Caused by: java.lang.IllegalArgumentException
>   at java.sql.Date.valueOf(Date.java:143)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at scala.util.Try.getOrElse(Try.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding, the PERMISSIVE and DROPMALFORMED modes should just null the 
> value or drop the line, respectively, but instead they kill the job.
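
For reference, a sketch of how the parse mode is selected on read (the path and schema 
are again made up); per this report, on the affected versions a malformed non-String 
value still throws instead of being nulled (PERMISSIVE) or the row being dropped 
(DROPMALFORMED):

{code}
// Illustrative sketch only; the path and schema are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-mode-sketch").getOrCreate()
val schema = new StructType().add("id", LongType).add("ts", TimestampType)

// Expected: malformed "ts" values become null or the row is skipped;
// reported: the job fails with the exception shown above.
val df = spark.read
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .csv("/path/to/data.csv")
{code}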



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18795) SparkR vignette update: ksTest

2016-12-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750043#comment-15750043
 ] 

Joseph K. Bradley commented on SPARK-18795:
---

No problem, thanks for understanding.

> SparkR vignette update: ksTest
> --
>
> Key: SPARK-18795
> URL: https://issues.apache.org/jira/browse/SPARK-18795
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 2.1.1, 2.2.0
>
>
> Update vignettes to cover ksTest



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt

2016-12-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750041#comment-15750041
 ] 

Joseph K. Bradley commented on SPARK-18374:
---

Oh nice, I didn't realize that was in use.  I'll start doing that.

> Incorrect words in StopWords/english.txt
> 
>
> Key: SPARK-18374
> URL: https://issues.apache.org/jira/browse/SPARK-18374
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: nirav patel
>Assignee: yuhao yang
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.2.0
>
>
> I was just double-checking english.txt's list of stop words, as I felt it was 
> taking out valid tokens like 'won'. I think the issue is that the english.txt 
> list is missing the apostrophe character and all characters after the 
> apostrophe. So "won't" became "won" in that list; "wouldn't" is "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should have both styles, i.e. both "won't" and "wont" 
> should be part of english.txt, since some tokenizers might remove special 
> characters. But 'won' obviously shouldn't be in this list.
> Here's the list of Snowball English stop words:
> http://snowball.tartarus.org/algorithms/english/stop.txt
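
A quick way to inspect the bundled list from Scala (a sketch; it only assumes a Spark 2.x 
build with spark-mllib on the classpath):

{code}
// Sketch: check the bundled English stop-word list for the truncated tokens
// mentioned above, using spark.ml's StopWordsRemover helper.
import org.apache.spark.ml.feature.StopWordsRemover

val english = StopWordsRemover.loadDefaultStopWords("english")
println(english.contains("won"))    // true on the affected versions, per this report
println(english.contains("won't"))  // the apostrophe form the reporter expects
{code}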



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18849) Vignettes final checks for Spark 2.1

2016-12-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18849:
--
Target Version/s: 2.1.0

> Vignettes final checks for Spark 2.1
> 
>
> Key: SPARK-18849
> URL: https://issues.apache.org/jira/browse/SPARK-18849
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Xiangrui Meng
>Assignee: Felix Cheung
>
> Make a final pass over the vignettes and ensure the content is consistent.
> * remove "since version" because is not that useful for vignettes
> * re-order/group the list of ML algorithms so there exists a logical ordering
> * check for warning or error in output message
> * anything else that seems out of place



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM

2016-12-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18703.
-
   Resolution: Fixed
 Assignee: Xiao Li
Fix Version/s: 2.2.0

> Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not 
> Dropped Until Normal Termination of JVM
> --
>
> Key: SPARK-18703
> URL: https://issues.apache.org/jira/browse/SPARK-18703
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Below are the files/directories generated for three inserts againsts a Hive 
> table:
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}
> The first 18 files are temporary. We do not drop them until the JVM terminates. 
> If the JVM does not terminate normally, these temporary files/directories will 
> not be dropped.
> Only the last two files are needed, as shown below.
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}

[jira] [Updated] (SPARK-18870) Distinct aggregates give incorrect answers on streaming dataframes

2016-12-14 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-18870:
--
Affects Version/s: 2.0.2
 Target Version/s: 2.1.0
  Component/s: Structured Streaming

> Distinct aggregates give incorrect answers on streaming dataframes
> --
>
> Key: SPARK-18870
> URL: https://issues.apache.org/jira/browse/SPARK-18870
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> The unsupported-operations check does not verify whether an AggregationExpression 
> has isDistinct=true, so `streamingDf.groupBy().agg(countDistinct("key"))` gives 
> incorrect results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18870) Distinct aggregates give incorrect answers on streaming dataframes

2016-12-14 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-18870:
-

 Summary: Distinct aggregates give incorrect answers on streaming 
dataframes
 Key: SPARK-18870
 URL: https://issues.apache.org/jira/browse/SPARK-18870
 Project: Spark
  Issue Type: Bug
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker


The unsupported-operations check does not verify whether an AggregationExpression 
has isDistinct=true, so `streamingDf.groupBy().agg(countDistinct("key"))` gives 
incorrect results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18865) SparkR vignettes MLP and LDA updates

2016-12-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18865:
--
Target Version/s: 2.1.0  (was: 2.1.1, 2.2.0)

> SparkR vignettes MLP and LDA updates
> 
>
> Key: SPARK-18865
> URL: https://issues.apache.org/jira/browse/SPARK-18865
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.1, 2.2.0
>
>
> spark.mlp doesn't provide an example. spark.lda and spark.mlp have repeated 
> content.
> The spark.lda documentation is missing default values for some parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18865) SparkR vignettes MLP and LDA updates

2016-12-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18865:
--
Fix Version/s: 2.2.0
   2.1.1

> SparkR vignettes MLP and LDA updates
> 
>
> Key: SPARK-18865
> URL: https://issues.apache.org/jira/browse/SPARK-18865
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.1, 2.2.0
>
>
> spark.mlp doesn't provide an example. spark.lda and spark.mlp have repeated 
> content.
> The spark.lda documentation is missing default values for some parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18865) SparkR vignettes MLP and LDA updates

2016-12-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18865:
--
Issue Type: Documentation  (was: Bug)

> SparkR vignettes MLP and LDA updates
> 
>
> Key: SPARK-18865
> URL: https://issues.apache.org/jira/browse/SPARK-18865
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
>
> spark.mlp doesn't provide an example. spark.lda and spark.mlp have repeated 
> content.
> The spark.lda documentation is missing default values for some parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18865) SparkR vignettes MLP and LDA updates

2016-12-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18865.
--
  Resolution: Fixed
Assignee: Miao Wang
Target Version/s: 2.1.1, 2.2.0

> SparkR vignettes MLP and LDA updates
> 
>
> Key: SPARK-18865
> URL: https://issues.apache.org/jira/browse/SPARK-18865
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
>
> spark.mlp doesn't provide an example. spark.lda and spark.mlp have repeated 
> content.
> The spark.lda documentation is missing default values for some parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18869) Add TreeNode.p that returns BaseType

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18869:

Description: After the bug fix in SPARK-18854, TreeNode.apply now returns 
TreeNode[_] rather than a more specific type. It would be easier for 
interactive debugging to introduce a function that returns the BaseType.  (was: 
After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather 
than a more specific type. It would be easier for interactive debugging to 
introduce lp that returns LogicalPlan, and pp that returns SparkPlan.
)

> Add TreeNode.p that returns BaseType
> 
>
> Key: SPARK-18869
> URL: https://issues.apache.org/jira/browse/SPARK-18869
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] 
> rather than a more specific type. It would be easier for interactive 
> debugging to introduce a function that returns the BaseType.
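
To illustrate the pattern being proposed, here is a self-contained sketch; the names 
below are made up for illustration and are not Spark's actual TreeNode code:

{code}
// Illustrative sketch: a numbered-lookup helper typed as the tree's own
// BaseType instead of the erased wildcard. Not the actual TreeNode class.
abstract class TreeLike[BaseType <: TreeLike[BaseType]] { self: BaseType =>
  def children: Seq[BaseType]

  // Pre-order traversal of the tree.
  def preOrder: Seq[BaseType] = this +: children.flatMap(_.preOrder)

  // Analogous to TreeNode.apply after SPARK-18854: the specific type is lost.
  def apply(number: Int): TreeLike[_] = preOrder(number)

  // The proposed convenience: the same lookup, but returning BaseType.
  def p(number: Int): BaseType = apply(number).asInstanceOf[BaseType]
}

case class Node(name: String, kids: Seq[Node] = Nil) extends TreeLike[Node] {
  override def children: Seq[Node] = kids
}

val tree = Node("a", Seq(Node("b"), Node("c", Seq(Node("d")))))
val typed: Node = tree.p(3)         // the node "d", statically typed as Node
val erased: TreeLike[_] = tree(3)   // the same node, only known as TreeLike[_]
{code}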



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18869) Add TreeNode.p that returns BaseType

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18869:

Summary: Add TreeNode.p that returns BaseType  (was: Add lp and pp to plan 
nodes for getting logical plans and physical plans)

> Add TreeNode.p that returns BaseType
> 
>
> Key: SPARK-18869
> URL: https://issues.apache.org/jira/browse/SPARK-18869
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] 
> rather than a more specific type. It would be easier for interactive 
> debugging to introduce lp that returns LogicalPlan, and pp that returns 
> SparkPlan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18869) Add lp and pp to plan nodes for getting logical plans and physical plans

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18869:


Assignee: Apache Spark  (was: Reynold Xin)

> Add lp and pp to plan nodes for getting logical plans and physical plans
> 
>
> Key: SPARK-18869
> URL: https://issues.apache.org/jira/browse/SPARK-18869
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] 
> rather than a more specific type. It would be easier for interactive 
> debugging to introduce lp that returns LogicalPlan, and pp that returns 
> SparkPlan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18869) Add lp and pp to plan nodes for getting logical plans and physical plans

2016-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749968#comment-15749968
 ] 

Apache Spark commented on SPARK-18869:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16288

> Add lp and pp to plan nodes for getting logical plans and physical plans
> 
>
> Key: SPARK-18869
> URL: https://issues.apache.org/jira/browse/SPARK-18869
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] 
> rather than a more specific type. It would be easier for interactive 
> debugging to introduce lp that returns LogicalPlan, and pp that returns 
> SparkPlan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18869) Add lp and pp to plan nodes for getting logical plans and physical plans

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18869:


Assignee: Reynold Xin  (was: Apache Spark)

> Add lp and pp to plan nodes for getting logical plans and physical plans
> 
>
> Key: SPARK-18869
> URL: https://issues.apache.org/jira/browse/SPARK-18869
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] 
> rather than a more specific type. It would be easier for interactive 
> debugging to introduce lp that returns LogicalPlan, and pp that returns 
> SparkPlan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18869) Add lp and pp to plan nodes for getting logical plans and physical plans

2016-12-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18869:
---

 Summary: Add lp and pp to plan nodes for getting logical plans and 
physical plans
 Key: SPARK-18869
 URL: https://issues.apache.org/jira/browse/SPARK-18869
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather 
than a more specific type. It would be easier for interactive debugging to 
introduce lp that returns LogicalPlan, and pp that returns SparkPlan.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18868) Flaky Test: StreamingQueryListenerSuite

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18868:


Assignee: Apache Spark

> Flaky Test: StreamingQueryListenerSuite
> ---
>
> Key: SPARK-18868
> URL: https://issues.apache.org/jira/browse/SPARK-18868
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> Example: 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3496/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18868) Flaky Test: StreamingQueryListenerSuite

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18868:


Assignee: (was: Apache Spark)

> Flaky Test: StreamingQueryListenerSuite
> ---
>
> Key: SPARK-18868
> URL: https://issues.apache.org/jira/browse/SPARK-18868
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Burak Yavuz
>
> Example: 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3496/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18868) Flaky Test: StreamingQueryListenerSuite

2016-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749915#comment-15749915
 ] 

Apache Spark commented on SPARK-18868:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/16287

> Flaky Test: StreamingQueryListenerSuite
> ---
>
> Key: SPARK-18868
> URL: https://issues.apache.org/jira/browse/SPARK-18868
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Burak Yavuz
>
> Example: 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3496/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18849) Vignettes final checks for Spark 2.1

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18849:


Assignee: Apache Spark  (was: Felix Cheung)

> Vignettes final checks for Spark 2.1
> 
>
> Key: SPARK-18849
> URL: https://issues.apache.org/jira/browse/SPARK-18849
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> Make a final pass over the vignettes and ensure the content is consistent.
> * remove "since version" because is not that useful for vignettes
> * re-order/group the list of ML algorithms so there exists a logical ordering
> * check for warning or error in output message
> * anything else that seems out of place



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18849) Vignettes final checks for Spark 2.1

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18849:


Assignee: Felix Cheung  (was: Apache Spark)

> Vignettes final checks for Spark 2.1
> 
>
> Key: SPARK-18849
> URL: https://issues.apache.org/jira/browse/SPARK-18849
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Xiangrui Meng
>Assignee: Felix Cheung
>
> Make a final pass over the vignettes and ensure the content is consistent.
> * remove "since version" because is not that useful for vignettes
> * re-order/group the list of ML algorithms so there exists a logical ordering
> * check for warning or error in output message
> * anything else that seems out of place



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18849) Vignettes final checks for Spark 2.1

2016-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749906#comment-15749906
 ] 

Apache Spark commented on SPARK-18849:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16286

> Vignettes final checks for Spark 2.1
> 
>
> Key: SPARK-18849
> URL: https://issues.apache.org/jira/browse/SPARK-18849
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Xiangrui Meng
>Assignee: Felix Cheung
>
> Make a final pass over the vignettes and ensure the content is consistent.
> * remove "since version" because is not that useful for vignettes
> * re-order/group the list of ML algorithms so there exists a logical ordering
> * check for warning or error in output message
> * anything else that seems out of place



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-18854:
---

Assignee: Reynold Xin

> getNodeNumbered and generateTreeString are not consistent
> -
>
> Key: SPARK-18854
> URL: https://issues.apache.org/jira/browse/SPARK-18854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.3, 2.1.0
>
>
> This is a bug introduced by subquery handling. generateTreeString numbers 
> trees including innerChildren (used to print subqueries), but getNodeNumbered 
> ignores that. As a result, getNodeNumbered is not always correct.
> Repro:
> {code}
> val df = sql("select * from range(10) where id not in " +
>   "(select id from range(2) union all select id from range(2))")
> println("---")
> println(df.queryExecution.analyzed.numberedTreeString)
> println("---")
> println("---")
> println(df.queryExecution.analyzed(3))
> println("---")
> {code}
> Output looks like
> {noformat}
> ---
> 00 Project [id#1L]
> 01 +- Filter NOT predicate-subquery#0 [(id#1L = id#2L)]
> 02:  +- Union
> 03: :- Project [id#2L]
> 04: :  +- Range (0, 2, step=1, splits=None)
> 05: +- Project [id#3L]
> 06:+- Range (0, 2, step=1, splits=None)
> 07+- Range (0, 10, step=1, splits=None)
> ---
> ---
> null
> ---
> {noformat}
> Note that 3 should be the Project node, but getNodeNumbered ignores 
> innerChildren and as a result returns the wrong one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18854.
-
  Resolution: Fixed
   Fix Version/s: 2.1.0
  2.0.3
Target Version/s: 2.0.3, 2.1.0  (was: 2.0.3, 2.1.1, 2.2.0)

> getNodeNumbered and generateTreeString are not consistent
> -
>
> Key: SPARK-18854
> URL: https://issues.apache.org/jira/browse/SPARK-18854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 2.0.3, 2.1.0
>
>
> This is a bug introduced by subquery handling. generateTreeString numbers 
> trees including innerChildren (used to print subqueries), but getNodeNumbered 
> ignores that. As a result, getNodeNumbered is not always correct.
> Repro:
> {code}
> val df = sql("select * from range(10) where id not in " +
>   "(select id from range(2) union all select id from range(2))")
> println("---")
> println(df.queryExecution.analyzed.numberedTreeString)
> println("---")
> println("---")
> println(df.queryExecution.analyzed(3))
> println("---")
> {code}
> Output looks like
> {noformat}
> ---
> 00 Project [id#1L]
> 01 +- Filter NOT predicate-subquery#0 [(id#1L = id#2L)]
> 02:  +- Union
> 03: :- Project [id#2L]
> 04: :  +- Range (0, 2, step=1, splits=None)
> 05: +- Project [id#3L]
> 06:+- Range (0, 2, step=1, splits=None)
> 07+- Range (0, 10, step=1, splits=None)
> ---
> ---
> null
> ---
> {noformat}
> Note that 3 should be the Project node, but getNodeNumbered ignores 
> innerChildren and as a result returns the wrong one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18867) Throw cause if IsolatedClientLoad can't create client

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18867:


Assignee: Apache Spark

> Throw cause if IsolatedClientLoad can't create client
> -
>
> Key: SPARK-18867
> URL: https://issues.apache.org/jira/browse/SPARK-18867
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
> Environment: RStudio 1.0.44 + SparkR (Spark 2.0.2)
>Reporter: Wei-Chiu Chuang
>Assignee: Apache Spark
>Priority: Minor
>
> If IsolatedClientLoader can't instantiate a class object, it throws an 
> {{InvocationTargetException}}. But the caller doesn't need to know about this 
> wrapper exception. Instead, it should throw the exception that caused the 
> {{InvocationTargetException}}, so that the caller may be able to handle it 
> (a sketch of that unwrapping follows the repro below).
> This exception is reproducible if I run the following code snippet in two 
> RStudio consoles without cleaning sessions. (This is an RStudio issue, after 
> all, but in general it may be exhibited in other ways.)
> {code}
> Sys.setenv(SPARK_HOME="/Users/weichiu/Downloads/spark-2.0.2-bin-hadoop2.7")
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
> "2g"))
> df <- as.DataFrame(faithful)
> sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> {code}
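
A minimal sketch of the unwrapping the description asks for; the method and names here 
are illustrative only, not the actual IsolatedClientLoader code:

{code}
// Illustrative sketch: rethrow the cause of a reflective construction failure
// instead of the InvocationTargetException wrapper.
import java.lang.reflect.InvocationTargetException

def instantiate(clazz: Class[_], args: AnyRef*): AnyRef =
  try {
    clazz.getConstructors.head.newInstance(args: _*).asInstanceOf[AnyRef]
  } catch {
    case e: InvocationTargetException if e.getCause != null =>
      // Surface the underlying error (e.g. a metastore failure) to the caller.
      throw e.getCause
  }
{code}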



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18867) Throw cause if IsolatedClientLoad can't create client

2016-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18867:


Assignee: (was: Apache Spark)

> Throw cause if IsolatedClientLoad can't create client
> -
>
> Key: SPARK-18867
> URL: https://issues.apache.org/jira/browse/SPARK-18867
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
> Environment: RStudio 1.0.44 + SparkR (Spark 2.0.2)
>Reporter: Wei-Chiu Chuang
>Priority: Minor
>
> If IsolatedClientLoader can't instantiate a class object, it throws an 
> {{InvocationTargetException}}. But the caller doesn't need to know about this 
> wrapper exception. Instead, it should throw the exception that caused the 
> {{InvocationTargetException}}, so that the caller may be able to handle it.
> This exception is reproducible if I run the following code snippet in two 
> RStudio consoles without cleaning sessions. (This is an RStudio issue, after 
> all, but in general it may be exhibited in other ways.)
> {code}
> Sys.setenv(SPARK_HOME="/Users/weichiu/Downloads/spark-2.0.2-bin-hadoop2.7")
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
> "2g"))
> df <- as.DataFrame(faithful)
> sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18867) Throw cause if IsolatedClientLoad can't create client

2016-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749893#comment-15749893
 ] 

Apache Spark commented on SPARK-18867:
--

User 'jojochuang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16285

> Throw cause if IsolatedClientLoad can't create client
> -
>
> Key: SPARK-18867
> URL: https://issues.apache.org/jira/browse/SPARK-18867
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
> Environment: RStudio 1.0.44 + SparkR (Spark 2.0.2)
>Reporter: Wei-Chiu Chuang
>Priority: Minor
>
> If IsolatedClientLoader can't instantiate a class object, it throws an 
> {{InvocationTargetException}}. But the caller doesn't need to know about this 
> wrapper exception. Instead, it should throw the exception that caused the 
> {{InvocationTargetException}}, so that the caller may be able to handle it.
> This exception is reproducible if I run the following code snippet in two 
> RStudio consoles without cleaning sessions. (This is an RStudio issue, after 
> all, but in general it may be exhibited in other ways.)
> {code}
> Sys.setenv(SPARK_HOME="/Users/weichiu/Downloads/spark-2.0.2-bin-hadoop2.7")
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
> "2g"))
> df <- as.DataFrame(faithful)
> sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


