[jira] [Resolved] (SPARK-12491) UDAF result differs in SQL if alias is used

2015-12-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12491.
--
   Resolution: Duplicate
Fix Version/s: 1.6.0
   1.5.3

> UDAF result differs in SQL if alias is used
> ---
>
> Key: SPARK-12491
> URL: https://issues.apache.org/jira/browse/SPARK-12491
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Tristan
> Fix For: 1.5.3, 1.6.0
>
> Attachments: UDAF_GM.zip
>
>
> Using the GeometricMean UDAF example 
> (https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html),
>  I found the following discrepancy in results:
> {code}
> scala> sqlContext.sql("select group_id, gm(id) from simple group by 
> group_id").show()
> ++---+
> |group_id|_c1|
> ++---+
> |   0|0.0|
> |   1|0.0|
> |   2|0.0|
> ++---+
> scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple 
> group by group_id").show()
> ++-+
> |group_id|GeometricMean|
> ++-+
> |   0|8.981385496571725|
> |   1|7.301716979342118|
> |   2|7.706253151292568|
> ++-+
> {code}
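
For reference, the GeometricMean UDAF from the linked blog post looks roughly like the sketch below (condensed, not the exact blog code); it is registered under the name "gm" used in the queries above.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class GeometricMean extends UserDefinedAggregateFunction {
  // One numeric input column; the buffer tracks the running count and product.
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(
    StructField("count", LongType) :: StructField("product", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 1.0
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getLong(0) + 1
    buffer(1) = buffer.getDouble(1) * input.getDouble(0)
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getDouble(1) * buffer2.getDouble(1)
  }

  // Geometric mean = n-th root of the product of n values.
  def evaluate(buffer: Row): Any =
    math.pow(buffer.getDouble(1), 1.0 / buffer.getLong(0))
}

sqlContext.udf.register("gm", new GeometricMean)
{code}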






[jira] [Commented] (SPARK-12491) UDAF result differs in SQL if alias is used

2015-12-28 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073146#comment-15073146
 ] 

Herman van Hovell commented on SPARK-12491:
---

I just tried the latest 1.5 branch on a Spark cluster, and I can confirm that the problem has been fixed on my side.

> UDAF result differs in SQL if alias is used
> ---
>
> Key: SPARK-12491
> URL: https://issues.apache.org/jira/browse/SPARK-12491
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Tristan
> Attachments: UDAF_GM.zip
>
>
> Using the GeometricMean UDAF example 
> (https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html),
>  I found the following discrepancy in results:
> {code}
> scala> sqlContext.sql("select group_id, gm(id) from simple group by 
> group_id").show()
> ++---+
> |group_id|_c1|
> ++---+
> |   0|0.0|
> |   1|0.0|
> |   2|0.0|
> ++---+
> scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple 
> group by group_id").show()
> ++-+
> |group_id|GeometricMean|
> ++-+
> |   0|8.981385496571725|
> |   1|7.301716979342118|
> |   2|7.706253151292568|
> ++-+
> {code}






[jira] [Closed] (SPARK-12531) Add median and mode to Summary statistics

2015-12-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-12531.
-
Resolution: Duplicate

> Add median and mode to Summary statistics
> -
>
> Key: SPARK-12531
> URL: https://issues.apache.org/jira/browse/SPARK-12531
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Gaurav Kumar
>Priority: Minor
>
> Summary statistics should also include calculating median and mode in 
> addition to mean, variance and others.






[jira] [Commented] (SPARK-12531) Add median and mode to Summary statistics

2015-12-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073159#comment-15073159
 ] 

Joseph K. Bradley commented on SPARK-12531:
---

[~gauravkumar37] The current plan for these stats is to provide them under 
DataFrames since they are useful beyond ML: [SPARK-6761] should handle 
(approximate) median.  I don't think there's a JIRA for mode yet, but it'd be 
related at least to the count-min-sketch JIRA linked from the parent of 
[SPARK-6761].  I'd recommend voting/commenting on those JIRAs for these stats.  
I'll close this JIRA since this will be done outside of MLlib.
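
For reference, the DataFrame-side approximate median mentioned above surfaced in later Spark releases (after this discussion) as approxQuantile; a minimal sketch, where df and its "value" column are hypothetical:

{code}
// 0.5 is the median quantile; 0.01 is the allowed relative error.
val Array(approxMedian) = df.stat.approxQuantile("value", Array(0.5), 0.01)
{code}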

> Add median and mode to Summary statistics
> -
>
> Key: SPARK-12531
> URL: https://issues.apache.org/jira/browse/SPARK-12531
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Gaurav Kumar
>Priority: Minor
>
> Summary statistics should also include calculating median and mode in 
> addition to mean, variance and others.






[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs

2015-12-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073187#comment-15073187
 ] 

Joseph K. Bradley commented on SPARK-12488:
---

I don't see what could be causing this yet, but I'd really like it fixed if we 
can reproduce it.  I'll watch the JIRA.

> LDA describeTopics() Generates Invalid Term IDs
> ---
>
> Key: SPARK-12488
> URL: https://issues.apache.org/jira/browse/SPARK-12488
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Ilya Ganelin
>
> When running the LDA model and using the describeTopics function, invalid
> values appear in the term ID list that is returned.
> The example below generates 10 topics on a data set with a vocabulary of 685.
> {code}
> // Set LDA parameters
> val numTopics = 10
> val lda = new LDA().setK(numTopics).setMaxIterations(10)
> val ldaModel = lda.run(docTermVector)
> val distModel = 
> ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted.reverse
> res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
> 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
> 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
> 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
> 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
> 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
> 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 
> 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 
> 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 
> 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 
> 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53...
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted
> res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
> -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
> -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
> -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
> -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, 
> -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, 
> -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, 
> -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, 
> -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 
> 38, 39, 40, 41, 42, 43, 44, 45, 4...
> {code}
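
A quick sanity check one could run against the returned topics (a sketch; vocabSize = 685 comes from the report above):

{code}
// All term IDs returned by describeTopics() should lie in [0, vocabSize).
val vocabSize = 685
val termIds = ldaModel.describeTopics()(0)._1
val invalidIds = termIds.filter(id => id < 0 || id >= vocabSize)
println(s"${invalidIds.length} invalid term IDs out of ${termIds.length}")
{code}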






[jira] [Closed] (SPARK-5742) Network over RDMA

2015-12-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-5742.
--
Resolution: Later

This is something excellent to explore. However, I'm going to close the ticket 
as "later" to cut down the number of open tickets. We can always reopen it when 
we get closer to having a concrete plan.



> Network over RDMA
> -
>
> Key: SPARK-5742
> URL: https://issues.apache.org/jira/browse/SPARK-5742
> Project: Spark
>  Issue Type: Task
>  Components: Shuffle, Spark Core
>Reporter: Dina Leventol
>
> Expand the network layer to work over RDMA as well as TCP.
> This will improve the data flow for shuffle and cached blocks by adding an
> RDMA client and server in addition to the existing NIO and Netty implementations.
> The RDMA implementation is built on the JXIO library: https://github.com/accelio/JXIO






[jira] [Resolved] (SPARK-12522) Add the missing document string for the SQL configuration

2015-12-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12522.
-
   Resolution: Fixed
 Assignee: Xiao Li
Fix Version/s: 2.0.0

> Add the missing document string for the SQL configuration
> -
>
> Key: SPARK-12522
> URL: https://issues.apache.org/jira/browse/SPARK-12522
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 2.0.0
>
>
> When issuing the command "SET -V", we can see the placeholder "TODO" where
> documentation strings are missing. We need to add the missing documentation
> for configurations such as:
> spark.sql.columnNameOfCorruptRecord
> spark.sql.hive.verifyPartitionPath
> spark.sql.sources.parallelPartitionDiscovery.threshold
> spark.sql.hive.convertMetastoreParquet.mergeSchema
> spark.sql.hive.convertCTAS
> spark.sql.hive.thriftServer.async
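
A quick way to spot the remaining gaps is to list the configurations together with their descriptions; a minimal sketch, assuming a sqlContext is in scope:

{code}
// Entries whose description column still reads "TODO" are missing documentation.
sqlContext.sql("SET -V").show(200, false)
{code}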






[jira] [Commented] (SPARK-12239) ​SparkR - Not distributing SparkR module in YARN

2015-12-28 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073341#comment-15073341
 ] 

Sun Rui commented on SPARK-12239:
-

Instead of a case-by-case fix, I would prefer the latter approach. [~shivaram],
any comments?

> ​SparkR - Not distributing SparkR module in YARN
> 
>
> Key: SPARK-12239
> URL: https://issues.apache.org/jira/browse/SPARK-12239
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, YARN
>Affects Versions: 1.5.2, 1.5.3
>Reporter: Sebastian YEPES FERNANDEZ
>Priority: Critical
>
> Hello,
> I am trying to use SparkR in a YARN environment and I have encountered the
> following problem:
> Everything works correctly when using bin/sparkR, but if I try running the
> same jobs using SparkR directly through R, it does not work.
> I have managed to track down the cause: when SparkR is launched through R, the
> "SparkR" module is not distributed to the worker nodes.
> I have tried working around this issue using the setting
> "spark.yarn.dist.archives", but it does not work because it deploys the
> file/extracted folder with the extension ".zip", while workers are actually
> looking for a folder named "sparkr".
> Is there currently any way to make this work?
> {code}
> # spark-defaults.conf
> spark.yarn.dist.archives /opt/apps/spark/R/lib/sparkr.zip
> # R
> library(SparkR, lib.loc="/opt/apps/spark/R/lib/")
> sc <- sparkR.init(appName="SparkR", master="yarn-client", 
> sparkEnvir=list(spark.executor.instances="1"))
> sqlContext <- sparkRSQL.init(sc)
> df <- createDataFrame(sqlContext, faithful)
> head(df)
> 15/12/09 09:04:24 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> fr-s-cour-wrk3.alidaho.com): java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
> {code}
> Container stderr:
> {code}
> 15/12/09 09:04:14 INFO storage.MemoryStore: Block broadcast_1 stored as 
> values in memory (estimated size 8.7 KB, free 530.0 MB)
> 15/12/09 09:04:14 INFO r.BufferedStreamThread: Fatal error: cannot open file 
> '/hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02/sparkr/SparkR/worker/daemon.R':
>  No such file or directory
> 15/12/09 09:04:24 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.net.SocketTimeoutException: Accept timed out
>   at java.net.PlainSocketImpl.socketAccept(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
>   at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>   at java.net.ServerSocket.accept(ServerSocket.java:513)
>   at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)
> {code}
> Worker node that ran the container:
> {code}
> # ls -la 
> /hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02
> total 71M
> drwx--x--- 3 yarn hadoop 4.0K Dec  9 09:04 .
> drwx--x--- 7 yarn hadoop 4.0K Dec  9 09:04 ..
> -rw-r--r-- 1 yarn hadoop  110 Dec  9 09:03 container_tokens
> -rw-r--r-- 1 yarn hadoop   12 Dec  9 09:03 .container_tokens.crc
> -rwx-- 1 yarn hadoop  736 Dec  9 09:03 
> default_container_executor_session.sh
> -rw-r--r-- 1 yarn hadoop   16 Dec  9 09:03 
> .default_container_executor_session.sh.crc
> -rwx-- 1 yarn hadoop  790 Dec  9 09:03 default_container_executor.sh
> -rw-r--r-- 1 yarn hadoop   16 Dec  9 09:03 .default_container_executor.sh.crc
> -rwxr-xr-x 1 yarn hadoop  61K Dec  9 09:04 hadoop-lzo-0.6.0.2.3.2.0-2950.jar
> -rwxr-xr-x 1 yarn hadoop 317K Dec  9 09:04 kafka-clients-0.8.2.2.jar
> -rwx-- 1 yarn hadoop 6.0K Dec  9 09:03 launch_container.sh
> -rw-r--r-- 1 yarn hadoop   56 Dec  9 09:03 .launch_container.sh.crc
> -rwxr-xr-x 1 yarn hadoop 2.2M Dec  9 09:04 
> spark-cassandra-connector_2.10-1.5.0-M3.jar
> -rwxr-xr-x 1 yarn hadoop 7.1M Dec  9 09:04 spark-csv-assembly-1.3.0.jar
> lrwxrwxrwx 1 yarn hadoop  119 Dec  9 09:03 __spark__.jar -> 
> /hadoop/hdfs/disk03/hadoop/yarn/local/usercache/spark/filecache/361/spark-assembly-1.5.3-SNAPSHOT-hadoop2.7.1.jar
> lrwxrwxrwx 1 yarn hadoop   84 Dec  9 09:03 sparkr.zip -> 
> /hadoop/hdfs/disk01/hadoop/yarn/local/usercache/spark/filecache/359/sparkr.zip
> -rwxr-xr-x 1 yarn hadoop 1.8M Dec  9 09:04 
> spark-streaming_2.10-1.5.3-SNAPSHOT.jar
> -rwxr-xr-x 1 yarn hadoop  11M Dec  9 09:04 
> spark-streaming-kafka-assembly_2.10-1.5.3-SNAPSHOT.jar
> -rwxr-xr-x 1 yarn hadoop  48M Dec  9 09:04 
> sparkts-0.1.0-SNAPSHOT-jar-with-dependencies.jar
> drwx--x--- 2 yarn hadoop   46 Dec  9 09:04 tmp
> {code}
> *Working 
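
One workaround that might be worth trying (an untested sketch; it assumes YARN's distributed-cache alias syntax, archive#linkname, is honored for spark.yarn.dist.archives) is to give the archive an explicit link name matching the folder the workers look for:

{code}
# spark-defaults.conf (hypothetical): the part after '#' is the link name under
# which the extracted archive is exposed in the container's working directory.
spark.yarn.dist.archives /opt/apps/spark/R/lib/sparkr.zip#sparkr
{code}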

[jira] [Reopened] (SPARK-12493) Can't open "details" span of ExecutionsPage in IE11

2015-12-28 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula reopened SPARK-12493:
--

> Can't open "details" span of ExecutionsPage in IE11
> ---
>
> Key: SPARK-12493
> URL: https://issues.apache.org/jira/browse/SPARK-12493
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>







[jira] [Commented] (SPARK-12493) Can't open "details" span of ExecutionsPage in IE11

2015-12-28 Thread meiyoula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073342#comment-15073342
 ] 

meiyoula commented on SPARK-12493:
--

[~sowen], I think it's simple to reproduce this problem.
I don't know the root cause, but it may be a JS compatibility issue.

> Can't open "details" span of ExecutionsPage in IE11
> ---
>
> Key: SPARK-12493
> URL: https://issues.apache.org/jira/browse/SPARK-12493
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>







[jira] [Comment Edited] (SPARK-12511) streaming driver with checkpointing unable to finalize leading to OOM

2015-12-28 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073344#comment-15073344
 ] 

Shixiong Zhu edited comment on SPARK-12511 at 12/29/15 1:33 AM:


I have not yet figured out the root cause. Here is what I have found so far: the
"Finalizer" thread is blocked by Py4J, so the finalizer queue keeps growing.

{code}
"Finalizer" #3 daemon prio=8 os_prio=31 tid=0x7feaa380e000 nid=0x3503 
runnable [0x000117ca4000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
- locked <0x0007813be228> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
- locked <0x0007813be228> (a java.io.InputStreamReader)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at py4j.CallbackConnection.sendCommand(CallbackConnection.java:82)
at py4j.CallbackClient.sendCommand(CallbackClient.java:236)
at 
py4j.reflection.PythonProxyHandler.finalize(PythonProxyHandler.java:81)
at java.lang.System$2.invokeFinalize(System.java:1270)
at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:98)
at java.lang.ref.Finalizer.access$100(Finalizer.java:34)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:210)
{code}


was (Author: zsxwing):
Has not yet figured out the root cause. Here are my found right now: the 
"Finalizer" thread is blocked by py4j, so the finalizer keeps growing.

{code}
"Finalizer" #3 daemon prio=8 os_prio=31 tid=0x7feaa380e000 nid=0x3503 
runnable [0x000117ca4000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
- locked <0x0007813be228> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
- locked <0x0007813be228> (a java.io.InputStreamReader)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at py4j.CallbackConnection.sendCommand(CallbackConnection.java:82)
at py4j.CallbackClient.sendCommand(CallbackClient.java:236)
at 
py4j.reflection.PythonProxyHandler.finalize(PythonProxyHandler.java:81)
at java.lang.System$2.invokeFinalize(System.java:1270)
at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:98)
at java.lang.ref.Finalizer.access$100(Finalizer.java:34)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:210)
{code}

> streaming driver with checkpointing unable to finalize leading to OOM
> -
>
> Key: SPARK-12511
> URL: https://issues.apache.org/jira/browse/SPARK-12511
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.5.2
> Environment: pyspark 1.5.2
> yarn 2.6.0
> python 2.6
> centos 6.5
> openjdk 1.8.0
>Reporter: Antony Mayi
>Assignee: Shixiong Zhu
>Priority: Critical
> Attachments: bug.py, finalizer-classes.png, finalizer-pending.png, 
> finalizer-spark_assembly.png
>
>
> A Spark Streaming application configured with checkpointing fills the driver's
> heap with ZipFileInputStream instances, apparently because spark-assembly.jar
> (and potentially other jars, for example snappy-java.jar) is referenced
> (loaded?) repeatedly. The Java Finalizer can't finalize these
> ZipFileInputStream instances, so they eventually consume the whole heap and
> the driver crashes with an OOM.
> h2. Steps to reproduce:
> * Submit attached [^bug.py] to spark
> * Leave it running and monitor the driver java process heap
> ** with heap dump you will primarily see growing instances of 

[jira] [Commented] (SPARK-12511) streaming driver with checkpointing unable to finalize leading to OOM

2015-12-28 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073344#comment-15073344
 ] 

Shixiong Zhu commented on SPARK-12511:
--

I have not yet figured out the root cause. Here is what I have found so far: the
"Finalizer" thread is blocked by Py4J, so the finalizer queue keeps growing.

{code}
"Finalizer" #3 daemon prio=8 os_prio=31 tid=0x7feaa380e000 nid=0x3503 
runnable [0x000117ca4000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
- locked <0x0007813be228> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
- locked <0x0007813be228> (a java.io.InputStreamReader)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at py4j.CallbackConnection.sendCommand(CallbackConnection.java:82)
at py4j.CallbackClient.sendCommand(CallbackClient.java:236)
at 
py4j.reflection.PythonProxyHandler.finalize(PythonProxyHandler.java:81)
at java.lang.System$2.invokeFinalize(System.java:1270)
at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:98)
at java.lang.ref.Finalizer.access$100(Finalizer.java:34)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:210)
{code}
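
To watch the symptom without taking a full heap dump, one could poll the JVM's count of objects pending finalization from the driver; a minimal sketch (a hypothetical monitoring helper, not part of the reporter's [^bug.py]):

{code}
import java.lang.management.ManagementFactory

// Logs the number of objects waiting on the single Finalizer thread; a steadily
// growing count matches the behaviour described above.
val memoryBean = ManagementFactory.getMemoryMXBean
new Thread("finalizer-monitor") {
  setDaemon(true)
  override def run(): Unit = {
    while (true) {
      println(s"objects pending finalization: ${memoryBean.getObjectPendingFinalizationCount}")
      Thread.sleep(10000)
    }
  }
}.start()
{code}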

> streaming driver with checkpointing unable to finalize leading to OOM
> -
>
> Key: SPARK-12511
> URL: https://issues.apache.org/jira/browse/SPARK-12511
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.5.2
> Environment: pyspark 1.5.2
> yarn 2.6.0
> python 2.6
> centos 6.5
> openjdk 1.8.0
>Reporter: Antony Mayi
>Assignee: Shixiong Zhu
>Priority: Critical
> Attachments: bug.py, finalizer-classes.png, finalizer-pending.png, 
> finalizer-spark_assembly.png
>
>
> A Spark Streaming application configured with checkpointing fills the driver's
> heap with ZipFileInputStream instances, apparently because spark-assembly.jar
> (and potentially other jars, for example snappy-java.jar) is referenced
> (loaded?) repeatedly. The Java Finalizer can't finalize these
> ZipFileInputStream instances, so they eventually consume the whole heap and
> the driver crashes with an OOM.
> h2. Steps to reproduce:
> * Submit attached [^bug.py] to spark
> * Leave it running and monitor the driver java process heap
> ** with heap dump you will primarily see growing instances of byte array data 
> (here cumulated zip payload of the jar refs):
> {noformat}
>  num #instances #bytes  class name
> --
>1: 32653   32735296  [B
>2: 480005135816  [C
>3:411344144  [Lscala.concurrent.forkjoin.ForkJoinTask;
>4: 113621261816  java.lang.Class
>5: 470541129296  java.lang.String
>6: 254601018400  java.lang.ref.Finalizer
>7:  9802 789400  [Ljava.lang.Object;
> {noformat}
> ** with visualvm you can see:
> *** increasing number of objects pending for finalization
> !finalizer-pending.png!
> *** increasing number of ZipFileInputStreams instances related to the 
> spark-assembly.jar referenced by Finalizer
> !finalizer-spark_assembly.png!
> * Depending on the heap size and running time this will lead to driver OOM 
> crash
> h2. Comments
> * The [^bug.py] is lightweight proof of the problem. In production I am 
> experiencing this as quite rapid effect - in few hours it eats gigs of heap 
> and kills the app.
> * If the same [^bug.py] is run without checkpointing there is no issue 
> whatsoever.
> * Not sure if it is just pyspark related.
> * In [^bug.py] I am using the socketTextStream input but seems to be 
> independent of the input type (in production having same problem with Kafka 
> direct stream, have seen it even with textFileStream).
> * It is happening even if the input stream doesn't produce any data.




[jira] [Comment Edited] (SPARK-12493) Can't open "details" span of ExecutionsPage in IE11

2015-12-28 Thread meiyoula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073342#comment-15073342
 ] 

meiyoula edited comment on SPARK-12493 at 12/29/15 1:31 AM:


[~sowen], I think it's simple to reproduce this problem.
I don't know the root cause, but it may be a JS compatibility issue.


was (Author: meiyoula):
[~sowen]I think it's simply to reproduce this problem. 
I don't know the root cause, but maybe it's the problem of JS compatibility.

> Can't open "details" span of ExecutionsPage in IE11
> ---
>
> Key: SPARK-12493
> URL: https://issues.apache.org/jira/browse/SPARK-12493
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>







[jira] [Commented] (SPARK-12511) streaming driver with checkpointing unable to finalize leading to OOM

2015-12-28 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073349#comment-15073349
 ] 

Shixiong Zhu commented on SPARK-12511:
--

Is it possible that your sending loop prevents Py4J's Python code from running?

> streaming driver with checkpointing unable to finalize leading to OOM
> -
>
> Key: SPARK-12511
> URL: https://issues.apache.org/jira/browse/SPARK-12511
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.5.2
> Environment: pyspark 1.5.2
> yarn 2.6.0
> python 2.6
> centos 6.5
> openjdk 1.8.0
>Reporter: Antony Mayi
>Assignee: Shixiong Zhu
>Priority: Critical
> Attachments: bug.py, finalizer-classes.png, finalizer-pending.png, 
> finalizer-spark_assembly.png
>
>
> A Spark Streaming application configured with checkpointing fills the driver's
> heap with ZipFileInputStream instances, apparently because spark-assembly.jar
> (and potentially other jars, for example snappy-java.jar) is referenced
> (loaded?) repeatedly. The Java Finalizer can't finalize these
> ZipFileInputStream instances, so they eventually consume the whole heap and
> the driver crashes with an OOM.
> h2. Steps to reproduce:
> * Submit attached [^bug.py] to spark
> * Leave it running and monitor the driver java process heap
> ** with heap dump you will primarily see growing instances of byte array data 
> (here cumulated zip payload of the jar refs):
> {noformat}
>  num #instances #bytes  class name
> --
>1: 32653   32735296  [B
>2: 480005135816  [C
>3:411344144  [Lscala.concurrent.forkjoin.ForkJoinTask;
>4: 113621261816  java.lang.Class
>5: 470541129296  java.lang.String
>6: 254601018400  java.lang.ref.Finalizer
>7:  9802 789400  [Ljava.lang.Object;
> {noformat}
> ** with visualvm you can see:
> *** increasing number of objects pending for finalization
> !finalizer-pending.png!
> *** increasing number of ZipFileInputStreams instances related to the 
> spark-assembly.jar referenced by Finalizer
> !finalizer-spark_assembly.png!
> * Depending on the heap size and running time this will lead to driver OOM 
> crash
> h2. Comments
> * The [^bug.py] is lightweight proof of the problem. In production I am 
> experiencing this as quite rapid effect - in few hours it eats gigs of heap 
> and kills the app.
> * If the same [^bug.py] is run without checkpointing there is no issue 
> whatsoever.
> * Not sure if it is just pyspark related.
> * In [^bug.py] I am using the socketTextStream input but seems to be 
> independent of the input type (in production having same problem with Kafka 
> direct stream, have seen it even with textFileStream).
> * It is happening even if the input stream doesn't produce any data.






[jira] [Updated] (SPARK-12493) Can't open "details" span of ExecutionsPage in IE11

2015-12-28 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-12493:
-
Description: 
Reproduce steps:
1. sbin/start-thriftserver.sh
2. Beeline connect to thriftserver, and run "show tables;"
3. open the sparkUI in IE11
4. go to SQLTab to see if the "details" span can open

> Can't open "details" span of ExecutionsPage in IE11
> ---
>
> Key: SPARK-12493
> URL: https://issues.apache.org/jira/browse/SPARK-12493
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>
> Reproduce steps:
> 1. sbin/start-thriftserver.sh
> 2. Beeline connect to thriftserver, and run "show tables;"
> 3. open the sparkUI in IE11
> 4. go to SQLTab to see if the "details" span can open






[jira] [Updated] (SPARK-12493) Can't open "details" span of ExecutionsPage in IE11

2015-12-28 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-12493:
-
Attachment: IE version.png

> Can't open "details" span of ExecutionsPage in IE11
> ---
>
> Key: SPARK-12493
> URL: https://issues.apache.org/jira/browse/SPARK-12493
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: meiyoula
> Attachments: IE version.png, screenshot-1.png
>
>
> Reproduce steps:
> 1. sbin/start-thriftserver.sh
> 2. Beeline connect to thriftserver, and run "show tables;"
> 3. open the sparkUI in IE11
> 4. go to SQLTab to see if the "details" span can open






[jira] [Reopened] (SPARK-12492) SQL page of Spark-sql is always blank

2015-12-28 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula reopened SPARK-12492:
--

This problem still occurs.

> SQL page of Spark-sql is always blank 
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>
> When I run a sql query in spark-sql, the Execution page of SQL tab is always 
> blank. But the JDBCServer is not blank.






[jira] [Assigned] (SPARK-12512) WithColumn does not work on multiple column with special character

2015-12-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12512:


Assignee: Apache Spark

> WithColumn does not work on multiple column with special character
> --
>
> Key: SPARK-12512
> URL: https://issues.apache.org/jira/browse/SPARK-12512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: JO EE
>Assignee: Apache Spark
>  Labels: spark, sql
>
> Just for simplicity, I am using a Scala IDE worksheet to show the problem:
> the subsequent .withColumnRenamed("bField","k.b:Field") does not work after an
> earlier rename to a column name containing special characters.
> {code:title=Bar.scala|borderStyle=solid}
> object bug {
>   println("Welcome to the Scala worksheet")   //> Welcome to the Scala 
> worksheet
>   
>   import org.apache.spark.SparkContext
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SQLContext
>   import org.apache.spark.sql.Row
>   import org.apache.spark.sql.types.DateType
>   import org.apache.spark.sql.functions._
>   import org.apache.spark.storage.StorageLevel._
>   import org.apache.spark.sql.types.{StructType,StructField,StringType}
>   
>   val conf = new SparkConf()
>  .setMaster("local[4]")
>  .setAppName("Testbug")   //> conf  : 
> org.apache.spark.SparkConf = org.apache.spark.SparkConf@3b94d659
>   
>   val sc = new SparkContext(conf) //> sc  : 
> org.apache.spark.SparkContext = org.apache.spark.SparkContext@1dcca8d3
>   //| 
>
>   val sqlContext = new SQLContext(sc) //> sqlContext  : 
> org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLCont
>   //| ext@2d23faef
>   
>   val schemaString = "aField,bField,cField"   //> schemaString  : String 
> = aField,bField,cField
>   
>   val schema = StructType(schemaString.split(",")
>   .map(fieldName => StructField(fieldName, StringType, true)))
>   //> schema  : 
> org.apache.spark.sql.types.StructType = StructType(StructField(aFi
>   //| eld,StringType,true), 
> StructField(bField,StringType,true), StructField(cFiel
>   //| d,StringType,true))
>   //import sqlContext.implicits._
>
>   val newRDD = sc.parallelize(List(("a","b","c")))
>   .map(x=>Row(x._1,x._2,x._3))  //> newRDD  : 
> org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitions
>   //| RDD[1] at map at 
> com.joee.worksheet.bug.scala:30
>   
>   val newDF = sqlContext.createDataFrame(newRDD, schema)
>   //> newDF  : 
> org.apache.spark.sql.DataFrame = [aField: string, bField: string, c
>   //| Field: string]
>   
>   val changeDF = newDF.withColumnRenamed("aField","anodotField")
>   .withColumnRenamed("bField","bnodotField")
>   .show() //> 
> +---+---+--+
>   //| 
> |anodotField|bnodotField|cField|
>   //| 
> +---+---+--+
>   //| |  a|  
> b| c|
>   //| 
> +---+---+--+
>   //| 
>   //| changeDF  : Unit = ()
>   val changeDFwithdotfield1 = newDF.withColumnRenamed("aField","k.a:Field")
>   //> changeDFwithdotfield1  
> : org.apache.spark.sql.DataFrame = [k.a:Field: strin
>   //| g, bField: string, 
> cField: string]
>   
>   val changeDFwithdotfield = changeDFwithdotfield1 
> .withColumnRenamed("bField","k.b:Field")
>   //> 
> org.apache.spark.sql.AnalysisException: cannot resolve 'k.a:Field' given in
>   //| put columns k.a:Field, 
> bField, cField;
>   //| at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAn
>   //| alysis(package.scala:42)
>   //| at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAn
>   //| 
> 
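
A minimal, self-contained sketch of the reported behaviour (condensed from the worksheet above):

{code}
// Rename two columns to names containing '.' and ':'. The second rename fails,
// apparently because the dot in "k.a:Field" is interpreted as a struct field access.
val df = sqlContext.createDataFrame(Seq(("a", "b", "c"))).toDF("aField", "bField", "cField")
val step1 = df.withColumnRenamed("aField", "k.a:Field")     // works
val step2 = step1.withColumnRenamed("bField", "k.b:Field")  // AnalysisException: cannot resolve 'k.a:Field'
{code}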

[jira] [Commented] (SPARK-12492) SQL page of Spark-sql is always blank

2015-12-28 Thread meiyoula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073378#comment-15073378
 ] 

meiyoula commented on SPARK-12492:
--

[~srowen], I still hit this problem using the latest master branch.
After reading the code, I find that the object in "SQLListener" is only ever
read, never written.

> SQL page of Spark-sql is always blank 
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>
> When I run a sql query in spark-sql, the Execution page of SQL tab is always 
> blank. But the JDBCServer is not blank.






[jira] [Comment Edited] (SPARK-12196) Store blocks in different speed storage devices by hierarchy way

2015-12-28 Thread wei wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073386#comment-15073386
 ] 

wei wu edited comment on SPARK-12196 at 12/29/15 2:31 AM:
--

Yes, Hao. The local dir path format ("[SSD]file:///") may not be recognized by
the YARN local dir setting.
Another question: if the user has already mounted the devices (in a production
cluster) as follows:
/mnt/c, /mnt/d, /mnt/e, ..., /mnt/i
then to use this new feature the user would have to re-mount the disk devices.
We think the following configuration may be better:
spark.local.dir = /mnt/c, /mnt/d, /mnt/e, ..., /mnt/i
spark.storage.hierarchyStore.reserved.quota = SSD 50GB, DISK, SSD 80GB, ..., DISK

We also suggest the following configuration idea: a space reserver thread in the
block manager should check that enough space stays reserved for each SSD store.
The reserver addresses the problem of running out of free SSD space when blocks
are written concurrently. For example:
spark.ssd.reserver.interval.ms = 1000

If the SSD capacity is small, the SSD may hold either cached RDDs or shuffle
data, and different jobs may compete for it. If the user wants to give priority
to caching RDDs on SSD, we should add a flag for enabling or disabling SSD
storage for shuffle data:
spark.ssd.shuffle.enabled = false





was (Author: allan wu):
Yes, Hao. The local dir path format "[SSD]file:///") may be not identified by 
the Yarn local dir setting.
Another question is that: if the use have mount the device (in production 
environment cluster ) as the follows :
/mnt/c, /mnt/d, /mnt/e/, ,  /mnt/i
If the user want to use the new feature in spark new version, the user should 
re-mount the disk device.
we think the following configuration may be better:
spark.local.dir = /mnt/c, /mnt/d, /mnt/e/, ,  /mnt/i
spark.storage.hierarchyStore.reserved.quota = SSD 50GB, DISK, SSD 80GB,  ,  
DISK

And we suggest the  following configuration idea: 
I think we should set a  space reverser thread in block manager  to check if 
enough space is reserved for each SSD storage. The reserver is used to solve 
the no free SSD space problem when concurrently write blocks.  Just like: 
spark.ssd. reserver.interval.ms = 1000

If  the SSD capacity is small, the SSD may be cache the RDD or save the shuffle 
data.  Different job may compete the SSD resource (may be cache RDD or shuffle 
data). But the user want to give priority in use of the SSD to cache the RDD.  
I think we should add the similar configuration to Flag for enabling the SSD 
storage to shuffle data.
spark.ssd.shuffle.enabled = false




> Store blocks in different speed storage devices by hierarchy way
> 
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but their capacity is small. HDDs have good
> capacity, but are 2x-3x slower than SSDs.
> How can we get the best of both?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further: we support building a hierarchy
> store with any number of levels across all storage media (NVM, SSD, HDD, etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
> could be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because 
> we support both RDD cache and shuffle and no extra inter process 
> communication.
> *Usage*
> 1. Set the priority and threshold for each layer in 
> spark.storage.hierarchyStore.
> {code}
> spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
> {code}
> It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and
> all the rest form the last layer.
> 2. Configure each layer's location, user just needs put the keyword like 
> "nvm", "ssd", which are specified in step 1, into local dirs, like 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After then, restart your Spark application, it will 

[jira] [Commented] (SPARK-12196) Store blocks in different speed storage devices by hierarchy way

2015-12-28 Thread wei wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073386#comment-15073386
 ] 

wei wu commented on SPARK-12196:


Yes, Hao. The local dir path format ("[SSD]file:///") may not be recognized by
the YARN local dir setting.
Another question: if the user has already mounted the devices (in a production
cluster) as follows:
/mnt/c, /mnt/d, /mnt/e, ..., /mnt/i
then to use this new feature the user would have to re-mount the disk devices.
We think the following configuration may be better:
spark.local.dir = /mnt/c, /mnt/d, /mnt/e, ..., /mnt/i
spark.storage.hierarchyStore.reserved.quota = SSD 50GB, DISK, SSD 80GB, ..., DISK

We also suggest the following configuration idea: a space reserver thread in the
block manager should check that enough space stays reserved for each SSD store.
The reserver addresses the problem of running out of free SSD space when blocks
are written concurrently. For example:
spark.ssd.reserver.interval.ms = 1000

If the SSD capacity is small, the SSD may hold either cached RDDs or shuffle
data, and different jobs may compete for it. If the user wants to give priority
to caching RDDs on SSD, we should add a flag for enabling or disabling SSD
storage for shuffle data:
spark.ssd.shuffle.enabled = false




> Store blocks in different speed storage devices by hierarchy way
> 
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but their capacity is small. HDDs have good
> capacity, but are 2x-3x slower than SSDs.
> How can we get the best of both?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further: we support building a hierarchy
> store with any number of levels across all storage media (NVM, SSD, HDD, etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
> could be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because 
> we support both RDD cache and shuffle and no extra inter process 
> communication.
> *Usage*
> 1. Set the priority and threshold for each layer in 
> spark.storage.hierarchyStore.
> {code}
> spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
> {code}
> It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and
> all the rest form the last layer.
> 2. Configure each layer's location, user just needs put the keyword like 
> "nvm", "ssd", which are specified in step 1, into local dirs, like 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> Then restart your Spark application; it will allocate blocks from nvm first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.






[jira] [Created] (SPARK-12547) Tighten scala style checker enforcement for UDF registration

2015-12-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12547:
---

 Summary: Tighten scala style checker enforcement for UDF 
registration
 Key: SPARK-12547
 URL: https://issues.apache.org/jira/browse/SPARK-12547
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


We use scalastyle:off to turn off style checks in certain places where it is
not possible to follow the style guide. This is usually OK. However, in UDF
registration we disable the checker for a large amount of code simply because
some lines exceed the 100-character limit. It is better to disable just the
line-length check rather than everything.
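
For illustration, scalastyle can disable a single rule by its id rather than all checks; a sketch using the standard line-length rule id (the function name is a placeholder):

{code}
// Only the line-length rule is suppressed; all other checks still apply here.
// scalastyle:off line.size.limit
def registerSomeVeryLongUdfSignatureThatCannotBeWrappedNicely(): Unit = { /* ... */ }
// scalastyle:on line.size.limit
{code}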






[jira] [Updated] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-12-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11193:
--
Assignee: Jean-Baptiste Onofré

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
>Assignee: Jean-Baptiste Onofré
> Fix For: 1.6.0, 2.0.0
>
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com
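
For context, the cast in KinesisReceiver.onStart can only succeed if the map was actually constructed with the SynchronizedMap mixin; a minimal illustration of the failing pattern (not the actual Spark code):

{code}
import scala.collection.mutable

val plain  = new mutable.HashMap[String, Int]
val synced = new mutable.HashMap[String, Int] with mutable.SynchronizedMap[String, Int]

synced.asInstanceOf[mutable.SynchronizedMap[String, Int]]  // fine: mixin is present
plain.asInstanceOf[mutable.SynchronizedMap[String, Int]]   // ClassCastException, as in the stack trace above
{code}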






[jira] [Assigned] (SPARK-12536) Fix the Explain Outputs of Empty LocalRelation and LocalTableScan

2015-12-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12536:


Assignee: Apache Spark

> Fix the Explain Outputs of Empty LocalRelation and LocalTableScan
> -
>
> Key: SPARK-12536
> URL: https://issues.apache.org/jira/browse/SPARK-12536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> The filter Filter (false) (e.g. after 1 = 0 is constant-folded) is turned into
> an empty LocalRelation by SimplifyFilters. In the current explain output, the
> optimized and physical plans therefore look confusing.
> For example, 
> {code}
> val df = Seq(1 -> "a").toDF("a", "b")
> df.where("1 = 0").explain(true)
> {code}
> {code}
> == Parsed Logical Plan ==
> Filter (1 = 0)
> +- Project [_1#0 AS a#2,_2#1 AS b#3]
>+- LocalRelation [_1#0,_2#1], [[1,a]]
> == Analyzed Logical Plan ==
> a: int, b: string
> Filter (1 = 0)
> +- Project [_1#0 AS a#2,_2#1 AS b#3]
>+- LocalRelation [_1#0,_2#1], [[1,a]]
> == Optimized Logical Plan ==
> LocalRelation [a#2,b#3] 
> == Physical Plan ==
> LocalTableScan [a#2,b#3] 
> {code}






[jira] [Resolved] (SPARK-12535) Generating scaladoc using sbt fails for network-common and catalyst modules

2015-12-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12535.
---
Resolution: Duplicate

> Generating scaladoc using sbt fails for network-common and catalyst modules
> ---
>
> Key: SPARK-12535
> URL: https://issues.apache.org/jira/browse/SPARK-12535
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Blocker
>
> Executing {{./build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 
> -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests 
> network-common/compile:doc catalyst/compile:doc}} fails with scaladoc errors 
> (the command was narrowed to the modules that failed - I initially used 
> {{clean publishLocal}}).






[jira] [Commented] (SPARK-12531) Add median and mode to Summary statistics

2015-12-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072582#comment-15072582
 ] 

Sean Owen commented on SPARK-12531:
---

That wasn't my point -- you can't compute them exactly and efficiently at the 
same time, since you either hold all data in memory or approximate. It's not 
that they're hard to implement in the naive way.

> Add median and mode to Summary statistics
> -
>
> Key: SPARK-12531
> URL: https://issues.apache.org/jira/browse/SPARK-12531
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Gaurav Kumar
>Priority: Minor
>
> Summary statistics should also include calculating median and mode in 
> addition to mean, variance and others.






[jira] [Comment Edited] (SPARK-12506) Push down WHERE clause arithmetic operator to JDBC layer

2015-12-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072575#comment-15072575
 ] 

Hyukjin Kwon edited comment on SPARK-12506 at 12/28/15 9:54 AM:


[~huaxing] Maybe we should do this one first 
https://issues.apache.org/jira/browse/SPARK-9182.

I guess you might be thinking of the change of some codes at 
{{DataSourceStrategy}} or making this with {{CatalystScan}}, right?

In case of {{CatalystScan}}, I opened the issue here already, 
https://issues.apache.org/jira/browse/SPARK-12126.

In case of {{DataSourceStrategy}}, We might need to deal with {{Cast}} first 
https://issues.apache.org/jira/browse/SPARK-9182.


was (Author: hyukjin.kwon):
[~huaxing] Maybe we should do this one first 
https://issues.apache.org/jira/browse/SPARK-9182.

I guess you might be thinking of the change of some codes at 
{{DataSourceStrategy}} or making this with {{CatalystScan}}, right?

In case of {{CatalystScan}}, I opened the issue here already, 
https://issues.apache.org/jira/browse/SPARK-12126.

In case of {{DataSourceStrategy}}, We might need to deal with {{Cast}} first.

> Push down WHERE clause arithmetic operator to JDBC layer
> 
>
> Key: SPARK-12506
> URL: https://issues.apache.org/jira/browse/SPARK-12506
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Huaxin Gao
>
> For an arithmetic operator in a WHERE clause, such as
> select * from table where c1 + c2 > 10
> the predicate c1 + c2 > 10 is currently evaluated at the Spark layer.
> We will push it down to the JDBC layer so that it is evaluated in the database.
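
A sketch of the behaviour being described, assuming a generic JDBC source (the URL and table name are hypothetical):

{code}
// Only simple column-vs-literal comparisons are currently pushed down as JDBC
// filters; an arithmetic predicate such as c1 + c2 > 10 stays in the Spark plan
// and is evaluated after the rows have been fetched.
val props = new java.util.Properties()
val df = sqlContext.read.jdbc("jdbc:postgresql://dbhost/testdb", "table1", props)
df.filter("c1 + c2 > 10").explain(true)
{code}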






[jira] [Comment Edited] (SPARK-12420) Have a built-in CSV data source implementation

2015-12-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072580#comment-15072580
 ] 

Hyukjin Kwon edited comment on SPARK-12420 at 12/28/15 10:05 AM:
-

+1, I was wondering why it has been staying as third party.


was (Author: hyukjin.kwon):
+1, I was wondering why it has been staying third party.

> Have a built-in CSV data source implementation
> --
>
> Key: SPARK-12420
> URL: https://issues.apache.org/jira/browse/SPARK-12420
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: Built-in CSV datasource in Spark.pdf
>
>
> CSV is the most common data format in the "small data" world. It is often the 
> first format people want to try when they see Spark on a single node. Having 
> to rely on a 3rd party component for this is a very bad user experience for 
> new users.
> We should consider inlining https://github.com/databricks/spark-csv



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12492) SQL page of Spark-sql is always blank

2015-12-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12492.
---
Resolution: Cannot Reproduce

> SQL page of Spark-sql is always blank 
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>
> When I run a sql query in spark-sql, the Execution page of SQL tab is always 
> blank. But the JDBCServer is not blank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12493) Can't open "details" span of ExecutionsPage in IE11

2015-12-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12493.
---
Resolution: Cannot Reproduce

> Can't open "details" span of ExecutionsPage in IE11
> ---
>
> Key: SPARK-12493
> URL: https://issues.apache.org/jira/browse/SPARK-12493
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12536) Fix the Explain Outputs of Empty LocalRelation and LocalTableScan

2015-12-28 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12536:

Description: 
The filter Filter (False) generates an empty LocalRelation in SimplifyFilters. 
In the current Explain, the optimized and physical plans look confusing.

For example, 
{code}
val df = Seq(1 -> "a").toDF("a", "b")
df.where("1 = 0").explain(true)
{code}
{code}
== Parsed Logical Plan ==
Filter (1 = 0)
+- Project [_1#0 AS a#2,_2#1 AS b#3]
   +- LocalRelation [_1#0,_2#1], [[1,a]]

== Analyzed Logical Plan ==
a: int, b: string
Filter (1 = 0)
+- Project [_1#0 AS a#2,_2#1 AS b#3]
   +- LocalRelation [_1#0,_2#1], [[1,a]]

== Optimized Logical Plan ==
LocalRelation [a#2,b#3] 

== Physical Plan ==
LocalTableScan [a#2,b#3] 
{code}


  was:
{code}
val df = Seq(1 -> "a").toDF("a", "b")
df.where("1 = 0").explain(true)
{code}

The filter Filter (1 = 0) generates an empty `LocalRelation`. In the current 
explain, the optimized and physical plans look wrong because Filter (1 = 0) is 
removed by the optimizer.

{code}
== Parsed Logical Plan ==
Filter (1 = 0)
+- Project [_1#0 AS a#2,_2#1 AS b#3]
   +- LocalRelation [_1#0,_2#1], [[1,a]]

== Analyzed Logical Plan ==
a: int, b: string
Filter (1 = 0)
+- Project [_1#0 AS a#2,_2#1 AS b#3]
   +- LocalRelation [_1#0,_2#1], [[1,a]]

== Optimized Logical Plan ==
LocalRelation [a#2,b#3] 

== Physical Plan ==
LocalTableScan [a#2,b#3] 
{code}



> Fix the Explain Outputs of Empty LocalRelation and LocalTableScan
> -
>
> Key: SPARK-12536
> URL: https://issues.apache.org/jira/browse/SPARK-12536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> The filter Filter (False) generates an empty LocalRelation in 
> SimplifyFilters. In the current Explain, the optimized and physical plans 
> look confusing.
> For example, 
> {code}
> val df = Seq(1 -> "a").toDF("a", "b")
> df.where("1 = 0").explain(true)
> {code}
> {code}
> == Parsed Logical Plan ==
> Filter (1 = 0)
> +- Project [_1#0 AS a#2,_2#1 AS b#3]
>+- LocalRelation [_1#0,_2#1], [[1,a]]
> == Analyzed Logical Plan ==
> a: int, b: string
> Filter (1 = 0)
> +- Project [_1#0 AS a#2,_2#1 AS b#3]
>+- LocalRelation [_1#0,_2#1], [[1,a]]
> == Optimized Logical Plan ==
> LocalRelation [a#2,b#3] 
> == Physical Plan ==
> LocalTableScan [a#2,b#3] 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12502) Script /dev/run-tests fails when IBM Java is used

2015-12-28 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072565#comment-15072565
 ] 

Kousuke Saruta commented on SPARK-12502:


Oh, [~srowen] has already modified the Fix Version/s from 1.6.0 to 1.6.1. 
Thanks.

> Script /dev/run-tests fails when IBM Java is used
> -
>
> Key: SPARK-12502
> URL: https://issues.apache.org/jira/browse/SPARK-12502
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.2
> Environment: IBM Java
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> When execute ./dev/run-tests with IBM Java, an exception occurs.
> This is due to difference of a "java version" format.
> $ JAVA_HOME=~/ibmjava8 ./dev/run-tests
> Traceback (most recent call last):
>   File "run-tests.py", line 571, in 
> main()
>   File "run-tests.py", line 474, in main
> java_version = determine_java_version(java_exe)
>   File "run-tests.py", line 169, in determine_java_version
> major = int(match.group(1))
> AttributeError: 'NoneType' object has no attribute 'group'
> $ ~/ibmjava8/bin/java -version
> java version "1.8.0"
> Java(TM) SE Runtime Environment (build pxa6480sr2-20151023_01(SR2))
> IBM J9 VM (build 2.8, JRE 1.8.0 Linux amd64-64 Compressed References 
> 20151019_272764 (JIT enabled, AOT enabled)
> J9VM - R28_Java8_SR2_20151019_2144_B272764
> JIT  - tr.r14.java_20151006_102517.04
> GC   - R28_Java8_SR2_20151019_2144_B272764_CMPRSS
> J9CL - 20151019_272764)
> JCL - 20151022_01 based on Oracle jdk8u65-b17
> $ ~/openjdk8/bin/java -version
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
> $
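The failure presumably comes from the version-parsing pattern in dev/run-tests.py expecting the Oracle-style "_NN" update suffix, which IBM Java omits. The real fix belongs in that Python script; the snippet below is only a sketch of a more tolerant pattern, written in Scala to match the other examples in this digest:

{code}
// Sketch of a version pattern that accepts both "1.8.0_66" (Oracle/OpenJDK)
// and "1.8.0" (IBM Java) by making the update suffix optional.
val versionPattern = """java version "(\d+)\.(\d+)\.(\d+)(_\d+)?""".r.unanchored

def majorMinor(javaVersionOutput: String): Option[(Int, Int)] =
  javaVersionOutput match {
    case versionPattern(major, minor, _, _) => Some((major.toInt, minor.toInt))
    case _                                  => None
  }

majorMinor("java version \"1.8.0_66\"")  // Some((1, 8))
majorMinor("java version \"1.8.0\"")     // Some((1, 8))
{code}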



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12196) Store blocks in different speed storage devices by hierarchy way

2015-12-28 Thread Zhang, Liye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang, Liye updated SPARK-12196:

Comment: was deleted

(was: I am out of office with limited email access from 12/21/2015 to 
12/25/2015. Sorry for slow email response. Any emergency, contact my manager 
(Cheng, Hao hao.ch...@intel.com). Thanks
)

> Store blocks in different speed storage devices by hierarchy way
> 
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is small. HDDs have good capacity, 
> but x2-x3 lower than SSDs.
> How can we get both good?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further. We support a way to build any 
> level hierarchy store access all storage medias (NVM, SSD, HDD etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
> could be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because 
> we support both RDD cache and shuffle and no extra inter process 
> communication.
> *Usage*
> 1. Set the priority and threshold for each layer in 
> spark.storage.hierarchyStore.
> {code}
> spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
> {code}
> It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", all 
> the rest form the last layer.
> 2. Configure each layer's location, user just needs put the keyword like 
> "nvm", "ssd", which are specified in step 1, into local dirs, like 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After then, restart your Spark application, it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.
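For readers following along, here is a minimal sketch of how the proposed settings above would be wired up from code. Note that spark.storage.hierarchyStore is only a proposal in this JIRA, not an existing Spark configuration key, and the directory names are illustrative:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hierarchy-store-example")
  // Proposed key from this JIRA (not a real Spark setting yet):
  .set("spark.storage.hierarchyStore", "nvm 50GB,ssd 80GB")
  // Local dirs whose names carry the layer keywords ("nvm", "ssd"):
  .set("spark.local.dir", "/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/disk1,/mnt/disk2,/mnt/others")

val sc = new SparkContext(conf)
{code}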



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11559) Make `runs` no effect in k-means

2015-12-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072572#comment-15072572
 ] 

Apache Spark commented on SPARK-11559:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10306

> Make `runs` no effect in k-means
> 
>
> Key: SPARK-11559
> URL: https://issues.apache.org/jira/browse/SPARK-11559
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>
> We deprecated `runs` in Spark 1.6 (SPARK-11358). In 1.7.0, we can either 
> remove `runs` or make it no effect (with warning messages). So we can 
> simplify the implementation. I prefer the latter for better binary 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12196) Store blocks in different speed storage devices by hierarchy way

2015-12-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072634#comment-15072634
 ] 

Cheng Hao commented on SPARK-12196:
---

Thank you, wei wu, for supporting this feature!

However, we're trying to avoid changing the existing configuration format, as it might impact user applications; besides, on YARN/Mesos this configuration key would no longer work.

An updated PR will be submitted soon; you're welcome to join the discussion in the PR.

> Store blocks in different speed storage devices by hierarchy way
> 
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is small. HDDs have good capacity, 
> but x2-x3 lower than SSDs.
> How can we get both good?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further. We support a way to build any 
> level hierarchy store access all storage medias (NVM, SSD, HDD etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
> could be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because 
> we support both RDD cache and shuffle and no extra inter process 
> communication.
> *Usage*
> 1. Set the priority and threshold for each layer in 
> spark.storage.hierarchyStore.
> {code}
> spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
> {code}
> It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", all 
> the rest form the last layer.
> 2. Configure each layer's location, user just needs put the keyword like 
> "nvm", "ssd", which are specified in step 1, into local dirs, like 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After then, restart your Spark application, it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12536) Fix the Explain Outputs of Empty LocalRelation and LocalTableScan

2015-12-28 Thread Xiao Li (JIRA)
Xiao Li created SPARK-12536:
---

 Summary: Fix the Explain Outputs of Empty LocalRelation and 
LocalTableScan
 Key: SPARK-12536
 URL: https://issues.apache.org/jira/browse/SPARK-12536
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Xiao Li


{code}
val df = Seq(1 -> "a").toDF("a", "b")
df.where("1 = 0").explain(true)
{code}

The filter Filter (1 = 0) generates an empty `LocalRelation`. In the current 
explain, the optimized and physical plans look wrong because Filter (1 = 0) is 
removed by the optimizer.

{code}
== Parsed Logical Plan ==
Filter (1 = 0)
+- Project [_1#0 AS a#2,_2#1 AS b#3]
   +- LocalRelation [_1#0,_2#1], [[1,a]]

== Analyzed Logical Plan ==
a: int, b: string
Filter (1 = 0)
+- Project [_1#0 AS a#2,_2#1 AS b#3]
   +- LocalRelation [_1#0,_2#1], [[1,a]]

== Optimized Logical Plan ==
LocalRelation [a#2,b#3] 

== Physical Plan ==
LocalTableScan [a#2,b#3] 
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12420) Have a built-in CSV data source implementation

2015-12-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072580#comment-15072580
 ] 

Hyukjin Kwon commented on SPARK-12420:
--

+1, I was wondering why it has been staying third party.

> Have a built-in CSV data source implementation
> --
>
> Key: SPARK-12420
> URL: https://issues.apache.org/jira/browse/SPARK-12420
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: Built-in CSV datasource in Spark.pdf
>
>
> CSV is the most common data format in the "small data" world. It is often the 
> first format people want to try when they see Spark on a single node. Having 
> to rely on a 3rd party component for this is a very bad user experience for 
> new users.
> We should consider inlining https://github.com/databricks/spark-csv



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12353) wrong output for countByValue and countByValueAndWindow

2015-12-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12353:
--
Assignee: Saisai Shao

> wrong output for countByValue and countByValueAndWindow
> ---
>
> Key: SPARK-12353
> URL: https://issues.apache.org/jira/browse/SPARK-12353
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Input/Output, PySpark, Streaming
>Affects Versions: 1.5.2
> Environment: Ubuntu 14.04, Python 2.7.6
>Reporter: Bo Jin
>Assignee: Saisai Shao
>  Labels: releasenotes
> Fix For: 2.0.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> http://stackoverflow.com/q/34114585/4698425
> In PySpark Streaming, function countByValue and countByValueAndWindow return 
> one single number which is the count of distinct elements, instead of a list 
> of (k,v) pairs.
> It's inconsistent with the documentation: 
> countByValue: When called on a DStream of elements of type K, return a new 
> DStream of (K, Long) pairs where the value of each key is its frequency in 
> each RDD of the source DStream.
> countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a 
> new DStream of (K, Long) pairs where the value of each key is its frequency 
> within a sliding window. Like in reduceByKeyAndWindow, the number of reduce 
> tasks is configurable through an optional argument.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12353) wrong output for countByValue and countByValueAndWindow

2015-12-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12353.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10350
[https://github.com/apache/spark/pull/10350]

> wrong output for countByValue and countByValueAndWindow
> ---
>
> Key: SPARK-12353
> URL: https://issues.apache.org/jira/browse/SPARK-12353
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Input/Output, PySpark, Streaming
>Affects Versions: 1.5.2
> Environment: Ubuntu 14.04, Python 2.7.6
>Reporter: Bo Jin
>  Labels: releasenotes
> Fix For: 2.0.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> http://stackoverflow.com/q/34114585/4698425
> In PySpark Streaming, function countByValue and countByValueAndWindow return 
> one single number which is the count of distinct elements, instead of a list 
> of (k,v) pairs.
> It's inconsistent with the documentation: 
> countByValue: When called on a DStream of elements of type K, return a new 
> DStream of (K, Long) pairs where the value of each key is its frequency in 
> each RDD of the source DStream.
> countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a 
> new DStream of (K, Long) pairs where the value of each key is its frequency 
> within a sliding window. Like in reduceByKeyAndWindow, the number of reduce 
> tasks is configurable through an optional argument.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12196) Store blocks in different speed storage devices by hierarchy way

2015-12-28 Thread wei wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072567#comment-15072567
 ] 

wei wu commented on SPARK-12196:


We have a similar idea about SSD support for the Spark block manager, have built a prototype for it, and have run some performance tests on it. How about adding the following functionality and API?

We used the benchmark suite from Databricks (https://github.com/databricks/spark-perf/tree/master/spark-tests) with 3 executors, 4 GB of memory and 2 cores per executor, and a data size of 1867 MB.
The performance results are:

Test case          Memory   SSD     HDD
count              0.259s   3s      6.75s
count-with-filter  0.56s    3.24s   10s
aggregate-by-key   2s       4.8s    9s

The prototype configuration is as follows. We use a configuration similar to the Hadoop DataNode path configuration:

spark.local.dir = [DISK]file:///disk0; [SSD]file:///disk1; [DISK]file:///disk2; [SSD]file:///disk3; [DISK]file:///disk4; [DISK]file:///disk5; [DISK]file:///disk6; [DISK]file:///disk7;
or
spark.local.dir = file:///disk0; [SSD]file:///disk1; file:///disk2; [SSD]file:///disk3; file:///disk4; file:///disk5; file:///disk6; file:///disk7;
or
spark.local.dir = file:///disk0; file:///disk1; file:///disk2; file:///disk3; file:///disk4; file:///disk5; file:///disk6; file:///disk7;

We add the [SSD] and [DISK] identifiers to the different disk paths. [SSD] marks a disk as SSD storage, and [DISK] marks it as an HDD. If [DISK] is omitted from a disk path, the disk defaults to HDD storage.

Add the related StorageLevel API for SSD:
StorageLevel.MEMORY_AND_SSD            // cache the block in memory, then SSD
StorageLevel.SSD_ONLY                  // cache the block only on SSD
StorageLevel.MEMORY_AND_SSD_AND_DISK   // cache the block in memory, then SSD, then HDD
StorageLevel.SSD_AND_DISK              // cache the block on SSD, then HDD

For example, the user can use the following API to cache block data:
RDD.persist(StorageLevel.MEMORY_AND_SSD)
RDD.persist(StorageLevel.SSD_ONLY)
RDD.persist(StorageLevel.SSD_AND_DISK)
RDD.persist(StorageLevel.MEMORY_AND_SSD_AND_DISK)






> Store blocks in different speed storage devices by hierarchy way
> 
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is small. HDDs have good capacity, 
> but x2-x3 lower than SSDs.
> How can we get both good?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further. We support a way to build any 
> level hierarchy store access all storage medias (NVM, SSD, HDD etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
> could be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because 
> we support both RDD cache and shuffle and no extra inter process 
> communication.
> *Usage*
> 1. Set the priority and threshold for each layer in 
> spark.storage.hierarchyStore.
> {code}
> spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
> {code}
> It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", all 
> the rest form the last layer.
> 2. Configure each layer's location, user just needs put the keyword like 
> "nvm", "ssd", which are specified in step 1, into local dirs, like 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After then, restart your Spark application, it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12534) Document missing command line options to Spark properties mapping

2015-12-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072587#comment-15072587
 ] 

Sean Owen commented on SPARK-12534:
---

I don't think this info is worth maintaining redundantly in the configs doc. 
It's not about the CLI.

> Document missing command line options to Spark properties mapping
> -
>
> Key: SPARK-12534
> URL: https://issues.apache.org/jira/browse/SPARK-12534
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Documentation, YARN
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>
> Several Spark properties equivalent to Spark submit command line options are 
> missing.
> {quote}
> The equivalent for spark-submit --num-executors should be 
> spark.executor.instances
> When use in SparkConf?
> http://spark.apache.org/docs/latest/running-on-yarn.html
> Could you try setting that with sparkR.init()?
> _
> From: Franc Carter 
> Sent: Friday, December 25, 2015 9:23 PM
> Subject: number of executors in sparkR.init()
> To: 
> Hi,
> I'm having trouble working out how to get the number of executors set when 
> using sparkR.init().
> If I start sparkR with
>   sparkR  --master yarn --num-executors 6 
> then I get 6 executors
> However, if start sparkR with
>   sparkR 
> followed by
>   sc <- sparkR.init(master="yarn-client",   
> sparkEnvir=list(spark.num.executors='6'))
> then I only get 2 executors.
> Can anyone point me in the direction of what I might doing wrong ? I need to 
> initialise this was so that rStudio can hook in to SparkR
> thanks
> -- 
> Franc
> {quote}
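For reference, the mapping discussed above can be shown directly in Scala (the original question is about SparkR, so this is only the Scala analogue): on YARN, the --num-executors flag corresponds to the spark.executor.instances property, which can be set on a SparkConf before the context is created.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent of `spark-submit --master yarn-client --num-executors 6`,
// expressed through the property the command-line option maps to.
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("num-executors-example")
  .set("spark.executor.instances", "6")

val sc = new SparkContext(conf)
{code}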



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12515) Minor clarification on DataFrameReader.jdbc doc

2015-12-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12515.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10465
[https://github.com/apache/spark/pull/10465]

> Minor clarification on DataFrameReader.jdbc doc
> ---
>
> Key: SPARK-12515
> URL: https://issues.apache.org/jira/browse/SPARK-12515
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Trivial
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12224) R support for JDBC source

2015-12-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12224:


Assignee: Apache Spark

> R support for JDBC source
> -
>
> Key: SPARK-12224
> URL: https://issues.apache.org/jira/browse/SPARK-12224
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12224) R support for JDBC source

2015-12-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12224:


Assignee: (was: Apache Spark)

> R support for JDBC source
> -
>
> Key: SPARK-12224
> URL: https://issues.apache.org/jira/browse/SPARK-12224
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12506) Push down WHERE clause arithmetic operator to JDBC layer

2015-12-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072575#comment-15072575
 ] 

Hyukjin Kwon commented on SPARK-12506:
--

[~huaxing] Maybe we should do https://issues.apache.org/jira/browse/SPARK-9182 first.

I guess you might be thinking of changing some code in {{DataSourceStrategy}}, or of implementing this with {{CatalystScan}}, right?

For {{CatalystScan}}, I already opened an issue: https://issues.apache.org/jira/browse/SPARK-12126.

For {{DataSourceStrategy}}, we might need to deal with {{Cast}} first.

> Push down WHERE clause arithmetic operator to JDBC layer
> 
>
> Key: SPARK-12506
> URL: https://issues.apache.org/jira/browse/SPARK-12506
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Huaxin Gao
>
> For arithmetic operator in WHERE clause such as
> select * from table where c1 + c2 > 10
> Currently where c1 + c2 >10 is done at spark layer. 
> Will push this to JDBC layer so it will be done in database. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11600) Spark MLlib 1.6 QA umbrella

2015-12-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072584#comment-15072584
 ] 

Sean Owen commented on SPARK-11600:
---

Shouldn't these be resolved as of 1.6.0? The unfinished children need to be edited so they don't suggest they're going into 1.6, and then either closed as won't-fix or left open.

> Spark MLlib 1.6 QA umbrella
> ---
>
> Key: SPARK-11600
> URL: https://issues.apache.org/jira/browse/SPARK-11600
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next MLlib release's QA period.
> h2. API
> * Check binary API compatibility (SPARK-11601)
> * Audit new public APIs (from the generated html doc)
> ** Scala (SPARK-11602)
> ** Java compatibility (SPARK-11605)
> ** Python coverage (SPARK-11604)
> * Check Experimental, DeveloperApi tags (SPARK-11603)
> h2. Algorithms and performance
> *Performance*
> * _List any other missing performance tests from spark-perf here_
> * ALS.recommendAll (SPARK-7457)
> * perf-tests in Python (SPARK-7539)
> * perf-tests for transformers (SPARK-2838)
> * MultilayerPerceptron (SPARK-11911)
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide (SPARK-11606)
> * For major components, create JIRAs for example code (SPARK-9670)
> * Update Programming Guide for 1.6 (towards end of QA) (SPARK-11608)
> * Update website (SPARK-11607)
> * Merge duplicate content under examples/ (SPARK-11685)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12515) Minor clarification on DataFrameReader.jdbc doc

2015-12-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12515:
--
   Assignee: Felix Cheung  (was: Apache Spark)
   Priority: Trivial  (was: Minor)
Component/s: Documentation
 Issue Type: Improvement  (was: Bug)

> Minor clarification on DataFrameReader.jdbc doc
> ---
>
> Key: SPARK-12515
> URL: https://issues.apache.org/jira/browse/SPARK-12515
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12224) R support for JDBC source

2015-12-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12224:


Assignee: Apache Spark

> R support for JDBC source
> -
>
> Key: SPARK-12224
> URL: https://issues.apache.org/jira/browse/SPARK-12224
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-28 Thread Cazen Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072730#comment-15072730
 ] 

Cazen Lee commented on SPARK-12537:
---

I misunderstood. Recreated the pull request (10497).

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12538) bucketed table support

2015-12-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-12538:

Description: cc [~nongli] , please attach the design doc.

> bucketed table support
> --
>
> Key: SPARK-12538
> URL: https://issues.apache.org/jira/browse/SPARK-12538
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>
> cc [~nongli] , please attach the design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12539) support writing bucketed table

2015-12-28 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-12539:
---

 Summary: support writing bucketed table
 Key: SPARK-12539
 URL: https://issues.apache.org/jira/browse/SPARK-12539
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12538) bucketed table support

2015-12-28 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-12538:
---

 Summary: bucketed table support
 Key: SPARK-12538
 URL: https://issues.apache.org/jira/browse/SPARK-12538
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12539) support writing bucketed table

2015-12-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12539:


Assignee: Apache Spark

> support writing bucketed table
> --
>
> Key: SPARK-12539
> URL: https://issues.apache.org/jira/browse/SPARK-12539
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-28 Thread Cazen Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cazen Lee updated SPARK-12537:
--
Description: 
We can provide an option to choose whether the JSON parser accepts backslash quoting of any character or not.

For example, if a JSON file contains an escape sequence not covered by the JSON backslash quoting specification, the row is returned as a corrupt_record

JSON File
{quote}
{"name": "Cazen Lee", "price": "$10"}
{"name": "John Doe", "price": "\$20"}
{"name": "Tracy", "price": "$10"}
{quote}

{quote}
scala> df.show
++-+-+
| _corrupt_record| name|price|
++-+-+
|null|Cazen Lee|  $10|
|{"name": "John Do...| null| null|
|null|Tracy|  $10|
++-+-+
{quote}

And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
option like below

{quote}
scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
"true").json("/user/Cazen/test/test2.txt")
df: org.apache.spark.sql.DataFrame = [name: string, price: string]

scala> df.show
+-+-+
| name|price|
+-+-+
|Cazen Lee|  $10|
| John Doe|  $20|
|Tracy|  $10|
+-+-+
{quote}

This issue similar to HIVE-11825, HIVE-12717.


  was:
We can provides the option to choose JSON parser can be enabled to accept 
quoting of all character or not.

For example, if JSON file that includes not listed by JSON backslash quoting 
specification, it returns corrupt_record

JSON File

{"name": "Cazen Lee", "price": "$10"}
{"name": "John Doe", "price": "\$20"}
{"name": "Tracy", "price": "$10"}



scala> df.show
++-+-+
| _corrupt_record| name|price|
++-+-+
|null|Cazen Lee|  $10|
|{"name": "John Do...| null| null|
|null|Tracy|  $10|
++-+-+


And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
option like below


scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
"true").json("/user/Cazen/test/test2.txt")
df: org.apache.spark.sql.DataFrame = [name: string, price: string]

scala> df.show
+-+-+
| name|price|
+-+-+
|Cazen Lee|  $10|
| John Doe|  $20|
|Tracy|  $10|
+-+-+


This issue similar to HIVE-11825, HIVE-12717.



> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> JSON File
> {quote}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {quote}
> {quote}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {quote}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {quote}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {quote}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-28 Thread Cazen Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072715#comment-15072715
 ] 

Cazen Lee commented on SPARK-12537:
---

Good day, Owen! This is Cazen.

The problem is a character escaped with a non-standard backslash (\) in the data.

It also affects the other columns of the row (the "name" data is valid but comes back null):

{code}
scala> df.select("name").show
+-+
| name|
+-+
|Cazen Lee|
| null|
|Tracy|
+-+
{code}

JacksonParser does not accept a non-standard backslash in the body, so its strict rules cause an exception (Unrecognized character escape) when parsing.

From the user's point of view it looks like a bug: most users focus on the data as a whole, not on individual rows.

So I suggest adding an option that lets the parser handle this.
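For what it's worth, the behaviour being toggled is Jackson's own ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER feature. The standalone sketch below (plain Jackson, not Spark's JSON data source code) shows the strict default rejecting "\$" and the lenient mode accepting it:

{code}
import com.fasterxml.jackson.core.{JsonFactory, JsonParseException, JsonParser, JsonToken}

// Parse a one-field JSON object and return either the parse error or the string value.
def parsePrice(json: String, lenient: Boolean): Either[String, String] = {
  val factory = new JsonFactory()
  if (lenient) factory.enable(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER)
  val parser = factory.createParser(json)
  try {
    var value: Option[String] = None
    while (parser.nextToken() != null) {
      if (parser.getCurrentToken == JsonToken.VALUE_STRING) value = Some(parser.getText)
    }
    Right(value.getOrElse(""))
  } catch {
    case e: JsonParseException => Left(e.getOriginalMessage)
  } finally {
    parser.close()
  }
}

val row = """{"price": "\$20"}"""
parsePrice(row, lenient = false)  // Left("Unrecognized character escape ...")
parsePrice(row, lenient = true)   // Right("$20")
{code}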

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072725#comment-15072725
 ] 

Apache Spark commented on SPARK-12537:
--

User 'Cazen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10496

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072733#comment-15072733
 ] 

Sean Owen commented on SPARK-12537:
---

I understand that. I'm wondering when such JSON would come up in practice and 
what it looks like. That is, why does Jackson generally reject it by default?

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12537:
--
Priority: Minor  (was: Major)

Can you be more specific about what the problem is in practice? What JSON causes this?

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Priority: Minor
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> 
> corrupt_record(returns null)
> 
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> 
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> 
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> 
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-28 Thread Cazen Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cazen Lee updated SPARK-12537:
--
   Priority: Major  (was: Minor)
Description: 
We can provide an option to choose whether the JSON parser accepts backslash quoting of any character or not.

For example, if a JSON file contains an escape sequence not covered by the JSON backslash quoting specification, the row is returned as a corrupt_record

{code:title=JSON File|borderStyle=solid}
{"name": "Cazen Lee", "price": "$10"}
{"name": "John Doe", "price": "\$20"}
{"name": "Tracy", "price": "$10"}
{code}

corrupt_record(returns null)
{code}
scala> df.show
++-+-+
| _corrupt_record| name|price|
++-+-+
|null|Cazen Lee|  $10|
|{"name": "John Do...| null| null|
|null|Tracy|  $10|
++-+-+
{code}

And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
option like below

{code}
scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
"true").json("/user/Cazen/test/test2.txt")
df: org.apache.spark.sql.DataFrame = [name: string, price: string]

scala> df.show
+-+-+
| name|price|
+-+-+
|Cazen Lee|  $10|
| John Doe|  $20|
|Tracy|  $10|
+-+-+
{code}

This issue similar to HIVE-11825, HIVE-12717.


  was:
We can provides the option to choose JSON parser can be enabled to accept 
quoting of all character or not.

For example, if JSON file that includes not listed by JSON backslash quoting 
specification, it returns corrupt_record

{code:title=JSON File|borderStyle=solid}
{"name": "Cazen Lee", "price": "$10"}
{"name": "John Doe", "price": "\$20"}
{"name": "Tracy", "price": "$10"}


corrupt_record(returns null)

scala> df.show
++-+-+
| _corrupt_record| name|price|
++-+-+
|null|Cazen Lee|  $10|
|{"name": "John Do...| null| null|
|null|Tracy|  $10|
++-+-+


And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
option like below


scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
"true").json("/user/Cazen/test/test2.txt")
df: org.apache.spark.sql.DataFrame = [name: string, price: string]

scala> df.show
+-+-+
| name|price|
+-+-+
|Cazen Lee|  $10|
| John Doe|  $20|
|Tracy|  $10|
+-+-+


This issue similar to HIVE-11825, HIVE-12717.


> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-28 Thread Cazen Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cazen Lee updated SPARK-12537:
--
Description: 
We can provide an option to choose whether the JSON parser accepts backslash quoting of any character or not.

For example, if a JSON file contains an escape sequence not covered by the JSON backslash quoting specification, the row is returned as a corrupt_record

{code:title=JSON File|borderStyle=solid}
{"name": "Cazen Lee", "price": "$10"}
{"name": "John Doe", "price": "\$20"}
{"name": "Tracy", "price": "$10"}


corrupt_record(returns null)

scala> df.show
++-+-+
| _corrupt_record| name|price|
++-+-+
|null|Cazen Lee|  $10|
|{"name": "John Do...| null| null|
|null|Tracy|  $10|
++-+-+


And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
option like below


scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
"true").json("/user/Cazen/test/test2.txt")
df: org.apache.spark.sql.DataFrame = [name: string, price: string]

scala> df.show
+-+-+
| name|price|
+-+-+
|Cazen Lee|  $10|
| John Doe|  $20|
|Tracy|  $10|
+-+-+


This issue similar to HIVE-11825, HIVE-12717.

  was:
We can provides the option to choose JSON parser can be enabled to accept 
quoting of all character or not.

For example, if JSON file that includes not listed by JSON backslash quoting 
specification, it returns corrupt_record

JSON File
{quote}
{"name": "Cazen Lee", "price": "$10"}
{"name": "John Doe", "price": "\$20"}
{"name": "Tracy", "price": "$10"}
{quote}

{quote}
scala> df.show
++-+-+
| _corrupt_record| name|price|
++-+-+
|null|Cazen Lee|  $10|
|{"name": "John Do...| null| null|
|null|Tracy|  $10|
++-+-+
{quote}

And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
option like below

{quote}
scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
"true").json("/user/Cazen/test/test2.txt")
df: org.apache.spark.sql.DataFrame = [name: string, price: string]

scala> df.show
+-+-+
| name|price|
+-+-+
|Cazen Lee|  $10|
| John Doe|  $20|
|Tracy|  $10|
+-+-+
{quote}

This issue similar to HIVE-11825, HIVE-12717.



> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> 
> corrupt_record(returns null)
> 
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> 
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> 
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> 
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8616) SQLContext doesn't handle tricky column names when loading from JDBC

2015-12-28 Thread PJ Fanning (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072697#comment-15072697
 ] 

PJ Fanning commented on SPARK-8616:
---

Seems to duplicate the 'In Progress' task, SPARK-12437.

> SQLContext doesn't handle tricky column names when loading from JDBC
> 
>
> Key: SPARK-8616
> URL: https://issues.apache.org/jira/browse/SPARK-8616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Ubuntu 14.04, Sqlite 3.8.7, Spark 1.4.0
>Reporter: Gergely Svigruha
>
> Reproduce:
>  - create a table in a relational database (in my case sqlite) with a column 
> name containing a space:
>  CREATE TABLE my_table (id INTEGER, "tricky column" TEXT);
>  - try to create a DataFrame using that table:
> sqlContext.read.format("jdbc").options(Map(
>   "url" -> "jdbs:sqlite:...",
>   "dbtable" -> "my_table")).load()
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> column: tricky)
> According to the SQL spec this should be valid:
> http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier
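A possible workaround sketch (an editor's assumption, not something discussed in the issue): since the "dbtable" option accepts a subquery, the tricky column can be quoted and aliased inside the subquery so Spark never has to generate SQL that references the bare name. Table and column names below are the reporter's example; the database path is hypothetical.

{code}
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:sqlite:/path/to/my.db",   // hypothetical path
  // Quote the awkward identifier and give it a plain alias inside the subquery.
  "dbtable" -> """(SELECT id, "tricky column" AS tricky_column FROM my_table) AS t"""
)).load()
{code}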



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-28 Thread Cazen Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072779#comment-15072779
 ] 

Cazen Lee commented on SPARK-12537:
---

Hmm, I hope this example is a good description.

Assume we gather a large volume of search-word logs from customers' devices, and 
the logs accumulate on each device until a wifi connection is available.

The logs have a nested JSON structure like below:

{code}
{
"deviceId": "Hcd8sdId8sdfC",
"searchingWord": [{
"timestamp": 134053453,
"search": "Cazen Lee"
}, {
"timestamp": 134053455,
"search": "John D\oe"
}, {
"timestamp": 134053457,
"search": "wordwordword"
}]
}
{code}

In this situation, distinct(deviceId) will be 0 instead of 1 because the user 
mistyped "John Doe" as "John D\oe". If the device only reconnects to wifi two 
weeks later, after a vacation, a single row can grow to 100MB and the whole 
batch of logs is lost (parse error).

HIVE-11825 describes a similar situation; users can type arbitrary keywords that 
end up in the log.

I'm sure it's OK for the Jackson parser to reject this example by default. It's 
not standard (per the JSON specification), and this example is not a regular 
situation.

But I think that's a little harsh. If the user could set an option to handle it, 
it would be helpful.

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072802#comment-15072802
 ] 

Sean Owen commented on SPARK-12537:
---

According to the JSON spec at http://json.org/, "John D\oe" is not a typo; it 
denotes a 9-character string which includes a backslash. That is, only a few 
special characters can be escaped it seems. It's not valid to consider this the 
string "John Doe" since it isn't. You'd mangle "correct" strings like "Shrek \ 
Soundtrack" to "Shrek  Soundtrack". The consequences of the app problem aren't 
really relevant to Spark here; the app has generated badly formed data. I'd 
understand accepting it if there were no downsides but there are.
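For concreteness, here is a minimal standalone Scala sketch of the behavior described above, using Jackson directly. It illustrates Jackson's ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER parser feature, which the allowBackslashEscapingAnyCharacter option discussed in this issue appears to map onto; it is not code from the Spark patch.

{code}
import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.databind.ObjectMapper

val strict  = new ObjectMapper()
val lenient = new ObjectMapper()
  .configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, true)

val doc = """{"name": "John D\oe"}"""   // "\o" is not a legal JSON escape

// Default parser: rejects the unrecognized escape with a parse exception,
// which is what Spark surfaces as a corrupt record.
val strictResult = scala.util.Try(strict.readTree(doc))   // Failure(...)

// Lenient parser: accepts it, but the backslash is consumed, yielding "John Doe".
// That is exactly the silent mangling risk pointed out in the comment above.
val lenientName = lenient.readTree(doc).get("name").asText()
{code}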

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9505) DataFrames : Mysql JDBC not support column names with special characters

2015-12-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9505.
--
Resolution: Duplicate

> DataFrames : Mysql JDBC not support column names with special characters
> 
>
> Key: SPARK-9505
> URL: https://issues.apache.org/jira/browse/SPARK-9505
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0
>Reporter: Pangjiu
>
> Hi all,
> I had the above issue when connecting to a MySQL database through SQLContext. 
> If the MySQL table's column name contains special characters like # [ ] %, it 
> throws the exception: "You have an error in your SQL syntax".
> Below is the code:
> Class.forName("com.mysql.jdbc.Driver").newInstance()
> val url = "jdbc:mysql://localhost:3306/sakila?user=root=xxx"
> val driver = "com.mysql.jdbc.Driver"
> val sqlContext = new SQLContext(sc)
> val output = { sqlContext.load("jdbc", Map 
>   (
>   "url" -> url,
>   "driver" -> driver,
>   "dbtable" -> "(SELECT `ID`, `NAME%` 
> FROM `agent`) AS tableA "
>   )
>   )
> }
> Hope DataFrames via SQLContext can support special characters very soon, 
> as this has become a show-stopper now.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9505) DataFrames : Mysql JDBC not support column names with special characters

2015-12-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9505:
-
Priority: Major  (was: Blocker)

> DataFrames : Mysql JDBC not support column names with special characters
> 
>
> Key: SPARK-9505
> URL: https://issues.apache.org/jira/browse/SPARK-9505
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0
>Reporter: Pangjiu
>
> Hi all,
> I had the above issue when connecting to a MySQL database through SQLContext. 
> If the MySQL table's column name contains special characters like # [ ] %, it 
> throws the exception: "You have an error in your SQL syntax".
> Below is the code:
> Class.forName("com.mysql.jdbc.Driver").newInstance()
> val url = "jdbc:mysql://localhost:3306/sakila?user=root=xxx"
> val driver = "com.mysql.jdbc.Driver"
> val sqlContext = new SQLContext(sc)
> val output = { sqlContext.load("jdbc", Map 
>   (
>   "url" -> url,
>   "driver" -> driver,
>   "dbtable" -> "(SELECT `ID`, `NAME%` 
> FROM `agent`) AS tableA "
>   )
>   )
> }
> Hope DataFrames via SQLContext can support special characters very soon, 
> as this has become a show-stopper now.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12539) support writing bucketed table

2015-12-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072766#comment-15072766
 ] 

Apache Spark commented on SPARK-12539:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10498
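For readers following along, a hedged sketch of the kind of writer API this sub-task is working toward (the method names reflect the direction of the linked pull request and are an assumption here, not a released 1.x API): the output is hash-partitioned into a fixed number of buckets by a column and saved as a table.

{code}
// Assumes a SQLContext/HiveContext named sqlContext, as in the Spark shell.
val df = sqlContext.range(0, 1000).selectExpr("id AS user_id", "id % 7 AS day")

df.write
  .bucketBy(8, "user_id")      // 8 buckets, hashed on user_id (hypothetical column)
  .sortBy("day")               // optional sort within each bucket
  .saveAsTable("events_bucketed")
{code}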

> support writing bucketed table
> --
>
> Key: SPARK-12539
> URL: https://issues.apache.org/jira/browse/SPARK-12539
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-28 Thread Cazen Lee (JIRA)
Cazen Lee created SPARK-12537:
-

 Summary: Add option to accept quoting of all character backslash 
quoting mechanism
 Key: SPARK-12537
 URL: https://issues.apache.org/jira/browse/SPARK-12537
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.2
Reporter: Cazen Lee


We can provide an option that controls whether the JSON parser accepts backslash 
quoting of any character or not.

For example, if a JSON file contains an escape sequence that is not listed in the 
JSON backslash-quoting specification, the record comes back as a corrupt_record:

{code:title=JSON File|borderStyle=solid}
{"name": "Cazen Lee", "price": "$10"}
{"name": "John Doe", "price": "\$20"}
{"name": "Tracy", "price": "$10"}
{code}

{code}
scala> df.show
+--------------------+---------+-----+
|     _corrupt_record|     name|price|
+--------------------+---------+-----+
|                null|Cazen Lee|  $10|
|{"name": "John Do...|     null| null|
|                null|    Tracy|  $10|
+--------------------+---------+-----+
{code}

After applying this patch, we can enable the allowBackslashEscapingAnyCharacter 
option like below:

{code}
scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", "true").json("/user/Cazen/test/test2.txt")
df: org.apache.spark.sql.DataFrame = [name: string, price: string]

scala> df.show
+---------+-----+
|     name|price|
+---------+-----+
|Cazen Lee|  $10|
| John Doe|  $20|
|    Tracy|  $10|
+---------+-----+
{code}

This issue is similar to HIVE-11825 and HIVE-12717.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12537:


Assignee: Apache Spark

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12218) Invalid splitting of nested AND expressions in Data Source filter API

2015-12-28 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072901#comment-15072901
 ] 

Yin Huai commented on SPARK-12218:
--

Just a note: https://github.com/apache/spark/pull/10377 is a follow-up 
optimization for ORC. I only merged it into the master branch, and it will be 
released with 2.0.0 because it is not a bug fix.

> Invalid splitting of nested AND expressions in Data Source filter API
> -
>
> Key: SPARK-12218
> URL: https://issues.apache.org/jira/browse/SPARK-12218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Irakli Machabeli
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.5.3, 1.6.0, 2.0.0
>
>
> Two identical queries produce different results
> In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
> PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff'))").count()
> Out[2]: 18
> In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( 
> not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff')))").count()
> Out[3]: 28
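For reference, a small Scala sketch of why those two counts must agree: the two predicates are related by De Morgan's law, not(a and b) == (not a) or (not b), so any difference comes from how the filter is split and pushed down to the data source. This is an illustrative rewrite of the reporter's PySpark queries, not code from the fix.

{code}
import sqlContext.implicits._   // in the Spark shell; gives the $"col" syntax

val base = sqlContext.read.parquet("prp_enh1").filter($"LoanID" === 62231)

val nested = base.filter(!($"PaymentsReceived" === 0 &&
  $"ExplicitRoll".isin("PreviouslyPaidOff", "PreviouslyChargedOff"))).count()

val expanded = base.filter(!($"PaymentsReceived" === 0) ||
  !$"ExplicitRoll".isin("PreviouslyPaidOff", "PreviouslyChargedOff")).count()

// With the fix applied, nested == expanded.
{code}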



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12491) UDAF result differs in SQL if alias is used

2015-12-28 Thread Tristan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072902#comment-15072902
 ] 

Tristan commented on SPARK-12491:
-

REPL with logical plans:

{code:None|borderStyle=solid}
scala> import com.pipeline.spark._
import com.pipeline.spark._

scala> sqlContext.udf.register("gm", new GeometricMean)
res0: org.apache.spark.sql.expressions.UserDefinedAggregateFunction = 
com.pipeline.spark.GeometricMean@497031ea

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val ids = sqlContext.range(1, 20)
ids: org.apache.spark.sql.DataFrame = [id: bigint]

scala> ids.registerTempTable("ids")

scala> val df = sqlContext.sql("select id, id % 3 as group_id from ids")
df: org.apache.spark.sql.DataFrame = [id: bigint, group_id: bigint]

scala> df.registerTempTable("simple")

scala> val q = sqlContext.sql("select group_id, gm(id) from simple group by 
group_id")
q: org.apache.spark.sql.DataFrame = [group_id: bigint, _c1: double]

scala> q.explain(true)
== Parsed Logical Plan ==
'Aggregate ['group_id], [unresolvedalias('group_id),unresolvedalias('gm('id))]
 'UnresolvedRelation [simple], None

== Analyzed Logical Plan ==
group_id: bigint, _c1: double
Aggregate [group_id#1L], [group_id#1L,(GeometricMean(cast(id#0L as 
double)),mode=Complete,isDistinct=false) AS _c1#12]
 Subquery simple
  Project [id#0L,(id#0L % cast(3 as bigint)) AS group_id#1L]
   Subquery ids
LogicalRDD [id#0L], MapPartitionsRDD[3] at range at <console>:25

== Optimized Logical Plan ==
Aggregate [group_id#1L], [group_id#1L,(GeometricMean(cast(id#0L as 
double)),mode=Complete,isDistinct=false) AS _c1#12]
 Project [id#0L,(id#0L % 3) AS group_id#1L]
  LogicalRDD [id#0L], MapPartitionsRDD[3] at range at <console>:25

== Physical Plan ==
SortBasedAggregate(key=[group_id#1L], functions=[(GeometricMean(cast(id#0L as 
double)),mode=Final,isDistinct=false)], output=[group_id#1L,_c1#12])
 ConvertToSafe
  TungstenSort [group_id#1L ASC], false, 0
   TungstenExchange hashpartitioning(group_id#1L)
ConvertToUnsafe
 SortBasedAggregate(key=[group_id#1L], functions=[(GeometricMean(cast(id#0L 
as double)),mode=Partial,isDistinct=false)], 
output=[group_id#1L,count#14L,product#15])
  ConvertToSafe
   TungstenSort [group_id#1L ASC], false, 0
TungstenProject [id#0L,(id#0L % 3) AS group_id#1L]
 Scan PhysicalRDD[id#0L]

Code Generation: true

scala> q.show()
+--------+---+
|group_id|_c1|
+--------+---+
|       0|0.0|
|       1|0.0|
|       2|0.0|
+--------+---+


scala> val q2 = sqlContext.sql("select group_id, gm(id) as geomean from simple 
group by group_id")
q2: org.apache.spark.sql.DataFrame = [group_id: bigint, geomean: double]

scala> q2.explain(true)
== Parsed Logical Plan ==
'Aggregate ['group_id], [unresolvedalias('group_id),unresolvedalias('gm('id) AS 
geomean#19)]
 'UnresolvedRelation [simple], None

== Analyzed Logical Plan ==
group_id: bigint, geomean: double
Aggregate [group_id#1L], [group_id#1L,(GeometricMean(cast(id#0L as 
double)),mode=Complete,isDistinct=false) AS geomean#19]
 Subquery simple
  Project [id#0L,(id#0L % cast(3 as bigint)) AS group_id#1L]
   Subquery ids
LogicalRDD [id#0L], MapPartitionsRDD[3] at range at <console>:25

== Optimized Logical Plan ==
Aggregate [group_id#1L], [group_id#1L,(GeometricMean(cast(id#0L as 
double)),mode=Complete,isDistinct=false) AS geomean#19]
 Project [id#0L,(id#0L % 3) AS group_id#1L]
  LogicalRDD [id#0L], MapPartitionsRDD[3] at range at <console>:25

== Physical Plan ==
SortBasedAggregate(key=[group_id#1L], functions=[(GeometricMean(cast(id#0L as 
double)),mode=Final,isDistinct=false)], output=[group_id#1L,geomean#19])
 ConvertToSafe
  TungstenSort [group_id#1L ASC], false, 0
   TungstenExchange hashpartitioning(group_id#1L)
ConvertToUnsafe
 SortBasedAggregate(key=[group_id#1L], functions=[(GeometricMean(cast(id#0L 
as double)),mode=Partial,isDistinct=false)], 
output=[group_id#1L,count#30L,product#31])
  ConvertToSafe
   TungstenSort [group_id#1L ASC], false, 0
TungstenProject [id#0L,(id#0L % 3) AS group_id#1L]
 Scan PhysicalRDD[id#0L]

Code Generation: true

 
scala> q2.show()
+--------+-----------------+
|group_id|          geomean|
+--------+-----------------+
|       0|8.981385496571725|
|       1|7.301716979342118|
|       2|7.706253151292568|
+--------+-----------------+
{code}

And here is the UDAF spec:

{code:None|borderStyle=solid}
package com.pipeline.spark

import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._


class GeometricMean extends UserDefinedAggregateFunction {
  // This is the input fields for your aggregate function.
  def inputSchema: org.apache.spark.sql.types.StructType = 
StructType(StructField("value", DoubleType) :: Nil)

  // This is the internal fields you 
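  // keep for computing your aggregate (the listing above is truncated by the mailing-list archive).
  // [Editor's note] The sketch below is a hedged reconstruction of the standard
  // GeometricMean UDAF from the Databricks blog post referenced in this issue
  // (buffer fields "count" and "product", matching the count#/product# attributes
  // visible in the physical plan above); treat it as illustrative rather than the
  // reporter's exact code.
  def bufferSchema: StructType = StructType(
    StructField("count", LongType) ::
    StructField("product", DoubleType) :: Nil)

  // The type of the value returned by the aggregate.
  def dataType: DataType = DoubleType

  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L      // count
    buffer(1) = 1.0     // running product
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getAs[Long](0) + 1
    buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
    buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
  }

  // Geometric mean = product^(1/count).
  def evaluate(buffer: Row): Any =
    math.pow(buffer.getDouble(1), 1.0 / buffer.getLong(0))
}
{code}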

[jira] [Updated] (SPARK-12491) UDAF result differs in SQL if alias is used

2015-12-28 Thread Tristan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan updated SPARK-12491:

Description: 
Using the GeometricMean UDAF example 
(https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html),
 I found the following discrepancy in results:

{code}
scala> sqlContext.sql("select group_id, gm(id) from simple group by 
group_id").show()
+--------+---+
|group_id|_c1|
+--------+---+
|       0|0.0|
|       1|0.0|
|       2|0.0|
+--------+---+


scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple 
group by group_id").show()
+--------+-----------------+
|group_id|    GeometricMean|
+--------+-----------------+
|       0|8.981385496571725|
|       1|7.301716979342118|
|       2|7.706253151292568|
+--------+-----------------+
{code}

  was:
Using the GeometricMean UDAF example 
(https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html),
 I found the following discrepancy in results:

scala> sqlContext.sql("select group_id, gm(id) from simple group by 
group_id").show()
++---+
|group_id|_c1|
++---+
|   0|0.0|
|   1|0.0|
|   2|0.0|
++---+


scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple 
group by group_id").show()
++-+
|group_id|GeometricMean|
++-+
|   0|8.981385496571725|
|   1|7.301716979342118|
|   2|7.706253151292568|
++-+


> UDAF result differs in SQL if alias is used
> ---
>
> Key: SPARK-12491
> URL: https://issues.apache.org/jira/browse/SPARK-12491
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Tristan
>
> Using the GeometricMean UDAF example 
> (https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html),
>  I found the following discrepancy in results:
> {code}
> scala> sqlContext.sql("select group_id, gm(id) from simple group by 
> group_id").show()
> ++---+
> |group_id|_c1|
> ++---+
> |   0|0.0|
> |   1|0.0|
> |   2|0.0|
> ++---+
> scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple 
> group by group_id").show()
> ++-+
> |group_id|GeometricMean|
> ++-+
> |   0|8.981385496571725|
> |   1|7.301716979342118|
> |   2|7.706253151292568|
> ++-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10873) can't sort columns on history page

2015-12-28 Thread Zhuo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15062794#comment-15062794
 ] 

Zhuo Liu edited comment on SPARK-10873 at 12/28/15 3:51 PM:


Hi Alex, I started working on this, hopefully I will have a pull request up 
next weeks. Thanks [~ajbozarth]


was (Author: zhuoliu):
Hi Alex, I started working on this, hopefully I will have a pull request up 
next week. Thanks [~ajbozarth]

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Thomas Graves
>Assignee: Zhuo Liu
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11185) Add more task metrics to the "all Stages Page"

2015-12-28 Thread Derek Dagit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072870#comment-15072870
 ] 

Derek Dagit commented on SPARK-11185:
-

[~pwendell], [~kayousterhout], [~andrewor14], [~irashid]

Let's figure out how we might address this use case. Do we think there is a way 
to present this information without cluttering the UI?



> Add more task metrics to the "all Stages Page"
> --
>
> Key: SPARK-11185
> URL: https://issues.apache.org/jira/browse/SPARK-11185
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> The "All Stages Page" on the History page could have more information about 
> the stage to allow users to quickly see which stage potentially has long 
> tasks. Indicator or skewed data or bad nodes, etc.  
> Currently to get this information you have to click on every stage.  If you 
> have a hundreds of stages this can be very cumbersome.
> For instance pulling out the max task time and the median to the all stages 
> page would allow me to see the difference and if the max task time is much 
> greater then the median this stage may have had tasks with problems.  
> We already had some discussion about this under 
> https://github.com/apache/spark/pull/9051



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6

2015-12-28 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073145#comment-15073145
 ] 

Tim Preece commented on SPARK-12319:


Hi,
The failing test is already checked in. It is:
"org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input 
with reordering"

The test only explicitly fails on Big Endian platforms. This is because an 
integer takes an 8-byte slot in the UnsafeRow. When the data corruption occurs, 
the BE integer ends up with the wrong value. I added print statements which 
show the data corruption on Little Endian as well; it just happens not to 
affect the value of the LE integer, since the LE integer is in the other 
4 bytes of the 8-byte slot.
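To make the "other 4 bytes of the 8-byte slot" point concrete, here is a minimal standalone Scala sketch (not Spark code) of how reading a 4-byte int out of an 8-byte slot is endian-sensitive:

{code}
import java.nio.{ByteBuffer, ByteOrder}

// Store the long value 9 in an 8-byte slot, then read the first 4 bytes as an int.
val le = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN)
le.putLong(0, 9L)
val leInt = le.getInt(0)   // 9: on LE the low-order bytes sit at offset 0

val be = ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN)
be.putLong(0, 9L)
val beInt = be.getInt(0)   // 0: on BE the value lives in the other 4 bytes (offset 4)
{code}

So corruption in one half of the slot can change the observed value on one endianness while remaining invisible on the other, which matches the "one, 1" vs "one, 9" symptom described in the issue.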

> Address endian specific problems surfaced in 1.6
> 
>
> Key: SPARK-12319
> URL: https://issues.apache.org/jira/browse/SPARK-12319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Problems apparent on BE, LE could be impacted too
>Reporter: Adam Roberts
>Priority: Critical
>
> JIRA to cover endian specific problems - since testing 1.6 I've noticed 
> problems with DataFrames on BE platforms, e.g. 
> https://issues.apache.org/jira/browse/SPARK-9858
> [~joshrosen] [~yhuai]
> Current progress: using com.google.common.io.LittleEndianDataInputStream and 
> com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer 
> fixes three test failures in ExchangeCoordinatorSuite but I'm concerned 
> around performance/wider functional implications
> "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input 
> with reordering" fails as we expect "one, 1" but instead get "one, 9" - we 
> believe the issue lies within BitSetMethods.java, specifically around: return 
> (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12441) Fixing missingInput in all Logical/Physical operators

2015-12-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12441.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10393
[https://github.com/apache/spark/pull/10393]

> Fixing missingInput in all Logical/Physical operators
> -
>
> Key: SPARK-12441
> URL: https://issues.apache.org/jira/browse/SPARK-12441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Xiao Li
> Fix For: 2.0.0
>
>
> The value of missingInput in 
> Generate/MapPartitions/AppendColumns/MapGroups/CoGroup is incorrect. 
> {code}
> val df = Seq((1, "a b c"), (2, "a b"), (3, "a")).toDF("number", "letters")
> val df2 =
>   df.explode('letters) {
> case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq
>   }
> df2.explain(true)
> {code}
> {code}
> == Parsed Logical Plan ==
> 'Generate UserDefinedGenerator('letters), true, false, None
> +- Project [_1#0 AS number#2,_2#1 AS letters#3]
>+- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]]
> == Analyzed Logical Plan ==
> number: int, letters: string, _1: string
> Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8]
> +- Project [_1#0 AS number#2,_2#1 AS letters#3]
>+- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]]
> == Optimized Logical Plan ==
> Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8]
> +- LocalRelation [number#2,letters#3], [[1,a b c],[2,a b],[3,a]]
> == Physical Plan ==
> !Generate UserDefinedGenerator(letters#3), true, false, 
> [number#2,letters#3,_1#8]
> +- LocalTableScan [number#2,letters#3], [[1,a b c],[2,a b],[3,a]]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12494) Array out of bound Exception in KMeans Yarn Mode

2015-12-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12494:
--
Priority: Major  (was: Blocker)

> Array out of bound Exception in KMeans Yarn Mode
> 
>
> Key: SPARK-12494
> URL: https://issues.apache.org/jira/browse/SPARK-12494
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Anandraj
>
> Hi,
> I am trying to run k-means clustering on word2vec data. I tested the code in 
> local mode with a small dataset and clustering completes fine. But when I run 
> the same data in YARN cluster mode, it fails with the error below.
> 15/12/23 00:49:01 ERROR yarn.ApplicationMaster: User class threw exception: 
> java.lang.ArrayIndexOutOfBoundsException: 0
> java.lang.ArrayIndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.WrappedArray$ofRef.apply(WrappedArray.scala:126)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:377)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:377)
>   at scala.Array$.tabulate(Array.scala:331)
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:377)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:249)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:213)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:520)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:531)
>   at 
> com.tempurer.intelligence.adhocjobs.spark.kMeans$delayedInit$body.apply(kMeans.scala:24)
>   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>   at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>   at scala.App$class.main(App.scala:71)
>   at 
> com.tempurer.intelligence.adhocjobs.spark.kMeans$.main(kMeans.scala:9)
>   at com.tempurer.intelligence.adhocjobs.spark.kMeans.main(kMeans.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)
> 15/12/23 00:49:01 INFO yarn.ApplicationMaster: Final app status: FAILED, 
> exitCode: 15, (reason: User class threw exception: 
> java.lang.ArrayIndexOutOfBoundsException: 0)
> In local mode with large data (2,375,849 vectors of size 200), the first 
> sampling stage completes. The second stage suspends execution without any error 
> message, and no active execution is in progress. I could only see the warning 
> messages below:
> 15/12/23 01:24:13 INFO TaskSetManager: Finished task 9.0 in stage 1.0 (TID 
> 37) in 29 ms on localhost (4/34)
> 15/12/23 01:24:14 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:14 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 2 total executors!
> 15/12/23 01:24:15 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:15 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 3 total executors!
> 15/12/23 01:24:16 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:16 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 4 total executors!
> 15/12/23 01:24:17 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:17 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 5 total executors!
> 15/12/23 01:24:18 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:18 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 6 total executors!
> 15/12/23 01:24:19 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:19 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 7 total executors!
> 15/12/23 01:24:20 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:20 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 8 total 

[jira] [Assigned] (SPARK-12486) Executors are not always terminated successfully by the worker.

2015-12-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12486:


Assignee: Apache Spark

> Executors are not always terminated successfully by the worker.
> ---
>
> Key: SPARK-12486
> URL: https://issues.apache.org/jira/browse/SPARK-12486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Nong Li
>Assignee: Apache Spark
>
> There are cases when the executor is not killed successfully by the worker. 
> One way this can happen is if the executor is in a bad state, fails to 
> heartbeat and the master tells the worker to kill the executor. The executor 
> is in such a bad state that the kill request is ignored. This seems to be 
> able to happen if the executor is in heavy GC.
> The cause of this is that the Process.destroy() API is not forceful enough. 
> In Java8, a new API, destroyForcibly() was added. We should use that if 
> available.
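A minimal sketch of the idea described above (an illustration of the Java 8 API, not the actual worker change): try the polite destroy first, and escalate to destroyForcibly if the process is still alive after a grace period. The helper name and grace period are assumptions.

{code}
import java.util.concurrent.TimeUnit

// Hypothetical helper; "process" is the executor's java.lang.Process handle.
def killForcefully(process: Process, graceMillis: Long = 5000L): Unit = {
  process.destroy()                                        // ask nicely first
  if (!process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
    process.destroyForcibly()                              // Java 8+: forceful termination
    process.waitFor()                                      // block until it is really gone
  }
}
{code}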



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12491) UDAF result differs in SQL if alias is used

2015-12-28 Thread Tristan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073191#comment-15073191
 ] 

Tristan commented on SPARK-12491:
-

1.6.0rc4 works in cluster mode also. Thank you!

> UDAF result differs in SQL if alias is used
> ---
>
> Key: SPARK-12491
> URL: https://issues.apache.org/jira/browse/SPARK-12491
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Tristan
> Attachments: UDAF_GM.zip
>
>
> Using the GeometricMean UDAF example 
> (https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html),
>  I found the following discrepancy in results:
> {code}
> scala> sqlContext.sql("select group_id, gm(id) from simple group by 
> group_id").show()
> ++---+
> |group_id|_c1|
> ++---+
> |   0|0.0|
> |   1|0.0|
> |   2|0.0|
> ++---+
> scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple 
> group by group_id").show()
> ++-+
> |group_id|GeometricMean|
> ++-+
> |   0|8.981385496571725|
> |   1|7.301716979342118|
> |   2|7.706253151292568|
> ++-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly

2015-12-28 Thread Jonathan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073193#comment-15073193
 ] 

Jonathan Kelly commented on SPARK-10789:


Here's another patch that can be applied to v1.6.0: 
https://issues.apache.org/jira/secure/attachment/12779704/SPARK-10789.v1.6.0.diff

> Cluster mode SparkSubmit classpath only includes Spark assembly
> ---
>
> Key: SPARK-10789
> URL: https://issues.apache.org/jira/browse/SPARK-10789
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Jonathan Kelly
> Attachments: SPARK-10789.diff, SPARK-10789.v1.6.0.diff
>
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that 
> gets launched only includes the Spark assembly and not 
> spark.driver.extraClassPath. This is of course by design, since the driver 
> actually runs on the cluster and not inside the SparkSubmit process.
> However, if the SparkSubmit process, minimal as it may be, needs any extra 
> libraries that are not part of the Spark assembly, there is no good way to 
> include them. (I say "no good way" because including them in the 
> SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
> include them, but this is not acceptable because this environment variable 
> has long been deprecated, and it prevents the use of 
> spark.driver.extraClassPath.)
> An example of when this matters is on Amazon EMR when using an S3 path for 
> the application JAR and running in yarn-cluster mode. The SparkSubmit process 
> needs the EmrFileSystem implementation and its dependencies in the classpath 
> in order to download the application JAR from S3, so it fails with a 
> ClassNotFoundException. (EMR currently gets around this by setting 
> SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
> classpath whether it's client mode or cluster mode, and this seems to work, 
> but I don't know if there is any downside to this.
> Example that fails on emr-4.0.0 (if you switch to setting 
> spark.(driver,executor).extraClassPath instead of SPARK_CLASSPATH): 
> spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar 
> s3://my-bucket/word-count-input.txt
> Resulting Exception:
> Exception in thread "main" java.lang.RuntimeException: 
> java.lang.ClassNotFoundException: Class 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>   at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
>   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
>   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at 

[jira] [Updated] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly

2015-12-28 Thread Jonathan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Kelly updated SPARK-10789:
---
Attachment: SPARK-10789.v1.6.0.diff

Here's another patch that can be applied to v1.6.0.

> Cluster mode SparkSubmit classpath only includes Spark assembly
> ---
>
> Key: SPARK-10789
> URL: https://issues.apache.org/jira/browse/SPARK-10789
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Jonathan Kelly
> Attachments: SPARK-10789.diff, SPARK-10789.v1.6.0.diff
>
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that 
> gets launched only includes the Spark assembly and not 
> spark.driver.extraClassPath. This is of course by design, since the driver 
> actually runs on the cluster and not inside the SparkSubmit process.
> However, if the SparkSubmit process, minimal as it may be, needs any extra 
> libraries that are not part of the Spark assembly, there is no good way to 
> include them. (I say "no good way" because including them in the 
> SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
> include them, but this is not acceptable because this environment variable 
> has long been deprecated, and it prevents the use of 
> spark.driver.extraClassPath.)
> An example of when this matters is on Amazon EMR when using an S3 path for 
> the application JAR and running in yarn-cluster mode. The SparkSubmit process 
> needs the EmrFileSystem implementation and its dependencies in the classpath 
> in order to download the application JAR from S3, so it fails with a 
> ClassNotFoundException. (EMR currently gets around this by setting 
> SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
> classpath whether it's client mode or cluster mode, and this seems to work, 
> but I don't know if there is any downside to this.
> Example that fails on emr-4.0.0 (if you switch to setting 
> spark.(driver,executor).extraClassPath instead of SPARK_CLASSPATH): 
> spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar 
> s3://my-bucket/word-count-input.txt
> Resulting Exception:
> Exception in thread "main" java.lang.RuntimeException: 
> java.lang.ClassNotFoundException: Class 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>   at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
>   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
>   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at 

[jira] [Issue Comment Deleted] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly

2015-12-28 Thread Jonathan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Kelly updated SPARK-10789:
---
Comment: was deleted

(was: Here's another patch that can be applied to v1.6.0.)

> Cluster mode SparkSubmit classpath only includes Spark assembly
> ---
>
> Key: SPARK-10789
> URL: https://issues.apache.org/jira/browse/SPARK-10789
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Jonathan Kelly
> Attachments: SPARK-10789.diff, SPARK-10789.v1.6.0.diff
>
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that 
> gets launched only includes the Spark assembly and not 
> spark.driver.extraClassPath. This is of course by design, since the driver 
> actually runs on the cluster and not inside the SparkSubmit process.
> However, if the SparkSubmit process, minimal as it may be, needs any extra 
> libraries that are not part of the Spark assembly, there is no good way to 
> include them. (I say "no good way" because including them in the 
> SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
> include them, but this is not acceptable because this environment variable 
> has long been deprecated, and it prevents the use of 
> spark.driver.extraClassPath.)
> An example of when this matters is on Amazon EMR when using an S3 path for 
> the application JAR and running in yarn-cluster mode. The SparkSubmit process 
> needs the EmrFileSystem implementation and its dependencies in the classpath 
> in order to download the application JAR from S3, so it fails with a 
> ClassNotFoundException. (EMR currently gets around this by setting 
> SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
> classpath whether it's client mode or cluster mode, and this seems to work, 
> but I don't know if there is any downside to this.
> Example that fails on emr-4.0.0 (if you switch to setting 
> spark.(driver,executor).extraClassPath instead of SPARK_CLASSPATH): 
> spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar 
> s3://my-bucket/word-count-input.txt
> Resulting Exception:
> Exception in thread "main" java.lang.RuntimeException: 
> java.lang.ClassNotFoundException: Class 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>   at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
>   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
>   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 

[jira] [Commented] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly

2015-12-28 Thread Jonathan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073194#comment-15073194
 ] 

Jonathan Kelly commented on SPARK-10789:


An alternative to this change could be to have another setting that allows you 
to configure the SparkSubmit classpath separately from the driver/executor 
classpaths. That way we wouldn't necessarily need to set the SparkSubmit 
classpath to include *all* of the libraries set in the driver classpath, which 
is the behavior this patch currently causes.

Does anybody in the community have any thoughts/opinions on either of these 
approaches?

> Cluster mode SparkSubmit classpath only includes Spark assembly
> ---
>
> Key: SPARK-10789
> URL: https://issues.apache.org/jira/browse/SPARK-10789
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Jonathan Kelly
> Attachments: SPARK-10789.diff, SPARK-10789.v1.6.0.diff
>
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that 
> gets launched only includes the Spark assembly and not 
> spark.driver.extraClassPath. This is of course by design, since the driver 
> actually runs on the cluster and not inside the SparkSubmit process.
> However, if the SparkSubmit process, minimal as it may be, needs any extra 
> libraries that are not part of the Spark assembly, there is no good way to 
> include them. (I say "no good way" because including them in the 
> SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
> include them, but this is not acceptable because this environment variable 
> has long been deprecated, and it prevents the use of 
> spark.driver.extraClassPath.)
> An example of when this matters is on Amazon EMR when using an S3 path for 
> the application JAR and running in yarn-cluster mode. The SparkSubmit process 
> needs the EmrFileSystem implementation and its dependencies in the classpath 
> in order to download the application JAR from S3, so it fails with a 
> ClassNotFoundException. (EMR currently gets around this by setting 
> SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
> classpath whether it's client mode or cluster mode, and this seems to work, 
> but I don't know if there is any downside to this.
> Example that fails on emr-4.0.0 (if you switch to setting 
> spark.(driver,executor).extraClassPath instead of SPARK_CLASSPATH): 
> spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar 
> s3://my-bucket/word-count-input.txt
> Resulting Exception:
> Exception in thread "main" java.lang.RuntimeException: 
> java.lang.ClassNotFoundException: Class 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>   at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
>   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
>   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> 

[jira] [Updated] (SPARK-12494) Array out of bound Exception in KMeans Yarn Mode

2015-12-28 Thread Anandraj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anandraj updated SPARK-12494:
-
Attachment: vectors1.tar.gz

Sample data for reproducing the K-Means error in yarn cluster mode. Program 
works in local mode but fails in yarn cluster mode. 
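
For context, a minimal sketch of the kind of driver that exercises this code 
path. This is an illustration only (not the reporter's actual kMeans.scala), and 
it assumes the attached vectors are stored as one whitespace-separated line of 
doubles per record:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-repro"))
    // Parse one vector per line; adjust to the actual format of vectors1.tar.gz.
    val data = sc.textFile(args(0))
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
      .cache()
    // k = 10 and maxIterations = 10 are placeholders, not the reporter's settings.
    val model = KMeans.train(data, 10, 10)
    model.clusterCenters.foreach(println)
    sc.stop()
  }
}
{code}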

> Array out of bound Exception in KMeans Yarn Mode
> 
>
> Key: SPARK-12494
> URL: https://issues.apache.org/jira/browse/SPARK-12494
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Anandraj
> Attachments: vectors1.tar.gz
>
>
> Hi,
> I am trying to run k-means clustering on word2vec data. I tested the code in 
> local mode with small data and clustering completes fine. But when I run the 
> same data in YARN cluster mode, it fails with the error below.
> 15/12/23 00:49:01 ERROR yarn.ApplicationMaster: User class threw exception: 
> java.lang.ArrayIndexOutOfBoundsException: 0
> java.lang.ArrayIndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.WrappedArray$ofRef.apply(WrappedArray.scala:126)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:377)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:377)
>   at scala.Array$.tabulate(Array.scala:331)
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:377)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:249)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:213)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:520)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:531)
>   at 
> com.tempurer.intelligence.adhocjobs.spark.kMeans$delayedInit$body.apply(kMeans.scala:24)
>   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>   at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>   at scala.App$class.main(App.scala:71)
>   at 
> com.tempurer.intelligence.adhocjobs.spark.kMeans$.main(kMeans.scala:9)
>   at com.tempurer.intelligence.adhocjobs.spark.kMeans.main(kMeans.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)
> 15/12/23 00:49:01 INFO yarn.ApplicationMaster: Final app status: FAILED, 
> exitCode: 15, (reason: User class threw exception: 
> java.lang.ArrayIndexOutOfBoundsException: 0)
> In local mode with large data (2,375,849 vectors of size 200), the first 
> sampling stage completes. The second stage suspends execution without any 
> error message and with no active execution in progress. I could only see the 
> warning messages below:
> 15/12/23 01:24:13 INFO TaskSetManager: Finished task 9.0 in stage 1.0 (TID 
> 37) in 29 ms on localhost (4/34)
> 15/12/23 01:24:14 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:14 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 2 total executors!
> 15/12/23 01:24:15 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:15 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 3 total executors!
> 15/12/23 01:24:16 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:16 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 4 total executors!
> 15/12/23 01:24:17 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:17 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 5 total executors!
> 15/12/23 01:24:18 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:18 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 6 total executors!
> 15/12/23 01:24:19 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:19 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 7 total executors!
> 15/12/23 01:24:20 WARN SparkContext: Requesting executors is 

[jira] [Comment Edited] (SPARK-12494) Array out of bound Exception in KMeans Yarn Mode

2015-12-28 Thread Anandraj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073246#comment-15073246
 ] 

Anandraj edited comment on SPARK-12494 at 12/28/15 11:11 PM:
-

I couldn't reply over the Christmas break. Please find the sample data 
attached. 

vectors1.tar.gz -> Sample data for reproducing the K-Means error in yarn 
cluster mode. Program works in local mode but fails in yarn cluster mode. 


was (Author: anandr...@gmail.com):
Sample data for reproducing the K-Means error in yarn cluster mode. Program 
works in local mode but fails in yarn cluster mode. 

> Array out of bound Exception in KMeans Yarn Mode
> 
>
> Key: SPARK-12494
> URL: https://issues.apache.org/jira/browse/SPARK-12494
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Anandraj
> Attachments: vectors1.tar.gz
>
>
> Hi,
> I am trying to run k-means clustering on word2vec data. I tested the code in 
> local mode with small data and clustering completes fine. But when I run the 
> same data in YARN cluster mode, it fails with the error below.
> 15/12/23 00:49:01 ERROR yarn.ApplicationMaster: User class threw exception: 
> java.lang.ArrayIndexOutOfBoundsException: 0
> java.lang.ArrayIndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.WrappedArray$ofRef.apply(WrappedArray.scala:126)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:377)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:377)
>   at scala.Array$.tabulate(Array.scala:331)
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:377)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:249)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:213)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:520)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:531)
>   at 
> com.tempurer.intelligence.adhocjobs.spark.kMeans$delayedInit$body.apply(kMeans.scala:24)
>   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>   at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>   at scala.App$class.main(App.scala:71)
>   at 
> com.tempurer.intelligence.adhocjobs.spark.kMeans$.main(kMeans.scala:9)
>   at com.tempurer.intelligence.adhocjobs.spark.kMeans.main(kMeans.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)
> 15/12/23 00:49:01 INFO yarn.ApplicationMaster: Final app status: FAILED, 
> exitCode: 15, (reason: User class threw exception: 
> java.lang.ArrayIndexOutOfBoundsException: 0)
> In local mode with large data (2,375,849 vectors of size 200), the first 
> sampling stage completes. The second stage suspends execution without any 
> error message and with no active execution in progress. I could only see the 
> warning messages below:
> 15/12/23 01:24:13 INFO TaskSetManager: Finished task 9.0 in stage 1.0 (TID 
> 37) in 29 ms on localhost (4/34)
> 15/12/23 01:24:14 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:14 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 2 total executors!
> 15/12/23 01:24:15 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:15 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 3 total executors!
> 15/12/23 01:24:16 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:16 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 4 total executors!
> 15/12/23 01:24:17 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:17 WARN ExecutorAllocationManager: Unable to reach the cluster 
> manager to request 5 total executors!
> 15/12/23 01:24:18 WARN SparkContext: Requesting executors is only supported 
> in coarse-grained mode
> 15/12/23 01:24:18 WARN ExecutorAllocationManager: Unable to reach the cluster 

[jira] [Assigned] (SPARK-12525) Fix compiler warnings in Kinesis ASL module due to @transient annotations

2015-12-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-12525:
--

Assignee: Josh Rosen  (was: Apache Spark)

> Fix compiler warnings in Kinesis ASL module due to @transient annotations
> -
>
> Key: SPARK-12525
> URL: https://issues.apache.org/jira/browse/SPARK-12525
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> The Scala 2.11 SBT build currently fails for Spark 1.6.0 and master due to 
> warnings about the {{@transient}} annotation:
> {code}
> [error] [warn] 
> /Users/joshrosen/Documents/spark/extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala:73:
>  no valid targets for annotation on value sc - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] [warn] @transient sc: SparkContext,
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12525) Fix compiler warnings in Kinesis ASL module due to @transient annotations

2015-12-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-12525.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10479
[https://github.com/apache/spark/pull/10479]

> Fix compiler warnings in Kinesis ASL module due to @transient annotations
> -
>
> Key: SPARK-12525
> URL: https://issues.apache.org/jira/browse/SPARK-12525
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> The Scala 2.11 SBT build currently fails for Spark 1.6.0 and master due to 
> warnings about the {{@transient}} annotation:
> {code}
> [error] [warn] 
> /Users/joshrosen/Documents/spark/extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala:73:
>  no valid targets for annotation on value sc - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] [warn] @transient sc: SparkContext,
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs

2015-12-28 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073236#comment-15073236
 ] 

Ilya Ganelin commented on SPARK-12488:
--

I'll submit a dataset that causes this when I have a moment. Thanks!

Thank you,
Ilya Ganelin

> LDA describeTopics() Generates Invalid Term IDs
> ---
>
> Key: SPARK-12488
> URL: https://issues.apache.org/jira/browse/SPARK-12488
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Ilya Ganelin
>
> When running the LDA model and using the describeTopics function, invalid 
> values appear in the returned term ID list:
> The example below generates 10 topics on a data set with a vocabulary of 685.
> {code}
> // Set LDA parameters
> val numTopics = 10
> val lda = new LDA().setK(numTopics).setMaxIterations(10)
> val ldaModel = lda.run(docTermVector)
> val distModel = 
> ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted.reverse
> res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
> 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
> 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
> 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
> 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
> 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
> 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 
> 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 
> 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 
> 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 
> 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53...
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted
> res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
> -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
> -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
> -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
> -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, 
> -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, 
> -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, 
> -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, 
> -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 
> 38, 39, 40, 41, 42, 43, 44, 45, 4...
> {code}
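
As a quick diagnostic for whether a model is affected, one could check the 
returned term IDs against the vocabulary size (685 in this report). This is only 
a sketch, assuming {{ldaModel}} is the model built above:

{code}
// describeTopics() returns Array[(Array[Int], Array[Double])]:
// (termIndices, termWeights) per topic. Every index should lie in [0, vocabSize).
val vocabSize = 685
val invalidIds = ldaModel.describeTopics()
  .flatMap { case (termIds, _) => termIds }
  .filter(id => id < 0 || id >= vocabSize)
println(s"${invalidIds.length} term IDs fall outside [0, $vocabSize)")
{code}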



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12489) Fix minor issues found by Findbugs

2015-12-28 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-12489.
--
Resolution: Fixed
  Assignee: Shixiong Zhu

> Fix minor issues found by Findbugs
> --
>
> Key: SPARK-12489
> URL: https://issues.apache.org/jira/browse/SPARK-12489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> I just used FindBugs to scan the code and fixed some real issues:
> 1. Close `java.sql.Statement`
> 2. Fix incorrect `asInstanceOf`.
> 3. Remove unnecessary `synchronized` and `ReentrantLock`.
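
For readers unfamiliar with item 1, a small illustration (not the actual patch) 
of closing a java.sql.Statement deterministically:

{code}
import java.sql.Connection

// Close the Statement in a finally block so it is released even if
// executeUpdate throws; `conn` is assumed to be an already-open Connection.
def runUpdate(conn: Connection, sql: String): Int = {
  val stmt = conn.createStatement()
  try {
    stmt.executeUpdate(sql)
  } finally {
    stmt.close()
  }
}
{code}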



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12489) Fix minor issues found by Findbugs

2015-12-28 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12489:
-
Affects Version/s: 1.6.0

> Fix minor issues found by Findbugs
> --
>
> Key: SPARK-12489
> URL: https://issues.apache.org/jira/browse/SPARK-12489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> I just used FindBugs to scan the code and fixed some real issues:
> 1. Close `java.sql.Statement`
> 2. Fix incorrect `asInstanceOf`.
> 3. Remove unnecessary `synchronized` and `ReentrantLock`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12489) Fix minor issues found by Findbugs

2015-12-28 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12489:
-
Fix Version/s: 2.0.0
   1.6.1

> Fix minor issues found by Findbugs
> --
>
> Key: SPARK-12489
> URL: https://issues.apache.org/jira/browse/SPARK-12489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> I just used FindBugs to scan the code and fixed some real issues:
> 1. Close `java.sql.Statement`
> 2. Fix incorrect `asInstanceOf`.
> 3. Remove unnecessary `synchronized` and `ReentrantLock`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12491) UDAF result differs in SQL if alias is used

2015-12-28 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-12491:
--
Attachment: UDAF_GM.zip

Code for the attached .jar.

> UDAF result differs in SQL if alias is used
> ---
>
> Key: SPARK-12491
> URL: https://issues.apache.org/jira/browse/SPARK-12491
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Tristan
> Attachments: UDAF_GM.zip
>
>
> Using the GeometricMean UDAF example 
> (https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html),
>  I found the following discrepancy in results:
> {code}
> scala> sqlContext.sql("select group_id, gm(id) from simple group by 
> group_id").show()
> ++---+
> |group_id|_c1|
> ++---+
> |   0|0.0|
> |   1|0.0|
> |   2|0.0|
> ++---+
> scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple 
> group by group_id").show()
> ++-+
> |group_id|GeometricMean|
> ++-+
> |   0|8.981385496571725|
> |   1|7.301716979342118|
> |   2|7.706253151292568|
> ++-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2015-12-28 Thread Mario Briggs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073035#comment-15073035
 ] 

Mario Briggs commented on SPARK-12177:
--

On the issue of the large amount of code duplicated from the original, one 
thought was whether we need to upgrade the older Receiver-based approach to the 
new Consumer API at all. The Direct approach has so many benefits over the 
Receiver-based approach, and I can't think of a drawback, so one could argue 
that we don't upgrade the latter: it stays on the older Kafka consumer API and 
gets deprecated over a long period of time. Thoughts?

If we go that way, there is only trivial overlap of code between the original 
and the new consumer implementation. The public API signatures are different 
and do not clash, so they can be added to the existing KafkaUtils class.
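
For reference, a minimal sketch of the plain Kafka 0.9 "new consumer" API that 
this ticket targets, shown independently of Spark; the broker address, group id 
and topic name are placeholders:

{code}
import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("group.id", "example-group")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

// The new consumer handles group coordination itself; no ZooKeeper client is needed.
val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Arrays.asList("example-topic"))
val records = consumer.poll(1000L)   // ConsumerRecords[String, String]
println(s"fetched ${records.count()} records")
consumer.close()
{code}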

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has been released and introduces a new consumer API that is not 
> compatible with the old one, so I added support for the new consumer API. I 
> made separate classes in package org.apache.spark.streaming.kafka.v09 with the 
> changed API, and I did not remove the old classes, for backward compatibility: 
> users will not need to change their old Spark applications when they upgrade 
> to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-12-28 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073036#comment-15073036
 ] 

Peng Cheng commented on SPARK-10625:


I think the pull request has been merged. Can it be closed now?

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by 
> adding new objects to the connection properties, which are then reused by 
> Spark and shipped to the workers. When some of these new objects are not 
> serializable, this triggers an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrates the problem by 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }
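
For anyone hitting this before a fix lands, one defensive sketch (not the 
approach taken in the pull request) is to hand Spark a copy of the properties 
that keeps only string-valued entries, which drops objects such as the driver 
instance above:

{code}
import java.util.Properties
import scala.collection.JavaConversions._

// Properties.stringPropertyNames() only returns keys whose key and value are
// both Strings, so arbitrary objects stashed in the map are skipped.
def stringOnlyCopy(info: Properties): Properties = {
  val out = new Properties()
  for (key <- info.stringPropertyNames()) {
    out.setProperty(key, info.getProperty(key))
  }
  out
}
{code}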



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-12-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073046#comment-15073046
 ] 

Sean Owen commented on SPARK-10625:
---

It has not been merged. https://github.com/apache/spark/pull/8785

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by 
> adding new objects to the connection properties, which are then reused by 
> Spark and shipped to the workers. When some of these new objects are not 
> serializable, this triggers an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrates the problem by 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12527) Add private val after @transient for kinesis-asl module

2015-12-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-12527:
--

Assignee: Josh Rosen  (was: Apache Spark)

> Add private val after @transient for kinesis-asl module
> ---
>
> Key: SPARK-12527
> URL: https://issues.apache.org/jira/browse/SPARK-12527
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Josh Rosen
>
> In the SBT build using Scala 2.11, the following warnings were reported, which 
> resulted in a build failure 
> (https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-branch-1.6-COMPILE-SBT-SCALA-2.11/3/consoleFull):
> {code}
> [error] [warn] 
> /dev/shm/spark-workspaces/8/extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisInputDStream.scala:33:
>  no valid targets for annotation on  value _ssc - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] [warn] @transient _ssc: StreamingContext,
> [error] [warn]
> [error] [warn] 
> /dev/shm/spark-workspaces/8/extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala:73:
>  no valid targets for annotation   on value sc - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] [warn] @transient sc: SparkContext,
> [error] [warn]
> [error] [warn] 
> /dev/shm/spark-workspaces/8/extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala:76:
>  no valid targets for annotation   on value blockIds - it is discarded 
> unused. You may specify targets with meta-annotations, e.g. @(transient 
> @param)
> [error] [warn] @transient blockIds: Array[BlockId],
> [error] [warn]
> [error] [warn] 
> /dev/shm/spark-workspaces/8/extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala:78:
>  no valid targets for annotation   on value isBlockIdValid - it is discarded 
> unused. You may specify targets with meta-annotations, e.g. @(transient 
> @param)
> [error] [warn] @transient isBlockIdValid: Array[Boolean] = Array.empty,
> {code}
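
For readers unfamiliar with the warning, a small illustration (not the actual 
Spark patch) of the pattern: under Scala 2.11, {{@transient}} on a bare 
constructor parameter has no field to attach to and is discarded, while 
declaring the parameter as a private val, as this issue's title describes, gives 
the annotation a valid target. The class and field types below are placeholders:

{code}
import org.apache.spark.SparkContext

// With bare parameters (e.g. `@transient sc: SparkContext`) Scala 2.11 emits
// the warning above; adding `private val` turns each parameter into a field
// that @transient can legitimately annotate.
class ExampleBlockHolder(
    @transient private val sc: SparkContext,
    @transient private val blockIds: Array[String])
  extends Serializable
{code}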



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11460) Locality waits should be based on task set creation time, not last launch time

2015-12-28 Thread Casimir IV Jagiellon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073091#comment-15073091
 ] 

Casimir IV Jagiellon commented on SPARK-11460:
--

Sorry, I accidentally put 11460 in my PR commit message.  Mine is 12460. I 
fixed the problem.  Sorry for the confusion. 

> Locality waits should be based on task set creation time, not last launch time
> --
>
> Key: SPARK-11460
> URL: https://issues.apache.org/jira/browse/SPARK-11460
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: YARN
>Reporter: Shengyue Ji
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Spark waits for the spark.locality.wait period before going from RACK_LOCAL to 
> ANY when selecting an executor for assignment, and the timeout is essentially 
> reset each time a new assignment is made.
> We were running Spark Streaming on Kafka with a 10 second batch window on 32 
> Kafka partitions with 16 executors. All executors were in the ANY group. At 
> one point one RACK_LOCAL executor was added and all tasks were assigned to 
> it. Each task took about 0.6 seconds to process, resetting the 
> spark.locality.wait timeout (3000ms) repeatedly. This caused the whole 
> process to underutilize resources and created an increasing backlog.
> spark.locality.wait should be based on the task set creation time, not the 
> last launch time, so that 3000ms after initial creation all executors can get 
> tasks assigned to them.
> We are specifying a zero timeout for now as a workaround to disable locality 
> optimization. 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L556
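
The zero-timeout workaround mentioned above is just the existing configuration 
knob; a minimal sketch:

{code}
import org.apache.spark.SparkConf

// Setting spark.locality.wait to 0 disables the locality wait entirely, so the
// scheduler falls through PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL to ANY immediately.
val conf = new SparkConf().set("spark.locality.wait", "0")
{code}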



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12231) Failed to generate predicate Error when using dropna

2015-12-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12231.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10388
[https://github.com/apache/spark/pull/10388]

> Failed to generate predicate Error when using dropna
> 
>
> Key: SPARK-12231
> URL: https://issues.apache.org/jira/browse/SPARK-12231
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: python version: 2.7.9
> os: ubuntu 14.04
>Reporter: yahsuan, chang
> Fix For: 2.0.0
>
>
> code to reproduce error
> # write.py
> {code}
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df = sqlc.range(10)
> df1 = df.withColumn('a', df['id'] * 2)
> df1.write.partitionBy('id').parquet('./data')
> {code}
> # read.py
> {code}
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df2 = sqlc.read.parquet('./data')
> df2.dropna().count()
> {code}
> $ spark-submit write.py
> $ spark-submit read.py
> # error message
> {code}
> 15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to 
> interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: a#0L
> ...
> {code}
> If the data is written without partitionBy, the error won't happen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12231) Failed to generate predicate Error when using dropna

2015-12-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12231:
-
Assignee: kevin yu

> Failed to generate predicate Error when using dropna
> 
>
> Key: SPARK-12231
> URL: https://issues.apache.org/jira/browse/SPARK-12231
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: python version: 2.7.9
> os: ubuntu 14.04
>Reporter: yahsuan, chang
>Assignee: kevin yu
> Fix For: 2.0.0
>
>
> code to reproduce error
> # write.py
> {code}
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df = sqlc.range(10)
> df1 = df.withColumn('a', df['id'] * 2)
> df1.write.partitionBy('id').parquet('./data')
> {code}
> # read.py
> {code}
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df2 = sqlc.read.parquet('./data')
> df2.dropna().count()
> {code}
> $ spark-submit write.py
> $ spark-submit read.py
> # error message
> {code}
> 15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to 
> interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: a#0L
> ...
> {code}
> If the data is written without partitionBy, the error won't happen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12517) No default RDD name for ones created by sc.textFile

2015-12-28 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-12517:
---
Target Version/s: 1.6.1, 2.0.0  (was: 1.5.1, 1.5.2)

> No default RDD name for ones created by sc.textFile 
> 
>
> Key: SPARK-12517
> URL: https://issues.apache.org/jira/browse/SPARK-12517
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.2
>Reporter: yaron weinsberg
>Priority: Minor
>  Labels: easyfix
> Fix For: 1.6.1, 2.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Having a default name for an RDD created from a file is very handy. 
> The feature was first added in commit 7b877b2 but was later removed 
> (probably by mistake) in commit fc8b581. 
> This change sets the default name of RDDs created via sc.textFile(...) to the 
> path argument.
> Here is the symptom:
> Using spark-1.5.2-bin-hadoop2.6:
> scala> sc.textFile("/home/root/.bashrc").name
> res5: String = null
> scala> sc.binaryFiles("/home/root/.bashrc").name
> res6: String = /home/root/.bashrc
> while using Spark 1.3.1:
> scala> sc.textFile("/home/root/.bashrc").name
> res0: String = /home/root/.bashrc
> scala> sc.binaryFiles("/home/root/.bashrc").name
> res1: String = /home/root/.bashrc
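
Until the fix is available, a workaround sketch is to set the name explicitly 
after creating the RDD:

{code}
// RDD.setName returns the RDD itself, so it can be chained.
val rdd = sc.textFile("/home/root/.bashrc").setName("/home/root/.bashrc")
assert(rdd.name == "/home/root/.bashrc")
{code}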



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5885) Add VectorAssembler

2015-12-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073123#comment-15073123
 ] 

Joseph K. Bradley commented on SPARK-5885:
--

I'd currently recommend using DataFrame operations to handle null values.   To 
clarify, is your case that some of your Vectors are null (no vector at all), or 
that some elements within a Vector are null?  Also, what is the ideal behavior 
for VectorAssembler for your use case?
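
As a concrete sketch of the suggestion above, with hypothetical column names: 
drop or fill nulls with DataFrame operations before assembling.

{code}
import org.apache.spark.ml.feature.VectorAssembler

// `df` is assumed to have numeric columns "dayOfWeek" and "timeOfDay" that may
// contain nulls; fill them (or drop the affected rows) before assembling.
val cleaned = df.na.fill(0.0, Seq("dayOfWeek", "timeOfDay"))  // or df.na.drop(Seq("dayOfWeek", "timeOfDay"))
val assembler = new VectorAssembler()
  .setInputCols(Array("userFeatures", "dayOfWeek", "timeOfDay"))
  .setOutputCol("features")
val assembled = assembler.transform(cleaned)
{code}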

> Add VectorAssembler
> ---
>
> Key: SPARK-5885
> URL: https://issues.apache.org/jira/browse/SPARK-5885
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.4.0
>
>
> `VectorAssembler` takes a list of columns (of type double/int/vector) and 
> merge them into a single vector column.
> {code}
> val va = new VectorAssembler()
>   .setInputCols("userFeatures", "dayOfWeek", "timeOfDay")
>   .setOutputCol("features")
> {code}
> In the first version, it should be okay if it doesn't handle ML attributes 
> (SPARK-4588).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


