[jira] [Commented] (SPARK-20004) Spark thrift server overwrites spark.app.name
[ https://issues.apache.org/jira/browse/SPARK-20004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931800#comment-15931800 ] Egor Pahomov commented on SPARK-20004: -- [~srowen], 1.6 and 2.0 allowed me to specify app.name. I have several Thrift Servers running and I need to distinguish between them in the YARN UI. > Spark thrift server overwrites spark.app.name > > > Key: SPARK-20004 > URL: https://issues.apache.org/jira/browse/SPARK-20004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Egor Pahomov >Priority: Minor > > {code} > export SPARK_YARN_APP_NAME="ODBC server $host" > /spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host > --conf spark.app.name="ODBC server $host" > {code} > And spark-defaults.conf contains: > {code} > spark.app.name "ODBC server spark01" > {code} > Still, the name in YARN is "Thrift JDBC/ODBC Server" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
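A minimal sketch of the behaviour being asked for, not the actual HiveThriftServer2 code: fall back to the hard-coded name only when the user has not supplied spark.app.name. Everything below is illustrative.
{code}
import org.apache.spark.SparkConf

// Hypothetical check: keep a user-supplied spark.app.name (from --conf or
// spark-defaults.conf) and only use the generic name as a fallback.
val conf = new SparkConf()
if (!conf.contains("spark.app.name")) {
  conf.setAppName("Thrift JDBC/ODBC Server")
}
{code}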
[jira] [Updated] (SPARK-20005) There is no "Newline" in UI in description
[ https://issues.apache.org/jira/browse/SPARK-20005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov updated SPARK-20005: - Description: There is no "newline" in UI in description: https://ibb.co/bLp2yv (was: Just see the attachment) > There is no "Newline" in UI in description > --- > > Key: SPARK-20005 > URL: https://issues.apache.org/jira/browse/SPARK-20005 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Egor Pahomov >Priority: Minor > > There is no "newline" in UI in description: https://ibb.co/bLp2yv -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20005) There is no "Newline" in UI in description
Egor Pahomov created SPARK-20005: Summary: There is no "Newline" in UI in description Key: SPARK-20005 URL: https://issues.apache.org/jira/browse/SPARK-20005 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0 Reporter: Egor Pahomov Priority: Minor Just see the attachment -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20004) Spark thrift server overwrites spark.app.name
[ https://issues.apache.org/jira/browse/SPARK-20004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov updated SPARK-20004: - Summary: Spark thrift server overwrites spark.app.name (was: Spark thrift server overrides spark.app.name) > Spark thrift server overwrites spark.app.name > > > Key: SPARK-20004 > URL: https://issues.apache.org/jira/browse/SPARK-20004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Egor Pahomov >Priority: Minor > > {code} > export SPARK_YARN_APP_NAME="ODBC server $host" > /spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host > --conf spark.app.name="ODBC server $host" > {code} > And spark-defaults.conf contains: > {code} > spark.app.name "ODBC server spark01" > {code} > Still, the name in YARN is "Thrift JDBC/ODBC Server" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20004) Spark thrift server overrides spark.app.name
Egor Pahomov created SPARK-20004: Summary: Spark thrift server overrides spark.app.name Key: SPARK-20004 URL: https://issues.apache.org/jira/browse/SPARK-20004 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Egor Pahomov Priority: Minor {code} export SPARK_YARN_APP_NAME="ODBC server $host" /spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host --conf spark.app.name="ODBC server $host" {code} And spark-defaults.conf contains: {code} spark.app.name "ODBC server spark01" {code} Still, the name in YARN is "Thrift JDBC/ODBC Server" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861869#comment-15861869 ] Egor Pahomov commented on SPARK-19524: -- [~sowen], probably yes, I don't know. Rereading "Should process only new files and ignore existing files in the directory", I agree that setting this field to false does not necessarily mean processing old files. IMHO, everything around this field seems to be poorly documented or architected. Since there is no documentation about spark.streaming.minRememberDuration in http://spark.apache.org/docs/2.0.2/configuration.html#spark-streaming, I do not feel very comfortable changing it. More than that, it would be strange to change it just to process old files, when the purpose of this field is very different. Nevertheless, I was given an API with newFilesOnly, about which I made a false but not totally unreasonable assumption based on all the accessible documentation. I was wrong, but it still feels like a trap I walked into, one which could easily not be there. > newFilesOnly does not work according to docs. > -- > > Key: SPARK-19524 > URL: https://issues.apache.org/jira/browse/SPARK-19524 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > Docs says: > newFilesOnly > Should process only new files and ignore existing files in the directory > It's not working. > http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files > says, that it shouldn't work as expected. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala > not clear at all in terms, what code tries to do -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
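A hedged workaround sketch for the discussion above, assuming spark.streaming.minRememberDuration (the undocumented setting mentioned in the comment) controls the remember window that feeds the modification-time threshold; the one-hour value is purely illustrative.
{code}
import org.apache.spark.SparkConf

// Assumption: a larger remember window lets files whose modification time predates
// the job start fall inside the threshold and therefore get processed.
val conf = new SparkConf()
  .set("spark.streaming.minRememberDuration", "3600s")
{code}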
[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861600#comment-15861600 ] Egor Pahomov commented on SPARK-19524: -- [~sowen], Folder, which connect my streaming to: {code} [egor@hadoop2 test]$ date Fri Feb 10 09:51:16 PST 2017 [egor@hadoop2 test]$ ls -al total 445746 drwxr-xr-x 13 egor egor 4096 Feb 8 14:27 . drwxr-xr-x 43 egor egor 4096 Feb 9 01:38 .. -rw-r--r-- 1 root jobexecutors241661 Dec 1 18:03 clog.1480636981858.fl.log.gz -rw-r--r-- 1 egor egor387024 Feb 1 17:26 clog.1485986399693.fl.log.gz -rw-r--r-- 1 egor egor 128983477 Feb 8 12:43 clog.2017-01-03.1483431170180.9861.log.gz -rw-r--r-- 1 root jobexecutors 67422481 Dec 1 00:01 clog.new-1.1480579205495.fl.log.gz -rw-r--r-- 1 egor egor287279 Feb 8 13:21 data2.log.gz -rw-r--r-- 1 egor egor 128983477 Feb 8 14:10 data300.log.gz -rw-r--r-- 1 egor egor 128983477 Feb 8 14:20 data365.log.gz -rw-r--r-- 1 egor egor287279 Feb 8 13:23 data3.log.gz -rw-r--r-- 1 egor egor287279 Feb 8 13:45 data4.log.gz -rwxrwxr-x 1 egor egor287279 Feb 8 14:04 data5.log.gz -rwxrwxr-x 1 egor egor287279 Feb 8 14:08 data6.log.gz {code} They way I connect: {code} def f(path:Path): Boolean = { !path.getName.contains("tmp") } val client_log_d_stream = ssc.fileStream[LongWritable, Text, TextInputFormat](input_folder, f _ , newFilesOnly = false) {code} Nothing is processed. Than I add file to directory and it processes it. But not the old ones > newFilesOnly does not work according to docs. > -- > > Key: SPARK-19524 > URL: https://issues.apache.org/jira/browse/SPARK-19524 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > Docs says: > newFilesOnly > Should process only new files and ignore existing files in the directory > It's not working. > http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files > says, that it shouldn't work as expected. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala > not clear at all in terms, what code tries to do -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
[ https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860025#comment-15860025 ] Egor Pahomov commented on SPARK-19523: -- Probably my bad. I have {code} rdd.sparkContext {code} inside lambda. I do not have {code} rdd.sqlContext {code}. where I can take it? > Spark streaming+ insert into table leaves bunch of trash in table directory > --- > > Key: SPARK-19523 > URL: https://issues.apache.org/jira/browse/SPARK-19523 > Project: Spark > Issue Type: Improvement > Components: DStreams, SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov >Priority: Minor > > I have very simple code, which transform coming json files into pq table: > {code} > import org.apache.spark.sql.hive.HiveContext > import org.apache.hadoop.fs.Path > import org.apache.hadoop.io.{LongWritable, Text} > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat > import org.apache.spark.sql.SaveMode > object Client_log { > def main(args: Array[String]): Unit = { > val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * > from temp.x_streaming where year=2015 and month=12 and day=1").dtypes > var columns = resultCols.filter(x => > !Commons.stopColumns.contains(x._1)).map({ case (name, types) => { > s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as > ${Commons.mapType(types)}) as $name""" > } > }) > columns ++= List("'streaming' as sourcefrom") > def f(path:Path): Boolean = { > true > } > val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, > TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false) > client_log_d_stream.foreachRDD(rdd => { > val localHiveContext = new HiveContext(rdd.sparkContext) > import localHiveContext.implicits._ > var input = rdd.map(x => Record(x._2.toString)).toDF() > input = input.selectExpr(columns: _*) > input = > SmallOperators.populate(input, resultCols) > input > .write > .mode(SaveMode.Append) > .format("parquet") > .insertInto("temp.x_streaming") > }) > Spark.ssc.start() > Spark.ssc.awaitTermination() > } > case class Record(s: String) > } > {code} > This code generates a lot of trash directories in resalt table like: > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 
> .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
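A minimal sketch of the usual answer to the question above, assuming Spark's SQLContext.getOrCreate: derive a singleton context from the RDD's SparkContext inside foreachRDD rather than constructing a new HiveContext per batch. The stream and DataFrame handling are taken from the snippet quoted above.
{code}
import org.apache.spark.sql.SQLContext

// Sketch only: one shared SQLContext per JVM, obtained from the RDD's SparkContext.
client_log_d_stream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  // ... build the DataFrame and write it, as in the code quoted above
}
{code}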
[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860023#comment-15860023 ] Egor Pahomov commented on SPARK-19524: -- I'm really confused. I expected "new" to mean the files created after the start of the streaming job, and "old" to be everything else in the folder. If we change the definition of "new", then I believe everything is consistent; it's just that I'm not sure this definition of "new" is very intuitive. I want to process everything in the folder - existing and upcoming. I use this flag, and now it turns out that this flag has its own definition of "new". Maybe I'm not correct to call it a bug, but isn't it all very confusing for a person who does not really know how everything works inside? > newFilesOnly does not work according to docs. > -- > > Key: SPARK-19524 > URL: https://issues.apache.org/jira/browse/SPARK-19524 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > Docs says: > newFilesOnly > Should process only new files and ignore existing files in the directory > It's not working. > http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files > says, that it shouldn't work as expected. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala > not clear at all in terms, what code tries to do -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
[ https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858913#comment-15858913 ] Egor Pahomov edited comment on SPARK-19523 at 2/9/17 2:52 AM: -- @zsxwing, I haven't found the way to do that. inside foreachRDD there is no build HiveContext as far as I know. And when I build HiveContext outside and just use it inside, it ERROR me, that I can't have multiple SparkContext in one JVM. WHich is reasonable - it sound like very bad idea to serialise spark context was (Author: epahomov): @zsxwing, I haven't found the way to that. inside foreachRDD there is no build HiveContext as far as I know. And when I build HiveContext outside and just use it inside, it ERROR me, that I can't have multiple SparkContext in one JVM. WHich is reasonable - it sound like very bad idea to serialise spark context > Spark streaming+ insert into table leaves bunch of trash in table directory > --- > > Key: SPARK-19523 > URL: https://issues.apache.org/jira/browse/SPARK-19523 > Project: Spark > Issue Type: Improvement > Components: DStreams, SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov >Priority: Minor > > I have very simple code, which transform coming json files into pq table: > {code} > import org.apache.spark.sql.hive.HiveContext > import org.apache.hadoop.fs.Path > import org.apache.hadoop.io.{LongWritable, Text} > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat > import org.apache.spark.sql.SaveMode > object Client_log { > def main(args: Array[String]): Unit = { > val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * > from temp.x_streaming where year=2015 and month=12 and day=1").dtypes > var columns = resultCols.filter(x => > !Commons.stopColumns.contains(x._1)).map({ case (name, types) => { > s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as > ${Commons.mapType(types)}) as $name""" > } > }) > columns ++= List("'streaming' as sourcefrom") > def f(path:Path): Boolean = { > true > } > val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, > TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false) > client_log_d_stream.foreachRDD(rdd => { > val localHiveContext = new HiveContext(rdd.sparkContext) > import localHiveContext.implicits._ > var input = rdd.map(x => Record(x._2.toString)).toDF() > input = input.selectExpr(columns: _*) > input = > SmallOperators.populate(input, resultCols) > input > .write > .mode(SaveMode.Append) > .format("parquet") > .insertInto("temp.x_streaming") > }) > Spark.ssc.start() > Spark.ssc.awaitTermination() > } > case class Record(s: String) > } > {code} > This code generates a lot of trash directories in resalt table like: > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > 
.hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
[ https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858913#comment-15858913 ] Egor Pahomov commented on SPARK-19523: -- @zsxwing, I haven't found the way to that. inside foreachRDD there is no build HiveContext as far as I know. And when I build HiveContext outside and just use it inside, it ERROR me, that I can't have multiple SparkContext in one JVM. WHich is reasonable - it sound like very bad idea to serialise spark context > Spark streaming+ insert into table leaves bunch of trash in table directory > --- > > Key: SPARK-19523 > URL: https://issues.apache.org/jira/browse/SPARK-19523 > Project: Spark > Issue Type: Improvement > Components: DStreams, SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov >Priority: Minor > > I have very simple code, which transform coming json files into pq table: > {code} > import org.apache.spark.sql.hive.HiveContext > import org.apache.hadoop.fs.Path > import org.apache.hadoop.io.{LongWritable, Text} > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat > import org.apache.spark.sql.SaveMode > object Client_log { > def main(args: Array[String]): Unit = { > val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * > from temp.x_streaming where year=2015 and month=12 and day=1").dtypes > var columns = resultCols.filter(x => > !Commons.stopColumns.contains(x._1)).map({ case (name, types) => { > s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as > ${Commons.mapType(types)}) as $name""" > } > }) > columns ++= List("'streaming' as sourcefrom") > def f(path:Path): Boolean = { > true > } > val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, > TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false) > client_log_d_stream.foreachRDD(rdd => { > val localHiveContext = new HiveContext(rdd.sparkContext) > import localHiveContext.implicits._ > var input = rdd.map(x => Record(x._2.toString)).toDF() > input = input.selectExpr(columns: _*) > input = > SmallOperators.populate(input, resultCols) > input > .write > .mode(SaveMode.Append) > .format("parquet") > .insertInto("temp.x_streaming") > }) > Spark.ssc.start() > Spark.ssc.awaitTermination() > } > case class Record(s: String) > } > {code} > This code generates a lot of trash directories in resalt table like: > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > 
.hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858912#comment-15858912 ] Egor Pahomov commented on SPARK-19524: -- [~uncleGen], sorry, I haven't understood. > newFilesOnly does not work according to docs. > -- > > Key: SPARK-19524 > URL: https://issues.apache.org/jira/browse/SPARK-19524 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > Docs says: > newFilesOnly > Should process only new files and ignore existing files in the directory > It's not working. > http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files > says, that it shouldn't work as expected. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala > not clear at all in terms, what code tries to do -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858910#comment-15858910 ] Egor Pahomov commented on SPARK-19524: -- [~srowen], based on the documentation, which says "newFilesOnly - Should process only new files and ignore existing files in the directory", I expect that files which already exist in the folder I connect the stream to would be processed. That's not true. In reality, only files created after time X are processed. Time X:
{code}
private val durationToRemember = slideDuration * numBatchesToRemember
val modTimeIgnoreThreshold = math.max(
  initialModTimeIgnoreThreshold,                 // initial threshold based on newFilesOnly setting
  currentTime - durationToRemember.milliseconds  // trailing end of the remember window
)
{code}
First, this code contradicts the documentation as far as I understand it. Second, it contradicts the name "newFilesOnly" itself. There is probably a motivation behind this code, but I think the only way to find it is to dig through the git history and go over all the tickets. Sorry that I wasn't more specific the first time. > newFilesOnly does not work according to docs. > -- > > Key: SPARK-19524 > URL: https://issues.apache.org/jira/browse/SPARK-19524 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > Docs says: > newFilesOnly > Should process only new files and ignore existing files in the directory > It's not working. > http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files > says, that it shouldn't work as expected. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala > not clear at all in terms, what code tries to do -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19524) newFilesOnly does not work according to docs.
Egor Pahomov created SPARK-19524: Summary: newFilesOnly does not work according to docs. Key: SPARK-19524 URL: https://issues.apache.org/jira/browse/SPARK-19524 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.0.2 Reporter: Egor Pahomov The docs say: newFilesOnly - Should process only new files and ignore existing files in the directory. It's not working. http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files suggests that it does not work the way one would expect. https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala is not clear at all about what the code is trying to do. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
Egor Pahomov created SPARK-19523: Summary: Spark streaming+ insert into table leaves bunch of trash in table directory Key: SPARK-19523 URL: https://issues.apache.org/jira/browse/SPARK-19523 Project: Spark Issue Type: Bug Components: SQL, Structured Streaming Affects Versions: 2.0.2 Reporter: Egor Pahomov I have very simple code, which transform coming json files into pq table: {code} import org.apache.spark.sql.hive.HiveContext import org.apache.hadoop.fs.Path import org.apache.hadoop.io.{LongWritable, Text} import org.apache.hadoop.mapreduce.lib.input.TextInputFormat import org.apache.spark.sql.SaveMode object Client_log { def main(args: Array[String]): Unit = { val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * from temp.x_streaming where year=2015 and month=12 and day=1").dtypes var columns = resultCols.filter(x => !Commons.stopColumns.contains(x._1)).map({ case (name, types) => { s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as ${Commons.mapType(types)}) as $name""" } }) columns ++= List("'streaming' as sourcefrom") def f(path:Path): Boolean = { true } val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false) client_log_d_stream.foreachRDD(rdd => { val localHiveContext = new HiveContext(rdd.sparkContext) import localHiveContext.implicits._ var input = rdd.map(x => Record(x._2.toString)).toDF() input = input.selectExpr(columns: _*) input = SmallOperators.populate(input, resultCols) input .write .mode(SaveMode.Append) .format("parquet") .insertInto("temp.x_streaming") }) Spark.ssc.start() Spark.ssc.awaitTermination() } case class Record(s: String) } {code} This code generates a lot of trash directories in resalt table like: drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.
[ https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786368#comment-15786368 ] Egor Pahomov commented on SPARK-18930: -- I'm not sure that such a restriction, buried in the documentation, is OK. Basically the problem is: I've created a correct schema for the table and correctly inserted into it, but for some reason I need to keep a particular order of columns in the select statement. > Inserting in partitioned table - partitioned field should be last in select > statement. > --- > > Key: SPARK-18930 > URL: https://issues.apache.org/jira/browse/SPARK-18930 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > CREATE TABLE temp.test_partitioning_4 ( > num string > ) > PARTITIONED BY ( > day string) > stored as parquet > INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) > select day, count(*) as num from > hss.session where year=2016 and month=4 > group by day > Resulted schema on HDFS: /temp.db/test_partitioning_3/day=62456298, > emp.db/test_partitioning_3/day=69094345 > As you can imagine these numbers are num of records. But! When I do select * > from temp.test_partitioning_4 data is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
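A hedged rewrite of the insert from the description, assuming the fix is simply to list the partitioning column last in the SELECT so it maps to the dynamic-partition slot rather than to the num column; sqlContext here stands for whatever HiveContext/SQLContext is already in use.
{code}
// Sketch only: the partition column (day) comes last in the SELECT list.
sqlContext.sql("""
  INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
  SELECT count(*) AS num, day
  FROM hss.session
  WHERE year = 2016 AND month = 4
  GROUP BY day
""")
{code}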
[jira] [Created] (SPARK-18931) Create empty staging directory in partitioned table on insert
Egor Pahomov created SPARK-18931: Summary: Create empty staging directory in partitioned table on insert Key: SPARK-18931 URL: https://issues.apache.org/jira/browse/SPARK-18931 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Egor Pahomov CREATE TABLE temp.test_partitioning_4 ( num string ) PARTITIONED BY ( day string) stored as parquet On every INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) select day, count(*) as num from hss.session where year=2016 and month=4 group by day a new directory ".hive-staging_hive_2016-12-19_15-55-11_298_3412488541559534475-4" is created on HDFS. It's a big issue, because I insert every day, and a bunch of empty dirs is very bad for HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
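A possible mitigation sketch, assuming the Hive setting hive.exec.stagingdir is honoured by the HiveContext in use; the path is only illustrative, and this merely moves the staging directories out of the table location rather than fixing why they are left behind.
{code}
// Assumption: relocating Hive's staging directory keeps ".hive-staging_*" leftovers
// from accumulating under the table's own HDFS directory.
hiveContext.setConf("hive.exec.stagingdir", "/tmp/hive-staging")
{code}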
[jira] [Created] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.
Egor Pahomov created SPARK-18930: Summary: Inserting in partitioned table - partitioned field should be last in select statement. Key: SPARK-18930 URL: https://issues.apache.org/jira/browse/SPARK-18930 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Egor Pahomov CREATE TABLE temp.test_partitioning_4 ( num string ) PARTITIONED BY ( day string) stored as parquet INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) select day, count(*) as num from hss.session where year=2016 and month=4 group by day Resulted schema on HDFS: /temp.db/test_partitioning_3/day=62456298, emp.db/test_partitioning_3/day=69094345 As you can imagine these numbers are num of records. But! When I do select * from temp.test_partitioning_4 data is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18336) SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2
[ https://issues.apache.org/jira/browse/SPARK-18336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov closed SPARK-18336. Resolution: Invalid Moving from 1.6 to 2.0 forced me use spark-submit instead of using spark assembly jar. I though, that if I define driver memory in spark-defaults.conf spark-submit would read it before creating java machine and would create driver with configured memory. I was wrong - I needed to specify it though parameter. It's counterintuitive, but I was wrong to make such assumption. > SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2 > > > Key: SPARK-18336 > URL: https://issues.apache.org/jira/browse/SPARK-18336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > I had several(~100) quires, which were run one after another in single spark > context. I can provide code of runner - it's very simple. It worked fine on > 1.6.2, than I moved to 2551d959a6c9fb27a54d38599a2301d735532c24 (branch-2.0 > on 31.10.2016 17:04:12). It started to fail with OOM and other errors. When I > separate my 100 quires to 2 sets and run set after set it works fine. I would > suspect problems with memory on driver, but nothing points to that. > My conf: > {code} > lazy val sparkConfTemplate = new SparkConf() > .setMaster("yarn-client") > .setAppName(appName) > .set("spark.executor.memory", "25g") > .set("spark.executor.instances", "40") > .set("spark.dynamicAllocation.enabled", "false") > .set("spark.yarn.executor.memoryOverhead", "3000") > .set("spark.executor.cores", "6") > .set("spark.driver.memory", "25g") > .set("spark.driver.cores", "5") > .set("spark.yarn.am.memory", "20g") > .set("spark.shuffle.io.numConnectionsPerPeer", "5") > .set("spark.sql.autoBroadcastJoinThreshold", "10") > .set("spark.network.timeout", "4000s") > .set("spark.driver.maxResultSize", "5g") > .set("spark.sql.parquet.compression.codec", "gzip") > .set("spark.kryoserializer.buffer.max", "1200m") > .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > .set("spark.yarn.driver.memoryOverhead", "1000") > .set("spark.scheduler.mode", "FIFO") > .set("spark.sql.broadcastTimeout", "2") > .set("spark.akka.frameSize", "200") > .set("spark.sql.shuffle.partitions", partitions) > .set("spark.network.timeout", "1000s") > > .setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath())) > {code} > Errors, which started to happen: > {code} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f04c6cf3ea8, pid=17479, tid=139658116687616 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > # Problematic frame: > # V [libjvm.so+0x64bea8] > InstanceKlass::oop_follow_contents(ParCompactionManager*, oopDesc*)+0x88 > # > # Failed to write core dump. Core dumps have been disabled. 
To enable core > dumping, try "ulimit -c unlimited" before starting Java again > # > # An error report file with more information is saved as: > # /home/egor/hs_err_pid17479.log > # > # If you would like to submit a bug report, please visit: > # http://bugreport.java.com/bugreport/crash.jsp > # > {code} > {code} > Exception in thread "refresh progress" java.lang.OutOfMemoryError: Java heap > space > at scala.collection.immutable.Iterable$.newBuilder(Iterable.scala:44) > at scala.collection.Iterable$.newBuilder(Iterable.scala:50) > at > scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:70) > at > scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:104) > at > scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57) > at > scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52) > at > scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.SparkStatusTracker.getActiveStageIds(SparkStatusTracker.scala:61) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:66) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:54) > at java.util.TimerThread.mainLoop(Unknown Source) > at
[jira] [Commented] (SPARK-18336) SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2
[ https://issues.apache.org/jira/browse/SPARK-18336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648912#comment-15648912 ] Egor Pahomov commented on SPARK-18336: -- [~srowen], I've read everything in documentation about new memory management and tryied next config: {code} .set("spark.executor.memory", "28g") .set("spark.executor.instances", "50") .set("spark.dynamicAllocation.enabled", "false") .set("spark.yarn.executor.memoryOverhead", "3000") .set("spark.executor.cores", "6") .set("spark.driver.memory", "25g") .set("spark.driver.cores", "5") .set("spark.yarn.am.memory", "20g") .set("spark.shuffle.io.numConnectionsPerPeer", "5") .set("spark.sql.autoBroadcastJoinThreshold", "30485760") .set("spark.network.timeout", "4000s") .set("spark.driver.maxResultSize", "5g") .set("spark.sql.parquet.compression.codec", "gzip") .set("spark.kryoserializer.buffer.max", "1200m") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.yarn.driver.memoryOverhead", "1000") .set("spark.memory.storageFraction", "0.2") .set("spark.memory.fraction", "0.8") .set("spark.sql.parquet.cacheMetadata", "false") .set("spark.scheduler.mode", "FIFO") .set("spark.sql.broadcastTimeout", "2") .set("spark.akka.frameSize", "200") .set("spark.sql.shuffle.partitions", partitions) .set("spark.network.timeout", "1000s") {code} Is there a manual - "I'm happy with how it was working before, can I configure my job be exactly like old times" ? Can I read somewere what happened to Tungsten project, beacause it seemed like big deal and now it turned of by default? > SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2 > > > Key: SPARK-18336 > URL: https://issues.apache.org/jira/browse/SPARK-18336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > I had several(~100) quires, which were run one after another in single spark > context. I can provide code of runner - it's very simple. It worked fine on > 1.6.2, than I moved to 2551d959a6c9fb27a54d38599a2301d735532c24 (branch-2.0 > on 31.10.2016 17:04:12). It started to fail with OOM and other errors. When I > separate my 100 quires to 2 sets and run set after set it works fine. I would > suspect problems with memory on driver, but nothing points to that. 
> My conf: > {code} > lazy val sparkConfTemplate = new SparkConf() > .setMaster("yarn-client") > .setAppName(appName) > .set("spark.executor.memory", "25g") > .set("spark.executor.instances", "40") > .set("spark.dynamicAllocation.enabled", "false") > .set("spark.yarn.executor.memoryOverhead", "3000") > .set("spark.executor.cores", "6") > .set("spark.driver.memory", "25g") > .set("spark.driver.cores", "5") > .set("spark.yarn.am.memory", "20g") > .set("spark.shuffle.io.numConnectionsPerPeer", "5") > .set("spark.sql.autoBroadcastJoinThreshold", "10") > .set("spark.network.timeout", "4000s") > .set("spark.driver.maxResultSize", "5g") > .set("spark.sql.parquet.compression.codec", "gzip") > .set("spark.kryoserializer.buffer.max", "1200m") > .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > .set("spark.yarn.driver.memoryOverhead", "1000") > .set("spark.scheduler.mode", "FIFO") > .set("spark.sql.broadcastTimeout", "2") > .set("spark.akka.frameSize", "200") > .set("spark.sql.shuffle.partitions", partitions) > .set("spark.network.timeout", "1000s") > > .setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath())) > {code} > Errors, which started to happen: > {code} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f04c6cf3ea8, pid=17479, tid=139658116687616 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > # Problematic frame: > # V [libjvm.so+0x64bea8] > InstanceKlass::oop_follow_contents(ParCompactionManager*, oopDesc*)+0x88 > # > # Failed to write core dump. Core dumps have been disabled. To enable core > dumping, try "ulimit -c unlimited" before starting Java again > # > # An error report file with more information is saved as: > # /home/egor/hs_err_pid17479.log > # > # If you would like to submit a bug report, please visit: > # http://bugreport.java.com/bugreport/crash.jsp > # > {code} > {code} > Exception in thread "refresh progress" java.lang.OutOfMemoryError: Java heap > space > at scala.collection.immutable.Iterable$.newBuilder(Iterable.scala:44) > at
[jira] [Created] (SPARK-18336) SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2
Egor Pahomov created SPARK-18336: Summary: SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2 Key: SPARK-18336 URL: https://issues.apache.org/jira/browse/SPARK-18336 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Egor Pahomov I had several(~100) quires, which were run one after another in single spark context. I can provide code of runner - it's very simple. It worked fine on 1.6.2, than I moved to 2551d959a6c9fb27a54d38599a2301d735532c24 (branch-2.0 on 31.10.2016 17:04:12). It started to fail with OOM and other errors. When I separate my 100 quires to 2 sets and run set after set it works fine. I would suspect problems with memory on driver, but nothing points to that. My conf: {code} lazy val sparkConfTemplate = new SparkConf() .setMaster("yarn-client") .setAppName(appName) .set("spark.executor.memory", "25g") .set("spark.executor.instances", "40") .set("spark.dynamicAllocation.enabled", "false") .set("spark.yarn.executor.memoryOverhead", "3000") .set("spark.executor.cores", "6") .set("spark.driver.memory", "25g") .set("spark.driver.cores", "5") .set("spark.yarn.am.memory", "20g") .set("spark.shuffle.io.numConnectionsPerPeer", "5") .set("spark.sql.autoBroadcastJoinThreshold", "10") .set("spark.network.timeout", "4000s") .set("spark.driver.maxResultSize", "5g") .set("spark.sql.parquet.compression.codec", "gzip") .set("spark.kryoserializer.buffer.max", "1200m") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.yarn.driver.memoryOverhead", "1000") .set("spark.scheduler.mode", "FIFO") .set("spark.sql.broadcastTimeout", "2") .set("spark.akka.frameSize", "200") .set("spark.sql.shuffle.partitions", partitions) .set("spark.network.timeout", "1000s") .setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath())) {code} Errors, which started to happen: {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x7f04c6cf3ea8, pid=17479, tid=139658116687616 # # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops) # Problematic frame: # V [libjvm.so+0x64bea8] InstanceKlass::oop_follow_contents(ParCompactionManager*, oopDesc*)+0x88 # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try "ulimit -c unlimited" before starting Java again # # An error report file with more information is saved as: # /home/egor/hs_err_pid17479.log # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp # {code} {code} Exception in thread "refresh progress" java.lang.OutOfMemoryError: Java heap space at scala.collection.immutable.Iterable$.newBuilder(Iterable.scala:44) at scala.collection.Iterable$.newBuilder(Iterable.scala:50) at scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:70) at scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:104) at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57) at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52) at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229) at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.SparkStatusTracker.getActiveStageIds(SparkStatusTracker.scala:61) at org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:66) at org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:54) at java.util.TimerThread.mainLoop(Unknown Source) at java.util.TimerThread.run(Unknown Source) java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.sql.hive.client.Shim_v0_13.getAllPartitions(HiveShim.scala:431) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitions$1.apply(HiveClientImpl.scala:538) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitions$1.apply(HiveClientImpl.scala:535) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:280)
[jira] [Created] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "
Egor Pahomov created SPARK-17616: Summary: Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate " Key: SPARK-17616 URL: https://issues.apache.org/jira/browse/SPARK-17616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Egor Pahomov Priority: Minor I execute: {code} select platform, collect_set(user_auth) as paid_types, count(distinct sessionid) as sessions from non_hss.session where event = 'stop' and platform != 'testplatform' and not (month = MONTH(current_date()) AND year = YEAR(current_date()) and day = day(current_date())) and ( (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = YEAR(add_months(CURRENT_DATE(), -5))) OR (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > YEAR(add_months(CURRENT_DATE(), -5))) ) group by platform {code} I get: {code} java.lang.RuntimeException: Distinct columns cannot exist in Aggregate operator containing aggregate functions which don't support partial aggregation. {code} IT WORKED IN 1.6.2. I've read the error 5 times and the code once. I still don't understand what I'm doing incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
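A hedged workaround sketch for the query above: compute collect_set and count(DISTINCT ...) in separate aggregates and join them, so that no single Aggregate operator mixes the two. The date filters are omitted for brevity and sqlContext stands for an existing context.
{code}
// Sketch only: splitting the aggregations avoids combining collect_set with
// count(DISTINCT ...) in one Aggregate operator.
sqlContext.sql("""
  SELECT a.platform, a.paid_types, b.sessions
  FROM (SELECT platform, collect_set(user_auth) AS paid_types
        FROM non_hss.session WHERE event = 'stop' GROUP BY platform) a
  JOIN (SELECT platform, count(DISTINCT sessionid) AS sessions
        FROM non_hss.session WHERE event = 'stop' GROUP BY platform) b
    ON a.platform = b.platform
""")
{code}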
[jira] [Created] (SPARK-17615) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "
Egor Pahomov created SPARK-17615: Summary: Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate " Key: SPARK-17615 URL: https://issues.apache.org/jira/browse/SPARK-17615 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Egor Pahomov Priority: Minor I execute: {code} select platform, collect_set(user_auth) as paid_types, count(distinct sessionid) as sessions from non_hss.session where event = 'stop' and platform != 'testplatform' and not (month = MONTH(current_date()) AND year = YEAR(current_date()) and day = day(current_date())) and ( (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = YEAR(add_months(CURRENT_DATE(), -5))) OR (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > YEAR(add_months(CURRENT_DATE(), -5))) ) group by platform {code} I get: {code} java.lang.RuntimeException: Distinct columns cannot exist in Aggregate operator containing aggregate functions which don't support partial aggregation. {code} IT WORKED IN 1.6.2. I've read error 5 times, and read code once. I still don't understand what I do incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary
Egor Pahomov created SPARK-17557: Summary: SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary Key: SPARK-17557 URL: https://issues.apache.org/jira/browse/SPARK-17557 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Egor Pahomov Working on 1.6.2, broken on 2.0 {code} select * from logs.a where year=2016 and month=9 and day=14 limit 100 {code} {code} java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48) at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
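A hedged sketch of a common way to sidestep vectorized-reader decode errors like the one above, assuming the non-vectorized Parquet reader can still read this table; it trades scan performance for compatibility and does not address the underlying decoding bug.
{code}
// Assumption: disabling the vectorized reader avoids the dictionary-decoding path
// (OnHeapColumnVector.getInt) that throws here.
sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "false")
{code}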
[jira] [Created] (SPARK-17364) Can not query hive table starting with number
Egor Pahomov created SPARK-17364: Summary: Can not query hive table starting with number Key: SPARK-17364 URL: https://issues.apache.org/jira/browse/SPARK-17364 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Egor Pahomov I can do it with spark-1.6.2 {code} SELECT * from temp.20160826_ip_list limit 100 {code} {code} Error: org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19) == SQL == SELECT * from temp.20160826_ip_list limit 100 ---^^^ SQLState: null ErrorCode: 0 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
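A hedged workaround sketch for the parser error above: quote the table name in backticks, which the grammar accepts as a BACKQUOTED_IDENTIFIER; sqlContext again stands for an existing context.
{code}
// Sketch only: backticks let the parser accept an identifier that starts with a digit.
sqlContext.sql("SELECT * FROM temp.`20160826_ip_list` LIMIT 100")
{code}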
[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432040#comment-15432040 ] Egor Pahomov commented on SPARK-16334: -- (just reason for reopen) > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Assignee: Sameer Agarwal >Priority: Critical > Labels: sql > Fix For: 2.0.1, 2.1.0 > > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov reopened SPARK-16334: -- Seems like a lot of people still have a problem even after suggested fix > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Assignee: Sameer Agarwal >Priority: Critical > Labels: sql > Fix For: 2.0.1, 2.1.0 > > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
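For anyone still hitting this after the fix, a mitigation that is sometimes tried (a hedged sketch, not a confirmed fix for this ticket) is to disable the vectorized Parquet reader, since VectorizedColumnReader.decodeDictionaryIds is where the stack trace above fails, and fall back to the slower row-based reader:
{code}
// spark-shell (Scala) sketch for Spark 2.0.x; table name and filter are from the report above.
// Turning off the vectorized reader bypasses VectorizedColumnReader entirely, at the cost of scan speed.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("select * from blabla where user_id = 415706251").show()
{code}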
[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384736#comment-15384736 ] Egor Pahomov commented on SPARK-16334: -- [~sameerag] that particular case on which I experienced the problem - now working for me. It's hard for me to reproduce what exactly made the bug appear before - would not go into such much trouble. Thanks > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Labels: sql > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380249#comment-15380249 ] Egor Pahomov commented on SPARK-16334: -- Sure, would test on Monday. > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Labels: sql > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data
[ https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378112#comment-15378112 ] Egor Pahomov commented on SPARK-16548: -- [~sowen] if your concern is "silently", we can write errors to the logs. WDYT? > java.io.CharConversionException: Invalid UTF-32 character prevents me from > querying my data > > > Key: SPARK-16548 > URL: https://issues.apache.org/jira/browse/SPARK-16548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Egor Pahomov >Priority: Minor > > Basically, when I query my json data I get > {code} > java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above > 10) at char #192, byte #771) > at > com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) > at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571) > at > org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142) > {code} > I do not like it. If you can not process one json among 100500 please return > null, do not fail everything. I have dirty one line fix, and I understand how > I can make it more reasonable. What is our position - what behaviour we wanna > get? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data
Egor Pahomov created SPARK-16548: Summary: java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data Key: SPARK-16548 URL: https://issues.apache.org/jira/browse/SPARK-16548 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Reporter: Egor Pahomov Priority: Minor Basically, when I query my JSON data I get {code} java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above 10) at char #192, byte #771) at com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571) at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142) {code} I do not like this behaviour. If one JSON record out of 100500 cannot be processed, please return null for it instead of failing everything. I have a dirty one-line fix, and I understand how to make it more reasonable. What is our position - what behaviour do we want? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
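The "return null instead of failing" behaviour asked for above can be approximated on the user side with a UDF that swallows decode errors per record. This is only a sketch, assuming a Spark 2.x SparkSession named spark and a string column named line; it is not the one-line fix mentioned in the report.
{code}
import java.io.IOException
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.functions.{col, udf}

// Extract $.event per record; any record Jackson cannot decode yields null instead of
// failing the whole stage. The per-record ObjectMapper keeps the sketch simple; cache it
// in real code.
val safeEvent = udf { (line: String) =>
  if (line == null) null
  else {
    try new ObjectMapper().readTree(line).path("event").asText(null)
    catch { case _: IOException => null } // covers CharConversionException and parse errors
  }
}

// Usage, assuming a DataFrame df with a string column "line":
// df.select(safeEvent(col("line")).as("event")).show()
{code}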
[jira] [Created] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
Egor Pahomov created SPARK-16334: Summary: [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException Key: SPARK-16334 URL: https://issues.apache.org/jira/browse/SPARK-16334 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Egor Pahomov Priority: Critical Query: {code} select * from blabla where user_id = 415706251 {code} Error: {code} 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov updated SPARK-16334: - Labels: sql (was: ) > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Labels: sql > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16228) "Percentile" needs explicit cast to double
[ https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351697#comment-15351697 ] Egor Pahomov commented on SPARK-16228: -- [~srowen] "blocker" is questionable, I agree. I just believe that anything which prevents you from moving from 1.6.1 to 2.0 without major code changes is a "blocker". If I migrated while such a bug was still there, a lot of my analysts' notebooks would be invalid. > "Percentile" needs explicit cast to double > -- > > Key: SPARK-16228 > URL: https://issues.apache.org/jira/browse/SPARK-16228 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov > > {quote} > select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla > {quote} > Works. > {quote} > select percentile(cast(id as bigint), 0.5 ) from temp.bla > {quote} > Throws > {quote} > Error in query: No handler for Hive UDF > 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': > org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method > for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, > decimal(38,18)). Possible choices: _FUNC_(bigint, array) > _FUNC_(bigint, double) ; line 1 pos 7 > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16228) "Percentile" needs explicit cast to double
[ https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov updated SPARK-16228: - Description: {quote} select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla {quote} Works. {quote} select percentile(cast(id as bigint), 0.5 ) from temp.bla {quote} Throws {quote} Error in query: No handler for Hive UDF 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, decimal(38,18)). Possible choices: _FUNC_(bigint, array) _FUNC_(bigint, double) ; line 1 pos 7 {quote} was: select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla Works. select percentile(cast(id as bigint), 0.5 ) from temp.bla Throws Error in query: No handler for Hive UDF 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, decimal(38,18)). Possible choices: _FUNC_(bigint, array) _FUNC_(bigint, double) ; line 1 pos 7 > "Percentile" needs explicit cast to double > -- > > Key: SPARK-16228 > URL: https://issues.apache.org/jira/browse/SPARK-16228 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Blocker > > {quote} > select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla > {quote} > Works. > {quote} > select percentile(cast(id as bigint), 0.5 ) from temp.bla > {quote} > Throws > {quote} > Error in query: No handler for Hive UDF > 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': > org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method > for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, > decimal(38,18)). Possible choices: _FUNC_(bigint, array) > _FUNC_(bigint, double) ; line 1 pos 7 > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16228) "Percentile" needs explicit cast to double
Egor Pahomov created SPARK-16228: Summary: "Percentile" needs explicit cast to double Key: SPARK-16228 URL: https://issues.apache.org/jira/browse/SPARK-16228 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Egor Pahomov Priority: Blocker select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla Works. select percentile(cast(id as bigint), 0.5 ) from temp.bla Throws Error in query: No handler for Hive UDF 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, decimal(38,18)). Possible choices: _FUNC_(bigint, array) _FUNC_(bigint, double) ; line 1 pos 7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
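For what it is worth, the root cause can be seen from the literal's type alone (a hedged illustration, assuming a Spark 2.x spark-shell and the temp.bla table from the report): 0.5 is parsed as a decimal, not a double, and Hive's UDAFPercentile has no decimal overload, which is why the explicit cast in the first query above is needed.
{code}
// The bare literal comes back as a decimal column, so UDAFPercentile's (bigint, double)
// signature is not matched without the cast.
spark.sql("SELECT 0.5").printSchema()                   // shows a decimal(...) column
spark.sql("SELECT CAST(0.5 AS DOUBLE)").printSchema()   // shows a double column

// The workaround from the report, with the explicit cast:
spark.sql("SELECT percentile(CAST(id AS BIGINT), CAST(0.5 AS DOUBLE)) FROM temp.bla").show()
{code}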
[jira] [Commented] (SPARK-15934) Return binary mode in ThriftServer
[ https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328814#comment-15328814 ] Egor Pahomov commented on SPARK-15934: -- Sure, let me create a pull request tomorrow. I would test, that everything working with all tools I mentioned - Tableau, DataGrip, Squirrel. > Return binary mode in ThriftServer > -- > > Key: SPARK-15934 > URL: https://issues.apache.org/jira/browse/SPARK-15934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov > > In spark-2.0.0 preview binary mode was turned off (SPARK-15095). > It was greatly irresponsible step due to the fact, that in 1.6.1 binary mode > was default and it turned off in 2.0.0. > Just to describe magnitude of harm not fixing this bug would do in my > organization: > * Tableau works only though Thrift Server and only with binary format. > Tableau would not work with spark-2.0.0 at all! > * I have bunch of analysts in my organization with configured sql > clients(DataGrip and Squirrel). I would need to go one by one to change > connection string for them(DataGrip). Squirrel simply do not work with http - > some jar hell in my case. > * let me not mention all other stuff which connects to our data > infrastructure through ThriftServer as gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15934) Return binary mode in ThriftServer
Egor Pahomov created SPARK-15934: Summary: Return binary mode in ThriftServer Key: SPARK-15934 URL: https://issues.apache.org/jira/browse/SPARK-15934 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Egor Pahomov In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095). This was a highly irresponsible step, given that binary mode was the default in 1.6.1 and is turned off in 2.0.0. Just to describe the magnitude of harm not fixing this bug would do in my organization: * Tableau works only through the Thrift Server and only with the binary format. Tableau would not work with spark-2.0.0 at all! * I have a bunch of analysts in my organization with configured SQL clients (DataGrip and Squirrel). I would need to go one by one to change the connection string for them (DataGrip). Squirrel simply does not work with http - some jar hell in my case. * Let me not mention all the other tools which connect to our data infrastructure through the ThriftServer as a gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
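For context on what changing the transport means for clients, a hedged sketch of the two HiveServer2-style JDBC connection strings involved (host, ports and database are placeholders, not values from this ticket; a hive-jdbc compatible driver is assumed on the classpath). The binary form is what the 1.6.1-era clients above are configured with; the http form is what each of them would have to be switched to:
{code}
import java.sql.DriverManager

// Binary (plain Thrift socket) transport -- the 1.6.1 default that Tableau/DataGrip/Squirrel expect:
val binaryUrl = "jdbc:hive2://thrift-host:10000/default"

// HTTP transport -- requires editing every client's connection string:
val httpUrl = "jdbc:hive2://thrift-host:10001/default;transportMode=http;httpPath=cliservice"

val conn = DriverManager.getConnection(binaryUrl, "user", "")
{code}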
[jira] [Closed] (SPARK-15409) Can't run spark odbc server on yarn the way I did it on 1.6.1
[ https://issues.apache.org/jira/browse/SPARK-15409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov closed SPARK-15409. Resolution: Invalid > Can't run spark odbc server on yarn the way I did it on 1.6.1 > - > > Key: SPARK-15409 > URL: https://issues.apache.org/jira/browse/SPARK-15409 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Minor > > I'm getting error, while spark tries to run application in YARN - > {code} > 16/05/19 10:33:20 INFO yarn.Client: Application report for > application_1463075121059_12094 (state: ACCEPTED) > 16/05/19 10:33:21 INFO yarn.Client: Application report for > application_1463075121059_12094 (state: ACCEPTED) > 16/05/19 10:33:22 INFO yarn.Client: Application report for > application_1463075121059_12094 (state: ACCEPTED) > 16/05/19 10:33:23 WARN server.TransportChannelHandler: Exception in > connection from nod5-1-hadoop.anchorfree.net/192.168.12.128:56327 > java.lang.NoSuchMethodError: > java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView; > at > org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:107) > at > org.apache.spark.rpc.netty.NettyRpcHandler.channelActive(NettyRpcEnv.scala:618) > at > org.apache.spark.network.server.TransportRequestHandler.channelActive(TransportRequestHandler.java:86) > at > org.apache.spark.network.server.TransportChannelHandler.channelActive(TransportChannelHandler.java:89) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) > at > io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) > at > io.netty.handler.timeout.IdleStateHandler.channelActive(IdleStateHandler.java:251) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) > at > io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) > at > io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) > at > io.netty.channel.DefaultChannelPipeline.fireChannelActive(DefaultChannelPipeline.java:817) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:454) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) > at > io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > 16/05/19 10:33:23 WARN 
channel.DefaultChannelPipeline: An exception was > thrown by a user handler's exceptionCaught() method while handling the > following exception: > java.lang.NoSuchMethodError: > java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView; > at > org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:107) > at > org.apache.spark.rpc.netty.NettyRpcHandler.channelActive(NettyRpcEnv.scala:618) > at > org.apache.spark.network.server.TransportRequestHandler.channelActive(TransportRequestHandler.java:86) > at > org.apache.spark.network.server.TransportChannelHandler.channelActive(TransportChannelHandler.java:89) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) > at >
[jira] [Created] (SPARK-15409) Can't run spark odbc server on yarn the way I did it on 1.6.1
Egor Pahomov created SPARK-15409: Summary: Can't run spark odbc server on yarn the way I did it on 1.6.1 Key: SPARK-15409 URL: https://issues.apache.org/jira/browse/SPARK-15409 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.0.0 Reporter: Egor Pahomov Priority: Minor I'm getting error, while spark tries to run application in YARN - {code} 16/05/19 10:33:20 INFO yarn.Client: Application report for application_1463075121059_12094 (state: ACCEPTED) 16/05/19 10:33:21 INFO yarn.Client: Application report for application_1463075121059_12094 (state: ACCEPTED) 16/05/19 10:33:22 INFO yarn.Client: Application report for application_1463075121059_12094 (state: ACCEPTED) 16/05/19 10:33:23 WARN server.TransportChannelHandler: Exception in connection from nod5-1-hadoop.anchorfree.net/192.168.12.128:56327 java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView; at org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:107) at org.apache.spark.rpc.netty.NettyRpcHandler.channelActive(NettyRpcEnv.scala:618) at org.apache.spark.network.server.TransportRequestHandler.channelActive(TransportRequestHandler.java:86) at org.apache.spark.network.server.TransportChannelHandler.channelActive(TransportChannelHandler.java:89) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) at io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) at io.netty.handler.timeout.IdleStateHandler.channelActive(IdleStateHandler.java:251) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) at io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) at io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) at io.netty.channel.DefaultChannelPipeline.fireChannelActive(DefaultChannelPipeline.java:817) at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:454) at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) 16/05/19 10:33:23 WARN channel.DefaultChannelPipeline: An exception was thrown by a user handler's exceptionCaught() method while handling the following exception: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView; at 
org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:107) at org.apache.spark.rpc.netty.NettyRpcHandler.channelActive(NettyRpcEnv.scala:618) at org.apache.spark.network.server.TransportRequestHandler.channelActive(TransportRequestHandler.java:86) at org.apache.spark.network.server.TransportChannelHandler.channelActive(TransportChannelHandler.java:89) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) at io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) at io.netty.handler.timeout.IdleStateHandler.channelActive(IdleStateHandler.java:251) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) at
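A hedged note on that stack trace rather than anything confirmed in the ticket (which was closed as Invalid above): a NoSuchMethodError on ConcurrentHashMap.keySet() returning ConcurrentHashMap$KeySetView is the classic symptom of running bytecode built against Java 8 on a Java 7 JRE, since the KeySetView return type only exists on Java 8. Checking which JVM the driver and the YARN containers actually use is a quick first step:
{code}
// Print the runtime Java version on the driver; the same check can be run inside a job
// to see what the YARN executors use. The KeySetView signature requires 1.8.x.
println(System.getProperty("java.version"))
{code}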
[jira] [Created] (SPARK-13654) get_json_object fails with java.io.CharConversionException
Egor Pahomov created SPARK-13654: Summary: get_json_object fails with java.io.CharConversionException Key: SPARK-13654 URL: https://issues.apache.org/jira/browse/SPARK-13654 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Reporter: Egor Pahomov I execute next query on my data: {code} select count(distinct get_json_object(regexp_extract(line, "^\\p{ASCII}*$", 0), '$.event')) from (select line from logs.raw_client_log where year=2016 and month=2 and day>28 and line rlike "^\\p{ASCII}*$" and line is not null) a {code} And it fails with {code} Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 420 in stage 168.0 failed 4 times, most recent failure: Lost task 420.3 in stage 168.0 (TID 13064, nod5-2-hadoop.anchorfree.net): java.io.CharConversionException: Invalid UTF-32 character 0x6576656e(above 10) at char #47, byte #191) at com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571) at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142) at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:141) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2202) at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2.apply(jsonExpressions.scala:141) at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2.apply(jsonExpressions.scala:138) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2202) at org.apache.spark.sql.catalyst.expressions.GetJsonObject.eval(jsonExpressions.scala:138) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source) at org.apache.spark.sql.execution.Expand$$anonfun$doExecute$1$$anonfun$3$$anon$1.next(Expand.scala:76) at org.apache.spark.sql.execution.Expand$$anonfun$doExecute$1$$anonfun$3$$anon$1.next(Expand.scala:62) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at 
org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) {code} Basically Spark sells me the idea, that I have character 敮 in my data. But query {code} select line from logs.raw_client_log where year=2016 and month=2 and day>27 and line rlike "敮" {code} returns nothing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
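A hedged reading of that error message, not something confirmed in the ticket: 0x6576656e is not a character that exists in the data. It is four ASCII bytes - 'e', 'v', 'e', 'n', for example from the word "event" - misread as a single UTF-32 code point by Jackson's encoding auto-detection, and 敮 (U+656E) is just the last two of those bytes, which is why the rlike query above finds nothing. A quick check in plain Scala:
{code}
// Interpret the reported "character" as bytes.
val bytes = Array[Byte](0x65, 0x76, 0x65, 0x6e)
println(new String(bytes, "US-ASCII"))       // even  -- plain ASCII misdetected as UTF-32
println(Integer.toHexString(0x6576656e))     // 6576656e -- the code point from the error
println(Integer.toHexString('敮'.toInt))     // 656e -- only the last two bytes
{code}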
[jira] [Closed] (SPARK-4403) Elastic allocation (spark.dynamicAllocation.enabled) results in task never being executed.
[ https://issues.apache.org/jira/browse/SPARK-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov closed SPARK-4403. --- Resolution: Invalid Elastic allocation (spark.dynamicAllocation.enabled) results in task never being executed. Key: SPARK-4403 URL: https://issues.apache.org/jira/browse/SPARK-4403 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.1 Reporter: Egor Pahomov Attachments: ipython_out I execute ipython notebook + pyspark with spark.dynamicAllocation.enabled = true. The task never ends. Code: {code} import sys from random import random from operator import add partitions = 10 n = 10 * partitions def f(_): x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 + y ** 2 < 1 else 0 count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add) print "Pi is roughly %f" % (4.0 * count / n) {code} {code} IPYTHON_ARGS="notebook --profile=ydf --port $IPYTHON_PORT --port-retries=0 --ip='*' --no-browser" pyspark \ --verbose \ --master yarn-client \ --conf spark.driver.port=$((RANDOM_PORT + 2)) \ --conf spark.broadcast.port=$((RANDOM_PORT + 3)) \ --conf spark.replClassServer.port=$((RANDOM_PORT + 4)) \ --conf spark.blockManager.port=$((RANDOM_PORT + 5)) \ --conf spark.executor.port=$((RANDOM_PORT + 6)) \ --conf spark.fileserver.port=$((RANDOM_PORT + 7)) \ --conf spark.shuffle.service.enabled=true \ --conf spark.dynamicAllocation.enabled=true \ --conf spark.dynamicAllocation.minExecutors=1 \ --conf spark.dynamicAllocation.maxExecutors=10 \ --conf spark.ui.port=$SPARK_UI_PORT {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4403) Elastic allocation (spark.dynamicAllocation.enabled) results in task never being executed.
[ https://issues.apache.org/jira/browse/SPARK-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov updated SPARK-4403: Attachment: ipython_out Elastic allocation (spark.dynamicAllocation.enabled) results in task never being executed. Key: SPARK-4403 URL: https://issues.apache.org/jira/browse/SPARK-4403 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.1 Reporter: Egor Pahomov Attachments: ipython_out I execute ipython notebook + pyspark with spark.dynamicAllocation.enabled = true. The task never ends. Code: {code} import sys from random import random from operator import add partitions = 10 n = 10 * partitions def f(_): x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 + y ** 2 < 1 else 0 count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add) print "Pi is roughly %f" % (4.0 * count / n) {code} {code} pyspark \ --verbose \ --master yarn-client \ --conf spark.driver.port=$((RANDOM_PORT + 2)) \ --conf spark.broadcast.port=$((RANDOM_PORT + 3)) \ --conf spark.replClassServer.port=$((RANDOM_PORT + 4)) \ --conf spark.blockManager.port=$((RANDOM_PORT + 5)) \ --conf spark.executor.port=$((RANDOM_PORT + 6)) \ --conf spark.fileserver.port=$((RANDOM_PORT + 7)) \ --conf spark.shuffle.service.enabled=true \ --conf spark.dynamicAllocation.enabled=true \ --conf spark.dynamicAllocation.minExecutors=1 \ --conf spark.dynamicAllocation.maxExecutors=10 \ --conf spark.ui.port=$SPARK_UI_PORT {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2157) Can't write tight firewall rules for Spark
[ https://issues.apache.org/jira/browse/SPARK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14035554#comment-14035554 ] Egor Pahomov commented on SPARK-2157: - Yep, I used patched version of spark. Can't write tight firewall rules for Spark -- Key: SPARK-2157 URL: https://issues.apache.org/jira/browse/SPARK-2157 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Priority: Critical In order to run Spark in places with strict firewall rules, you need to be able to specify every port that's used between all parts of the stack. Per the [network activity section of the docs|http://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security] most of the ports are configurable, but there are a few ports that aren't configurable. We need to make every port configurable to a particular port, so that we can run Spark in highly locked-down environments. -- This message was sent by Atlassian JIRA (v6.2#6252)