[jira] [Commented] (SPARK-20004) Spark thrift server overwrites spark.app.name
[ https://issues.apache.org/jira/browse/SPARK-20004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931800#comment-15931800 ] Egor Pahomov commented on SPARK-20004: -- [~srowen], 1.6 and 2.0 allowed me to specify app.name. I have several Thrift Servers running and I need to distinguish between them in the YARN UI. > Spark thrift server overwrites spark.app.name > > > Key: SPARK-20004 > URL: https://issues.apache.org/jira/browse/SPARK-20004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Egor Pahomov >Priority: Minor > > {code} > export SPARK_YARN_APP_NAME="ODBC server $host" > /spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host > --conf spark.app.name="ODBC server $host" > {code} > And spark-defaults.conf contains: > {code} > spark.app.name "ODBC server spark01" > {code} > Still, the name in YARN is "Thrift JDBC/ODBC Server" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
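A minimal sketch of the behaviour being asked for, not the actual HiveThriftServer2 code: fall back to the hard-coded name only when the user has not supplied spark.app.name. Everything below is illustrative.
{code}
import org.apache.spark.SparkConf

// Hypothetical check: keep a user-supplied spark.app.name (from --conf or
// spark-defaults.conf) and only use the generic name as a fallback.
val conf = new SparkConf()
if (!conf.contains("spark.app.name")) {
  conf.setAppName("Thrift JDBC/ODBC Server")
}
{code}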
[jira] [Updated] (SPARK-20005) There is no "Newline" in UI in description
[ https://issues.apache.org/jira/browse/SPARK-20005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov updated SPARK-20005: - Description: There is no "newline" in UI in description: https://ibb.co/bLp2yv (was: Just see the attachment) > There is no "Newline" in UI in description > --- > > Key: SPARK-20005 > URL: https://issues.apache.org/jira/browse/SPARK-20005 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Egor Pahomov >Priority: Minor > > There is no "newline" in UI in description: https://ibb.co/bLp2yv -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20005) There is no "Newline" in UI in description
Egor Pahomov created SPARK-20005: Summary: There is no "Newline" in UI in description Key: SPARK-20005 URL: https://issues.apache.org/jira/browse/SPARK-20005 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0 Reporter: Egor Pahomov Priority: Minor Just see the attachment -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20004) Spark thrift server overwrites spark.app.name
[ https://issues.apache.org/jira/browse/SPARK-20004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov updated SPARK-20004: - Summary: Spark thrift server overwrites spark.app.name (was: Spark thrift server overrides spark.app.name) > Spark thrift server overwrites spark.app.name > > > Key: SPARK-20004 > URL: https://issues.apache.org/jira/browse/SPARK-20004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Egor Pahomov >Priority: Minor > > {code} > export SPARK_YARN_APP_NAME="ODBC server $host" > /spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host > --conf spark.app.name="ODBC server $host" > {code} > And spark-defaults.conf contains: > {code} > spark.app.name "ODBC server spark01" > {code} > Still, the name in YARN is "Thrift JDBC/ODBC Server" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20004) Spark thrift server overrides spark.app.name
Egor Pahomov created SPARK-20004: Summary: Spark thrift server overrides spark.app.name Key: SPARK-20004 URL: https://issues.apache.org/jira/browse/SPARK-20004 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Egor Pahomov Priority: Minor {code} export SPARK_YARN_APP_NAME="ODBC server $host" /spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host --conf spark.app.name="ODBC server $host" {code} And spark-defaults.conf contains: {code} spark.app.name "ODBC server spark01" {code} Still, the name in YARN is "Thrift JDBC/ODBC Server" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861869#comment-15861869 ] Egor Pahomov commented on SPARK-19524: -- [~sowen], probably yes, I don't know. Rereading "Should process only new files and ignore existing files in the directory", I agree that setting this field to false does not necessarily mean processing old files. IMHO, everything around this field seems to be poorly documented or architected. Since there is no documentation about spark.streaming.minRememberDuration in http://spark.apache.org/docs/2.0.2/configuration.html#spark-streaming, I do not feel very comfortable changing it. More than that, it would be strange to change it just to process old files, when the purpose of this field is very different. Nevertheless, I was given an API with newFilesOnly, about which I made a false but not totally unreasonable assumption based on all the accessible documentation. I was wrong, but it still feels like a trap I walked into, one which could easily not be there. > newFilesOnly does not work according to docs. > -- > > Key: SPARK-19524 > URL: https://issues.apache.org/jira/browse/SPARK-19524 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > Docs says: > newFilesOnly > Should process only new files and ignore existing files in the directory > It's not working. > http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files > says, that it shouldn't work as expected. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala > not clear at all in terms, what code tries to do -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
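A hedged workaround sketch for the discussion above, assuming spark.streaming.minRememberDuration (the undocumented setting mentioned in the comment) controls the remember window that feeds the modification-time threshold; the one-hour value is purely illustrative.
{code}
import org.apache.spark.SparkConf

// Assumption: a larger remember window lets files whose modification time predates
// the job start fall inside the threshold and therefore get processed.
val conf = new SparkConf()
  .set("spark.streaming.minRememberDuration", "3600s")
{code}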
[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861600#comment-15861600 ] Egor Pahomov commented on SPARK-19524: -- [~sowen], Folder, which connect my streaming to: {code} [egor@hadoop2 test]$ date Fri Feb 10 09:51:16 PST 2017 [egor@hadoop2 test]$ ls -al total 445746 drwxr-xr-x 13 egor egor 4096 Feb 8 14:27 . drwxr-xr-x 43 egor egor 4096 Feb 9 01:38 .. -rw-r--r-- 1 root jobexecutors241661 Dec 1 18:03 clog.1480636981858.fl.log.gz -rw-r--r-- 1 egor egor387024 Feb 1 17:26 clog.1485986399693.fl.log.gz -rw-r--r-- 1 egor egor 128983477 Feb 8 12:43 clog.2017-01-03.1483431170180.9861.log.gz -rw-r--r-- 1 root jobexecutors 67422481 Dec 1 00:01 clog.new-1.1480579205495.fl.log.gz -rw-r--r-- 1 egor egor287279 Feb 8 13:21 data2.log.gz -rw-r--r-- 1 egor egor 128983477 Feb 8 14:10 data300.log.gz -rw-r--r-- 1 egor egor 128983477 Feb 8 14:20 data365.log.gz -rw-r--r-- 1 egor egor287279 Feb 8 13:23 data3.log.gz -rw-r--r-- 1 egor egor287279 Feb 8 13:45 data4.log.gz -rwxrwxr-x 1 egor egor287279 Feb 8 14:04 data5.log.gz -rwxrwxr-x 1 egor egor287279 Feb 8 14:08 data6.log.gz {code} They way I connect: {code} def f(path:Path): Boolean = { !path.getName.contains("tmp") } val client_log_d_stream = ssc.fileStream[LongWritable, Text, TextInputFormat](input_folder, f _ , newFilesOnly = false) {code} Nothing is processed. Than I add file to directory and it processes it. But not the old ones > newFilesOnly does not work according to docs. > -- > > Key: SPARK-19524 > URL: https://issues.apache.org/jira/browse/SPARK-19524 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > Docs says: > newFilesOnly > Should process only new files and ignore existing files in the directory > It's not working. > http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files > says, that it shouldn't work as expected. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala > not clear at all in terms, what code tries to do -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
[ https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860025#comment-15860025 ] Egor Pahomov commented on SPARK-19523: -- Probably my bad. I have {code} rdd.sparkContext {code} inside lambda. I do not have {code} rdd.sqlContext {code}. where I can take it? > Spark streaming+ insert into table leaves bunch of trash in table directory > --- > > Key: SPARK-19523 > URL: https://issues.apache.org/jira/browse/SPARK-19523 > Project: Spark > Issue Type: Improvement > Components: DStreams, SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov >Priority: Minor > > I have very simple code, which transform coming json files into pq table: > {code} > import org.apache.spark.sql.hive.HiveContext > import org.apache.hadoop.fs.Path > import org.apache.hadoop.io.{LongWritable, Text} > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat > import org.apache.spark.sql.SaveMode > object Client_log { > def main(args: Array[String]): Unit = { > val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * > from temp.x_streaming where year=2015 and month=12 and day=1").dtypes > var columns = resultCols.filter(x => > !Commons.stopColumns.contains(x._1)).map({ case (name, types) => { > s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as > ${Commons.mapType(types)}) as $name""" > } > }) > columns ++= List("'streaming' as sourcefrom") > def f(path:Path): Boolean = { > true > } > val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, > TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false) > client_log_d_stream.foreachRDD(rdd => { > val localHiveContext = new HiveContext(rdd.sparkContext) > import localHiveContext.implicits._ > var input = rdd.map(x => Record(x._2.toString)).toDF() > input = input.selectExpr(columns: _*) > input = > SmallOperators.populate(input, resultCols) > input > .write > .mode(SaveMode.Append) > .format("parquet") > .insertInto("temp.x_streaming") > }) > Spark.ssc.start() > Spark.ssc.awaitTermination() > } > case class Record(s: String) > } > {code} > This code generates a lot of trash directories in resalt table like: > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 
> .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
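A minimal sketch of the usual answer to the question above, assuming Spark's SQLContext.getOrCreate: derive a singleton context from the RDD's SparkContext inside foreachRDD rather than constructing a new HiveContext per batch. The stream and DataFrame handling are taken from the snippet quoted above.
{code}
import org.apache.spark.sql.SQLContext

// Sketch only: one shared SQLContext per JVM, obtained from the RDD's SparkContext.
client_log_d_stream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  // ... build the DataFrame and write it, as in the code quoted above
}
{code}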
[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860023#comment-15860023 ] Egor Pahomov commented on SPARK-19524: -- I'm really confused. I expected "new" to mean the files created after the start of the streaming job, and "old" to be everything else in the folder. If we change the definition of "new", then I believe everything is consistent; it's just that I'm not sure this definition of "new" is very intuitive. I want to process everything in the folder - existing and upcoming. I use this flag, and now it turns out that this flag has its own definition of "new". Maybe I'm not correct to call it a bug, but isn't it all very confusing for a person who does not really know how everything works inside? > newFilesOnly does not work according to docs. > -- > > Key: SPARK-19524 > URL: https://issues.apache.org/jira/browse/SPARK-19524 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > Docs says: > newFilesOnly > Should process only new files and ignore existing files in the directory > It's not working. > http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files > says, that it shouldn't work as expected. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala > not clear at all in terms, what code tries to do -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
[ https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858913#comment-15858913 ] Egor Pahomov edited comment on SPARK-19523 at 2/9/17 2:52 AM: -- @zsxwing, I haven't found the way to do that. inside foreachRDD there is no build HiveContext as far as I know. And when I build HiveContext outside and just use it inside, it ERROR me, that I can't have multiple SparkContext in one JVM. WHich is reasonable - it sound like very bad idea to serialise spark context was (Author: epahomov): @zsxwing, I haven't found the way to that. inside foreachRDD there is no build HiveContext as far as I know. And when I build HiveContext outside and just use it inside, it ERROR me, that I can't have multiple SparkContext in one JVM. WHich is reasonable - it sound like very bad idea to serialise spark context > Spark streaming+ insert into table leaves bunch of trash in table directory > --- > > Key: SPARK-19523 > URL: https://issues.apache.org/jira/browse/SPARK-19523 > Project: Spark > Issue Type: Improvement > Components: DStreams, SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov >Priority: Minor > > I have very simple code, which transform coming json files into pq table: > {code} > import org.apache.spark.sql.hive.HiveContext > import org.apache.hadoop.fs.Path > import org.apache.hadoop.io.{LongWritable, Text} > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat > import org.apache.spark.sql.SaveMode > object Client_log { > def main(args: Array[String]): Unit = { > val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * > from temp.x_streaming where year=2015 and month=12 and day=1").dtypes > var columns = resultCols.filter(x => > !Commons.stopColumns.contains(x._1)).map({ case (name, types) => { > s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as > ${Commons.mapType(types)}) as $name""" > } > }) > columns ++= List("'streaming' as sourcefrom") > def f(path:Path): Boolean = { > true > } > val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, > TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false) > client_log_d_stream.foreachRDD(rdd => { > val localHiveContext = new HiveContext(rdd.sparkContext) > import localHiveContext.implicits._ > var input = rdd.map(x => Record(x._2.toString)).toDF() > input = input.selectExpr(columns: _*) > input = > SmallOperators.populate(input, resultCols) > input > .write > .mode(SaveMode.Append) > .format("parquet") > .insertInto("temp.x_streaming") > }) > Spark.ssc.start() > Spark.ssc.awaitTermination() > } > case class Record(s: String) > } > {code} > This code generates a lot of trash directories in resalt table like: > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > 
.hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
[ https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858913#comment-15858913 ] Egor Pahomov commented on SPARK-19523: -- @zsxwing, I haven't found the way to that. inside foreachRDD there is no build HiveContext as far as I know. And when I build HiveContext outside and just use it inside, it ERROR me, that I can't have multiple SparkContext in one JVM. WHich is reasonable - it sound like very bad idea to serialise spark context > Spark streaming+ insert into table leaves bunch of trash in table directory > --- > > Key: SPARK-19523 > URL: https://issues.apache.org/jira/browse/SPARK-19523 > Project: Spark > Issue Type: Improvement > Components: DStreams, SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov >Priority: Minor > > I have very simple code, which transform coming json files into pq table: > {code} > import org.apache.spark.sql.hive.HiveContext > import org.apache.hadoop.fs.Path > import org.apache.hadoop.io.{LongWritable, Text} > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat > import org.apache.spark.sql.SaveMode > object Client_log { > def main(args: Array[String]): Unit = { > val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * > from temp.x_streaming where year=2015 and month=12 and day=1").dtypes > var columns = resultCols.filter(x => > !Commons.stopColumns.contains(x._1)).map({ case (name, types) => { > s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as > ${Commons.mapType(types)}) as $name""" > } > }) > columns ++= List("'streaming' as sourcefrom") > def f(path:Path): Boolean = { > true > } > val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, > TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false) > client_log_d_stream.foreachRDD(rdd => { > val localHiveContext = new HiveContext(rdd.sparkContext) > import localHiveContext.implicits._ > var input = rdd.map(x => Record(x._2.toString)).toDF() > input = input.selectExpr(columns: _*) > input = > SmallOperators.populate(input, resultCols) > input > .write > .mode(SaveMode.Append) > .format("parquet") > .insertInto("temp.x_streaming") > }) > Spark.ssc.start() > Spark.ssc.awaitTermination() > } > case class Record(s: String) > } > {code} > This code generates a lot of trash directories in resalt table like: > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > 
.hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858912#comment-15858912 ] Egor Pahomov commented on SPARK-19524: -- [~uncleGen], sorry, I haven't understood. > newFilesOnly does not work according to docs. > -- > > Key: SPARK-19524 > URL: https://issues.apache.org/jira/browse/SPARK-19524 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > Docs says: > newFilesOnly > Should process only new files and ignore existing files in the directory > It's not working. > http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files > says, that it shouldn't work as expected. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala > not clear at all in terms, what code tries to do -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858910#comment-15858910 ] Egor Pahomov commented on SPARK-19524: -- [~srowen], based on the documentation, which says "newFilesOnly - Should process only new files and ignore existing files in the directory", I expect that files which already exist in the folder I connect the stream to would be processed. That's not true. In reality, only files created after time X are processed. Time X:
{code}
private val durationToRemember = slideDuration * numBatchesToRemember
val modTimeIgnoreThreshold = math.max(
  initialModTimeIgnoreThreshold,                 // initial threshold based on newFilesOnly setting
  currentTime - durationToRemember.milliseconds  // trailing end of the remember window
)
{code}
First, this code contradicts the documentation as far as I understand it. Second, it contradicts the name "newFilesOnly" itself. There is probably a motivation behind this code, but I think the only way to find it is to dig through the git history and go over all the tickets. Sorry that I wasn't more specific the first time. > newFilesOnly does not work according to docs. > -- > > Key: SPARK-19524 > URL: https://issues.apache.org/jira/browse/SPARK-19524 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > Docs says: > newFilesOnly > Should process only new files and ignore existing files in the directory > It's not working. > http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files > says, that it shouldn't work as expected. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala > not clear at all in terms, what code tries to do -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19524) newFilesOnly does not work according to docs.
Egor Pahomov created SPARK-19524: Summary: newFilesOnly does not work according to docs. Key: SPARK-19524 URL: https://issues.apache.org/jira/browse/SPARK-19524 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.0.2 Reporter: Egor Pahomov The docs say: newFilesOnly - Should process only new files and ignore existing files in the directory. It's not working. http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files suggests that it does not work the way one would expect. https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala is not clear at all about what the code is trying to do. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
Egor Pahomov created SPARK-19523: Summary: Spark streaming+ insert into table leaves bunch of trash in table directory Key: SPARK-19523 URL: https://issues.apache.org/jira/browse/SPARK-19523 Project: Spark Issue Type: Bug Components: SQL, Structured Streaming Affects Versions: 2.0.2 Reporter: Egor Pahomov I have very simple code, which transform coming json files into pq table: {code} import org.apache.spark.sql.hive.HiveContext import org.apache.hadoop.fs.Path import org.apache.hadoop.io.{LongWritable, Text} import org.apache.hadoop.mapreduce.lib.input.TextInputFormat import org.apache.spark.sql.SaveMode object Client_log { def main(args: Array[String]): Unit = { val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * from temp.x_streaming where year=2015 and month=12 and day=1").dtypes var columns = resultCols.filter(x => !Commons.stopColumns.contains(x._1)).map({ case (name, types) => { s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as ${Commons.mapType(types)}) as $name""" } }) columns ++= List("'streaming' as sourcefrom") def f(path:Path): Boolean = { true } val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false) client_log_d_stream.foreachRDD(rdd => { val localHiveContext = new HiveContext(rdd.sparkContext) import localHiveContext.implicits._ var input = rdd.map(x => Record(x._2.toString)).toDF() input = input.selectExpr(columns: _*) input = SmallOperators.populate(input, resultCols) input .write .mode(SaveMode.Append) .format("parquet") .insertInto("temp.x_streaming") }) Spark.ssc.start() Spark.ssc.awaitTermination() } case class Record(s: String) } {code} This code generates a lot of trash directories in resalt table like: drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1 drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.
[ https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786368#comment-15786368 ] Egor Pahomov commented on SPARK-18930: -- I'm not sure that such a restriction, buried in the documentation, is OK. Basically the problem is: I've created a correct schema for the table and correctly inserted into it, but for some reason I need to keep a particular order of columns in the select statement. > Inserting in partitioned table - partitioned field should be last in select > statement. > --- > > Key: SPARK-18930 > URL: https://issues.apache.org/jira/browse/SPARK-18930 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > CREATE TABLE temp.test_partitioning_4 ( > num string > ) > PARTITIONED BY ( > day string) > stored as parquet > INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) > select day, count(*) as num from > hss.session where year=2016 and month=4 > group by day > Resulted schema on HDFS: /temp.db/test_partitioning_3/day=62456298, > emp.db/test_partitioning_3/day=69094345 > As you can imagine these numbers are num of records. But! When I do select * > from temp.test_partitioning_4 data is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
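A hedged rewrite of the insert from the description, assuming the fix is simply to list the partitioning column last in the SELECT so it maps to the dynamic-partition slot rather than to the num column; sqlContext here stands for whatever HiveContext/SQLContext is already in use.
{code}
// Sketch only: the partition column (day) comes last in the SELECT list.
sqlContext.sql("""
  INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
  SELECT count(*) AS num, day
  FROM hss.session
  WHERE year = 2016 AND month = 4
  GROUP BY day
""")
{code}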
[jira] [Created] (SPARK-18931) Create empty staging directory in partitioned table on insert
Egor Pahomov created SPARK-18931: Summary: Create empty staging directory in partitioned table on insert Key: SPARK-18931 URL: https://issues.apache.org/jira/browse/SPARK-18931 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Egor Pahomov CREATE TABLE temp.test_partitioning_4 ( num string ) PARTITIONED BY ( day string) stored as parquet On every INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) select day, count(*) as num from hss.session where year=2016 and month=4 group by day a new directory ".hive-staging_hive_2016-12-19_15-55-11_298_3412488541559534475-4" is created on HDFS. It's a big issue, because I insert every day, and a bunch of empty dirs is very bad for HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
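A possible mitigation sketch, assuming the Hive setting hive.exec.stagingdir is honoured by the HiveContext in use; the path is only illustrative, and this merely moves the staging directories out of the table location rather than fixing why they are left behind.
{code}
// Assumption: relocating Hive's staging directory keeps ".hive-staging_*" leftovers
// from accumulating under the table's own HDFS directory.
hiveContext.setConf("hive.exec.stagingdir", "/tmp/hive-staging")
{code}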
[jira] [Created] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.
Egor Pahomov created SPARK-18930: Summary: Inserting in partitioned table - partitioned field should be last in select statement. Key: SPARK-18930 URL: https://issues.apache.org/jira/browse/SPARK-18930 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Egor Pahomov CREATE TABLE temp.test_partitioning_4 ( num string ) PARTITIONED BY ( day string) stored as parquet INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) select day, count(*) as num from hss.session where year=2016 and month=4 group by day Resulted schema on HDFS: /temp.db/test_partitioning_3/day=62456298, emp.db/test_partitioning_3/day=69094345 As you can imagine these numbers are num of records. But! When I do select * from temp.test_partitioning_4 data is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18336) SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2
[ https://issues.apache.org/jira/browse/SPARK-18336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov closed SPARK-18336. Resolution: Invalid Moving from 1.6 to 2.0 forced me use spark-submit instead of using spark assembly jar. I though, that if I define driver memory in spark-defaults.conf spark-submit would read it before creating java machine and would create driver with configured memory. I was wrong - I needed to specify it though parameter. It's counterintuitive, but I was wrong to make such assumption. > SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2 > > > Key: SPARK-18336 > URL: https://issues.apache.org/jira/browse/SPARK-18336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > I had several(~100) quires, which were run one after another in single spark > context. I can provide code of runner - it's very simple. It worked fine on > 1.6.2, than I moved to 2551d959a6c9fb27a54d38599a2301d735532c24 (branch-2.0 > on 31.10.2016 17:04:12). It started to fail with OOM and other errors. When I > separate my 100 quires to 2 sets and run set after set it works fine. I would > suspect problems with memory on driver, but nothing points to that. > My conf: > {code} > lazy val sparkConfTemplate = new SparkConf() > .setMaster("yarn-client") > .setAppName(appName) > .set("spark.executor.memory", "25g") > .set("spark.executor.instances", "40") > .set("spark.dynamicAllocation.enabled", "false") > .set("spark.yarn.executor.memoryOverhead", "3000") > .set("spark.executor.cores", "6") > .set("spark.driver.memory", "25g") > .set("spark.driver.cores", "5") > .set("spark.yarn.am.memory", "20g") > .set("spark.shuffle.io.numConnectionsPerPeer", "5") > .set("spark.sql.autoBroadcastJoinThreshold", "10") > .set("spark.network.timeout", "4000s") > .set("spark.driver.maxResultSize", "5g") > .set("spark.sql.parquet.compression.codec", "gzip") > .set("spark.kryoserializer.buffer.max", "1200m") > .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > .set("spark.yarn.driver.memoryOverhead", "1000") > .set("spark.scheduler.mode", "FIFO") > .set("spark.sql.broadcastTimeout", "2") > .set("spark.akka.frameSize", "200") > .set("spark.sql.shuffle.partitions", partitions) > .set("spark.network.timeout", "1000s") > > .setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath())) > {code} > Errors, which started to happen: > {code} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f04c6cf3ea8, pid=17479, tid=139658116687616 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > # Problematic frame: > # V [libjvm.so+0x64bea8] > InstanceKlass::oop_follow_contents(ParCompactionManager*, oopDesc*)+0x88 > # > # Failed to write core dump. Core dumps have been disabled. 
To enable core > dumping, try "ulimit -c unlimited" before starting Java again > # > # An error report file with more information is saved as: > # /home/egor/hs_err_pid17479.log > # > # If you would like to submit a bug report, please visit: > # http://bugreport.java.com/bugreport/crash.jsp > # > {code} > {code} > Exception in thread "refresh progress" java.lang.OutOfMemoryError: Java heap > space > at scala.collection.immutable.Iterable$.newBuilder(Iterable.scala:44) > at scala.collection.Iterable$.newBuilder(Iterable.scala:50) > at > scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:70) > at > scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:104) > at > scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57) > at > scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52) > at > scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.SparkStatusTracker.getActiveStageIds(SparkStatusTracker.scala:61) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:66) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:54) > at java.util.TimerThread.mainLoop(Unknown Source) > at
[jira] [Commented] (SPARK-18336) SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2
[ https://issues.apache.org/jira/browse/SPARK-18336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648912#comment-15648912 ] Egor Pahomov commented on SPARK-18336: -- [~srowen], I've read everything in documentation about new memory management and tryied next config: {code} .set("spark.executor.memory", "28g") .set("spark.executor.instances", "50") .set("spark.dynamicAllocation.enabled", "false") .set("spark.yarn.executor.memoryOverhead", "3000") .set("spark.executor.cores", "6") .set("spark.driver.memory", "25g") .set("spark.driver.cores", "5") .set("spark.yarn.am.memory", "20g") .set("spark.shuffle.io.numConnectionsPerPeer", "5") .set("spark.sql.autoBroadcastJoinThreshold", "30485760") .set("spark.network.timeout", "4000s") .set("spark.driver.maxResultSize", "5g") .set("spark.sql.parquet.compression.codec", "gzip") .set("spark.kryoserializer.buffer.max", "1200m") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.yarn.driver.memoryOverhead", "1000") .set("spark.memory.storageFraction", "0.2") .set("spark.memory.fraction", "0.8") .set("spark.sql.parquet.cacheMetadata", "false") .set("spark.scheduler.mode", "FIFO") .set("spark.sql.broadcastTimeout", "2") .set("spark.akka.frameSize", "200") .set("spark.sql.shuffle.partitions", partitions) .set("spark.network.timeout", "1000s") {code} Is there a manual - "I'm happy with how it was working before, can I configure my job be exactly like old times" ? Can I read somewere what happened to Tungsten project, beacause it seemed like big deal and now it turned of by default? > SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2 > > > Key: SPARK-18336 > URL: https://issues.apache.org/jira/browse/SPARK-18336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > I had several(~100) quires, which were run one after another in single spark > context. I can provide code of runner - it's very simple. It worked fine on > 1.6.2, than I moved to 2551d959a6c9fb27a54d38599a2301d735532c24 (branch-2.0 > on 31.10.2016 17:04:12). It started to fail with OOM and other errors. When I > separate my 100 quires to 2 sets and run set after set it works fine. I would > suspect problems with memory on driver, but nothing points to that. 
> My conf: > {code} > lazy val sparkConfTemplate = new SparkConf() > .setMaster("yarn-client") > .setAppName(appName) > .set("spark.executor.memory", "25g") > .set("spark.executor.instances", "40") > .set("spark.dynamicAllocation.enabled", "false") > .set("spark.yarn.executor.memoryOverhead", "3000") > .set("spark.executor.cores", "6") > .set("spark.driver.memory", "25g") > .set("spark.driver.cores", "5") > .set("spark.yarn.am.memory", "20g") > .set("spark.shuffle.io.numConnectionsPerPeer", "5") > .set("spark.sql.autoBroadcastJoinThreshold", "10") > .set("spark.network.timeout", "4000s") > .set("spark.driver.maxResultSize", "5g") > .set("spark.sql.parquet.compression.codec", "gzip") > .set("spark.kryoserializer.buffer.max", "1200m") > .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > .set("spark.yarn.driver.memoryOverhead", "1000") > .set("spark.scheduler.mode", "FIFO") > .set("spark.sql.broadcastTimeout", "2") > .set("spark.akka.frameSize", "200") > .set("spark.sql.shuffle.partitions", partitions) > .set("spark.network.timeout", "1000s") > > .setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath())) > {code} > Errors, which started to happen: > {code} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f04c6cf3ea8, pid=17479, tid=139658116687616 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > # Problematic frame: > # V [libjvm.so+0x64bea8] > InstanceKlass::oop_follow_contents(ParCompactionManager*, oopDesc*)+0x88 > # > # Failed to write core dump. Core dumps have been disabled. To enable core > dumping, try "ulimit -c unlimited" before starting Java again > # > # An error report file with more information is saved as: > # /home/egor/hs_err_pid17479.log > # > # If you would like to submit a bug report, please visit: > # http://bugreport.java.com/bugreport/crash.jsp > # > {code} > {code} > Exception in thread "refresh progress" java.lang.OutOfMemoryError: Java heap > space > at scala.collection.immutable.Iterable$.newBuilder(Iterable.scala:44) > at
[jira] [Created] (SPARK-18336) SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2
Egor Pahomov created SPARK-18336: Summary: SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2 Key: SPARK-18336 URL: https://issues.apache.org/jira/browse/SPARK-18336 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Egor Pahomov I had several(~100) quires, which were run one after another in single spark context. I can provide code of runner - it's very simple. It worked fine on 1.6.2, than I moved to 2551d959a6c9fb27a54d38599a2301d735532c24 (branch-2.0 on 31.10.2016 17:04:12). It started to fail with OOM and other errors. When I separate my 100 quires to 2 sets and run set after set it works fine. I would suspect problems with memory on driver, but nothing points to that. My conf: {code} lazy val sparkConfTemplate = new SparkConf() .setMaster("yarn-client") .setAppName(appName) .set("spark.executor.memory", "25g") .set("spark.executor.instances", "40") .set("spark.dynamicAllocation.enabled", "false") .set("spark.yarn.executor.memoryOverhead", "3000") .set("spark.executor.cores", "6") .set("spark.driver.memory", "25g") .set("spark.driver.cores", "5") .set("spark.yarn.am.memory", "20g") .set("spark.shuffle.io.numConnectionsPerPeer", "5") .set("spark.sql.autoBroadcastJoinThreshold", "10") .set("spark.network.timeout", "4000s") .set("spark.driver.maxResultSize", "5g") .set("spark.sql.parquet.compression.codec", "gzip") .set("spark.kryoserializer.buffer.max", "1200m") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.yarn.driver.memoryOverhead", "1000") .set("spark.scheduler.mode", "FIFO") .set("spark.sql.broadcastTimeout", "2") .set("spark.akka.frameSize", "200") .set("spark.sql.shuffle.partitions", partitions) .set("spark.network.timeout", "1000s") .setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath())) {code} Errors, which started to happen: {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x7f04c6cf3ea8, pid=17479, tid=139658116687616 # # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops) # Problematic frame: # V [libjvm.so+0x64bea8] InstanceKlass::oop_follow_contents(ParCompactionManager*, oopDesc*)+0x88 # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try "ulimit -c unlimited" before starting Java again # # An error report file with more information is saved as: # /home/egor/hs_err_pid17479.log # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp # {code} {code} Exception in thread "refresh progress" java.lang.OutOfMemoryError: Java heap space at scala.collection.immutable.Iterable$.newBuilder(Iterable.scala:44) at scala.collection.Iterable$.newBuilder(Iterable.scala:50) at scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:70) at scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:104) at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57) at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52) at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229) at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.SparkStatusTracker.getActiveStageIds(SparkStatusTracker.scala:61) at org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:66) at org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:54) at java.util.TimerThread.mainLoop(Unknown Source) at java.util.TimerThread.run(Unknown Source) java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.sql.hive.client.Shim_v0_13.getAllPartitions(HiveShim.scala:431) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitions$1.apply(HiveClientImpl.scala:538) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitions$1.apply(HiveClientImpl.scala:535) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:280)
[jira] [Created] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "
Egor Pahomov created SPARK-17616: Summary: Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate " Key: SPARK-17616 URL: https://issues.apache.org/jira/browse/SPARK-17616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Egor Pahomov Priority: Minor I execute: {code} select platform, collect_set(user_auth) as paid_types, count(distinct sessionid) as sessions from non_hss.session where event = 'stop' and platform != 'testplatform' and not (month = MONTH(current_date()) AND year = YEAR(current_date()) and day = day(current_date())) and ( (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = YEAR(add_months(CURRENT_DATE(), -5))) OR (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > YEAR(add_months(CURRENT_DATE(), -5))) ) group by platform {code} I get: {code} java.lang.RuntimeException: Distinct columns cannot exist in Aggregate operator containing aggregate functions which don't support partial aggregation. {code} IT WORKED IN 1.6.2. I've read the error 5 times and the code once. I still don't understand what I'm doing incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
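A hedged workaround sketch for the query above: compute collect_set and count(DISTINCT ...) in separate aggregates and join them, so that no single Aggregate operator mixes the two. The date filters are omitted for brevity and sqlContext stands for an existing context.
{code}
// Sketch only: splitting the aggregations avoids combining collect_set with
// count(DISTINCT ...) in one Aggregate operator.
sqlContext.sql("""
  SELECT a.platform, a.paid_types, b.sessions
  FROM (SELECT platform, collect_set(user_auth) AS paid_types
        FROM non_hss.session WHERE event = 'stop' GROUP BY platform) a
  JOIN (SELECT platform, count(DISTINCT sessionid) AS sessions
        FROM non_hss.session WHERE event = 'stop' GROUP BY platform) b
    ON a.platform = b.platform
""")
{code}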
[jira] [Created] (SPARK-17615) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "
Egor Pahomov created SPARK-17615: Summary: Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate " Key: SPARK-17615 URL: https://issues.apache.org/jira/browse/SPARK-17615 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Egor Pahomov Priority: Minor I execute: {code} select platform, collect_set(user_auth) as paid_types, count(distinct sessionid) as sessions from non_hss.session where event = 'stop' and platform != 'testplatform' and not (month = MONTH(current_date()) AND year = YEAR(current_date()) and day = day(current_date())) and ( (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = YEAR(add_months(CURRENT_DATE(), -5))) OR (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > YEAR(add_months(CURRENT_DATE(), -5))) ) group by platform {code} I get: {code} java.lang.RuntimeException: Distinct columns cannot exist in Aggregate operator containing aggregate functions which don't support partial aggregation. {code} IT WORKED IN 1.6.2. I've read error 5 times, and read code once. I still don't understand what I do incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary
Egor Pahomov created SPARK-17557: Summary: SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary Key: SPARK-17557 URL: https://issues.apache.org/jira/browse/SPARK-17557 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Egor Pahomov Working on 1.6.2, broken on 2.0 {code} select * from logs.a where year=2016 and month=9 and day=14 limit 100 {code} {code} java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48) at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
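A hedged sketch of a common way to sidestep vectorized-reader decode errors like the one above, assuming the non-vectorized Parquet reader can still read this table; it trades scan performance for compatibility and does not address the underlying decoding bug.
{code}
// Assumption: disabling the vectorized reader avoids the dictionary-decoding path
// (OnHeapColumnVector.getInt) that throws here.
sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "false")
{code}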
[jira] [Created] (SPARK-17364) Can not query hive table starting with number
Egor Pahomov created SPARK-17364: Summary: Can not query hive table starting with number Key: SPARK-17364 URL: https://issues.apache.org/jira/browse/SPARK-17364 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Egor Pahomov I can do it with spark-1.6.2 {code} SELECT * from temp.20160826_ip_list limit 100 {code} {code} Error: org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19) == SQL == SELECT * from temp.20160826_ip_list limit 100 ---^^^ SQLState: null ErrorCode: 0 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
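A hedged workaround sketch for the parser error above: quote the table name in backticks, which the grammar accepts as a BACKQUOTED_IDENTIFIER; sqlContext again stands for an existing context.
{code}
// Sketch only: backticks let the parser accept an identifier that starts with a digit.
sqlContext.sql("SELECT * FROM temp.`20160826_ip_list` LIMIT 100")
{code}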
[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432040#comment-15432040 ] Egor Pahomov commented on SPARK-16334: -- (just reason for reopen) > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Assignee: Sameer Agarwal >Priority: Critical > Labels: sql > Fix For: 2.0.1, 2.1.0 > > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov reopened SPARK-16334: -- Seems like a lot of people still have a problem even after suggested fix > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Assignee: Sameer Agarwal >Priority: Critical > Labels: sql > Fix For: 2.0.1, 2.1.0 > > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
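For anyone still hitting this after the fix, a mitigation that is sometimes tried (a hedged sketch, not a confirmed fix for this ticket) is to disable the vectorized Parquet reader, since VectorizedColumnReader.decodeDictionaryIds is where the stack trace above fails, and fall back to the slower row-based reader:
{code}
// spark-shell (Scala) sketch for Spark 2.0.x; table name and filter are from the report above.
// Turning off the vectorized reader bypasses VectorizedColumnReader entirely, at the cost of scan speed.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("select * from blabla where user_id = 415706251").show()
{code}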
[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384736#comment-15384736 ] Egor Pahomov commented on SPARK-16334: -- [~sameerag] that particular case on which I experienced the problem - now working for me. It's hard for me to reproduce what exactly made the bug appear before - would not go into such much trouble. Thanks > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Labels: sql > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380249#comment-15380249 ] Egor Pahomov commented on SPARK-16334: -- Sure, would test on Monday. > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Labels: sql > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data
[ https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378112#comment-15378112 ] Egor Pahomov commented on SPARK-16548: -- [~sowen] if your concern is "silently", we can write errors to the logs. WDYT? > java.io.CharConversionException: Invalid UTF-32 character prevents me from > querying my data > > > Key: SPARK-16548 > URL: https://issues.apache.org/jira/browse/SPARK-16548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Egor Pahomov >Priority: Minor > > Basically, when I query my json data I get > {code} > java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above > 10) at char #192, byte #771) > at > com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) > at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571) > at > org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142) > {code} > I do not like it. If you can not process one json among 100500 please return > null, do not fail everything. I have dirty one line fix, and I understand how > I can make it more reasonable. What is our position - what behaviour we wanna > get? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data
Egor Pahomov created SPARK-16548: Summary: java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data Key: SPARK-16548 URL: https://issues.apache.org/jira/browse/SPARK-16548 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Reporter: Egor Pahomov Priority: Minor Basically, when I query my JSON data I get {code} java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above 10) at char #192, byte #771) at com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571) at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142) {code} I do not like this behaviour. If one JSON record out of 100500 cannot be processed, please return null for it instead of failing everything. I have a dirty one-line fix, and I understand how to make it more reasonable. What is our position - what behaviour do we want? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
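The "return null instead of failing" behaviour asked for above can be approximated on the user side with a UDF that swallows decode errors per record. This is only a sketch, assuming a Spark 2.x SparkSession named spark and a string column named line; it is not the one-line fix mentioned in the report.
{code}
import java.io.IOException
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.functions.{col, udf}

// Extract $.event per record; any record Jackson cannot decode yields null instead of
// failing the whole stage. The per-record ObjectMapper keeps the sketch simple; cache it
// in real code.
val safeEvent = udf { (line: String) =>
  if (line == null) null
  else {
    try new ObjectMapper().readTree(line).path("event").asText(null)
    catch { case _: IOException => null } // covers CharConversionException and parse errors
  }
}

// Usage, assuming a DataFrame df with a string column "line":
// df.select(safeEvent(col("line")).as("event")).show()
{code}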
[jira] [Created] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
Egor Pahomov created SPARK-16334: Summary: [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException Key: SPARK-16334 URL: https://issues.apache.org/jira/browse/SPARK-16334 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Egor Pahomov Priority: Critical Query: {code} select * from blabla where user_id = 415706251 {code} Error: {code} 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov updated SPARK-16334: - Labels: sql (was: ) > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Labels: sql > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16228) "Percentile" needs explicit cast to double
[ https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351697#comment-15351697 ] Egor Pahomov commented on SPARK-16228: -- [~srowen] "blocker" is questionable, I agree. I just believe that anything which prevents you from moving from 1.6.1 to 2.0 without major code changes is a "blocker". If I migrated while such a bug was still there, a lot of my analysts' notebooks would be invalid. > "Percentile" needs explicit cast to double > -- > > Key: SPARK-16228 > URL: https://issues.apache.org/jira/browse/SPARK-16228 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov > > {quote} > select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla > {quote} > Works. > {quote} > select percentile(cast(id as bigint), 0.5 ) from temp.bla > {quote} > Throws > {quote} > Error in query: No handler for Hive UDF > 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': > org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method > for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, > decimal(38,18)). Possible choices: _FUNC_(bigint, array) > _FUNC_(bigint, double) ; line 1 pos 7 > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16228) "Percentile" needs explicit cast to double
[ https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov updated SPARK-16228: - Description: {quote} select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla {quote} Works. {quote} select percentile(cast(id as bigint), 0.5 ) from temp.bla {quote} Throws {quote} Error in query: No handler for Hive UDF 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, decimal(38,18)). Possible choices: _FUNC_(bigint, array) _FUNC_(bigint, double) ; line 1 pos 7 {quote} was: select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla Works. select percentile(cast(id as bigint), 0.5 ) from temp.bla Throws Error in query: No handler for Hive UDF 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, decimal(38,18)). Possible choices: _FUNC_(bigint, array) _FUNC_(bigint, double) ; line 1 pos 7 > "Percentile" needs explicit cast to double > -- > > Key: SPARK-16228 > URL: https://issues.apache.org/jira/browse/SPARK-16228 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Blocker > > {quote} > select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla > {quote} > Works. > {quote} > select percentile(cast(id as bigint), 0.5 ) from temp.bla > {quote} > Throws > {quote} > Error in query: No handler for Hive UDF > 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': > org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method > for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, > decimal(38,18)). Possible choices: _FUNC_(bigint, array) > _FUNC_(bigint, double) ; line 1 pos 7 > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16228) "Percentile" needs explicit cast to double
Egor Pahomov created SPARK-16228: Summary: "Percentile" needs explicit cast to double Key: SPARK-16228 URL: https://issues.apache.org/jira/browse/SPARK-16228 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Egor Pahomov Priority: Blocker select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla Works. select percentile(cast(id as bigint), 0.5 ) from temp.bla Throws Error in query: No handler for Hive UDF 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, decimal(38,18)). Possible choices: _FUNC_(bigint, array) _FUNC_(bigint, double) ; line 1 pos 7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
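For what it is worth, the root cause can be seen from the literal's type alone (a hedged illustration, assuming a Spark 2.x spark-shell and the temp.bla table from the report): 0.5 is parsed as a decimal, not a double, and Hive's UDAFPercentile has no decimal overload, which is why the explicit cast in the first query above is needed.
{code}
// The bare literal comes back as a decimal column, so UDAFPercentile's (bigint, double)
// signature is not matched without the cast.
spark.sql("SELECT 0.5").printSchema()                   // shows a decimal(...) column
spark.sql("SELECT CAST(0.5 AS DOUBLE)").printSchema()   // shows a double column

// The workaround from the report, with the explicit cast:
spark.sql("SELECT percentile(CAST(id AS BIGINT), CAST(0.5 AS DOUBLE)) FROM temp.bla").show()
{code}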
[jira] [Commented] (SPARK-15934) Return binary mode in ThriftServer
[ https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328814#comment-15328814 ] Egor Pahomov commented on SPARK-15934: -- Sure, let me create a pull request tomorrow. I would test, that everything working with all tools I mentioned - Tableau, DataGrip, Squirrel. > Return binary mode in ThriftServer > -- > > Key: SPARK-15934 > URL: https://issues.apache.org/jira/browse/SPARK-15934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov > > In spark-2.0.0 preview binary mode was turned off (SPARK-15095). > It was greatly irresponsible step due to the fact, that in 1.6.1 binary mode > was default and it turned off in 2.0.0. > Just to describe magnitude of harm not fixing this bug would do in my > organization: > * Tableau works only though Thrift Server and only with binary format. > Tableau would not work with spark-2.0.0 at all! > * I have bunch of analysts in my organization with configured sql > clients(DataGrip and Squirrel). I would need to go one by one to change > connection string for them(DataGrip). Squirrel simply do not work with http - > some jar hell in my case. > * let me not mention all other stuff which connects to our data > infrastructure through ThriftServer as gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15934) Return binary mode in ThriftServer
Egor Pahomov created SPARK-15934: Summary: Return binary mode in ThriftServer Key: SPARK-15934 URL: https://issues.apache.org/jira/browse/SPARK-15934 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Egor Pahomov In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095). This was a highly irresponsible step, given that binary mode was the default in 1.6.1 and is turned off in 2.0.0. Just to describe the magnitude of harm not fixing this bug would do in my organization: * Tableau works only through the Thrift Server and only with the binary format. Tableau would not work with spark-2.0.0 at all! * I have a bunch of analysts in my organization with configured SQL clients (DataGrip and Squirrel). I would need to go one by one to change the connection string for them (DataGrip). Squirrel simply does not work with http - some jar hell in my case. * Let me not mention all the other tools which connect to our data infrastructure through the ThriftServer as a gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
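For context on what changing the transport means for clients, a hedged sketch of the two HiveServer2-style JDBC connection strings involved (host, ports and database are placeholders, not values from this ticket; a hive-jdbc compatible driver is assumed on the classpath). The binary form is what the 1.6.1-era clients above are configured with; the http form is what each of them would have to be switched to:
{code}
import java.sql.DriverManager

// Binary (plain Thrift socket) transport -- the 1.6.1 default that Tableau/DataGrip/Squirrel expect:
val binaryUrl = "jdbc:hive2://thrift-host:10000/default"

// HTTP transport -- requires editing every client's connection string:
val httpUrl = "jdbc:hive2://thrift-host:10001/default;transportMode=http;httpPath=cliservice"

val conn = DriverManager.getConnection(binaryUrl, "user", "")
{code}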
[jira] [Closed] (SPARK-15409) Can't run spark odbc server on yarn the way I did it on 1.6.1
[ https://issues.apache.org/jira/browse/SPARK-15409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov closed SPARK-15409. Resolution: Invalid > Can't run spark odbc server on yarn the way I did it on 1.6.1 > - > > Key: SPARK-15409 > URL: https://issues.apache.org/jira/browse/SPARK-15409 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Minor > > I'm getting error, while spark tries to run application in YARN - > {code} > 16/05/19 10:33:20 INFO yarn.Client: Application report for > application_1463075121059_12094 (state: ACCEPTED) > 16/05/19 10:33:21 INFO yarn.Client: Application report for > application_1463075121059_12094 (state: ACCEPTED) > 16/05/19 10:33:22 INFO yarn.Client: Application report for > application_1463075121059_12094 (state: ACCEPTED) > 16/05/19 10:33:23 WARN server.TransportChannelHandler: Exception in > connection from nod5-1-hadoop.anchorfree.net/192.168.12.128:56327 > java.lang.NoSuchMethodError: > java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView; > at > org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:107) > at > org.apache.spark.rpc.netty.NettyRpcHandler.channelActive(NettyRpcEnv.scala:618) > at > org.apache.spark.network.server.TransportRequestHandler.channelActive(TransportRequestHandler.java:86) > at > org.apache.spark.network.server.TransportChannelHandler.channelActive(TransportChannelHandler.java:89) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) > at > io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) > at > io.netty.handler.timeout.IdleStateHandler.channelActive(IdleStateHandler.java:251) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) > at > io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) > at > io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) > at > io.netty.channel.DefaultChannelPipeline.fireChannelActive(DefaultChannelPipeline.java:817) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:454) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) > at > io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > 16/05/19 10:33:23 WARN 
channel.DefaultChannelPipeline: An exception was > thrown by a user handler's exceptionCaught() method while handling the > following exception: > java.lang.NoSuchMethodError: > java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView; > at > org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:107) > at > org.apache.spark.rpc.netty.NettyRpcHandler.channelActive(NettyRpcEnv.scala:618) > at > org.apache.spark.network.server.TransportRequestHandler.channelActive(TransportRequestHandler.java:86) > at > org.apache.spark.network.server.TransportChannelHandler.channelActive(TransportChannelHandler.java:89) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) > at >
[jira] [Created] (SPARK-15409) Can't run spark odbc server on yarn the way I did it on 1.6.1
Egor Pahomov created SPARK-15409: Summary: Can't run spark odbc server on yarn the way I did it on 1.6.1 Key: SPARK-15409 URL: https://issues.apache.org/jira/browse/SPARK-15409 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.0.0 Reporter: Egor Pahomov Priority: Minor I'm getting error, while spark tries to run application in YARN - {code} 16/05/19 10:33:20 INFO yarn.Client: Application report for application_1463075121059_12094 (state: ACCEPTED) 16/05/19 10:33:21 INFO yarn.Client: Application report for application_1463075121059_12094 (state: ACCEPTED) 16/05/19 10:33:22 INFO yarn.Client: Application report for application_1463075121059_12094 (state: ACCEPTED) 16/05/19 10:33:23 WARN server.TransportChannelHandler: Exception in connection from nod5-1-hadoop.anchorfree.net/192.168.12.128:56327 java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView; at org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:107) at org.apache.spark.rpc.netty.NettyRpcHandler.channelActive(NettyRpcEnv.scala:618) at org.apache.spark.network.server.TransportRequestHandler.channelActive(TransportRequestHandler.java:86) at org.apache.spark.network.server.TransportChannelHandler.channelActive(TransportChannelHandler.java:89) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) at io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) at io.netty.handler.timeout.IdleStateHandler.channelActive(IdleStateHandler.java:251) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) at io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) at io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) at io.netty.channel.DefaultChannelPipeline.fireChannelActive(DefaultChannelPipeline.java:817) at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:454) at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) 16/05/19 10:33:23 WARN channel.DefaultChannelPipeline: An exception was thrown by a user handler's exceptionCaught() method while handling the following exception: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView; at 
org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:107) at org.apache.spark.rpc.netty.NettyRpcHandler.channelActive(NettyRpcEnv.scala:618) at org.apache.spark.network.server.TransportRequestHandler.channelActive(TransportRequestHandler.java:86) at org.apache.spark.network.server.TransportChannelHandler.channelActive(TransportChannelHandler.java:89) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) at io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64) at io.netty.handler.timeout.IdleStateHandler.channelActive(IdleStateHandler.java:251) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:183) at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:169) at
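A hedged note on that stack trace rather than anything confirmed in the ticket (which was closed as Invalid above): a NoSuchMethodError on ConcurrentHashMap.keySet() returning ConcurrentHashMap$KeySetView is the classic symptom of running bytecode built against Java 8 on a Java 7 JRE, since the KeySetView return type only exists on Java 8. Checking which JVM the driver and the YARN containers actually use is a quick first step:
{code}
// Print the runtime Java version on the driver; the same check can be run inside a job
// to see what the YARN executors use. The KeySetView signature requires 1.8.x.
println(System.getProperty("java.version"))
{code}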
[jira] [Created] (SPARK-13654) get_json_object fails with java.io.CharConversionException
Egor Pahomov created SPARK-13654: Summary: get_json_object fails with java.io.CharConversionException Key: SPARK-13654 URL: https://issues.apache.org/jira/browse/SPARK-13654 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Reporter: Egor Pahomov I execute next query on my data: {code} select count(distinct get_json_object(regexp_extract(line, "^\\p{ASCII}*$", 0), '$.event')) from (select line from logs.raw_client_log where year=2016 and month=2 and day>28 and line rlike "^\\p{ASCII}*$" and line is not null) a {code} And it fails with {code} Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 420 in stage 168.0 failed 4 times, most recent failure: Lost task 420.3 in stage 168.0 (TID 13064, nod5-2-hadoop.anchorfree.net): java.io.CharConversionException: Invalid UTF-32 character 0x6576656e(above 10) at char #47, byte #191) at com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571) at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142) at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:141) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2202) at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2.apply(jsonExpressions.scala:141) at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2.apply(jsonExpressions.scala:138) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2202) at org.apache.spark.sql.catalyst.expressions.GetJsonObject.eval(jsonExpressions.scala:138) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source) at org.apache.spark.sql.execution.Expand$$anonfun$doExecute$1$$anonfun$3$$anon$1.next(Expand.scala:76) at org.apache.spark.sql.execution.Expand$$anonfun$doExecute$1$$anonfun$3$$anon$1.next(Expand.scala:62) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at 
org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) {code} Basically Spark sells me the idea, that I have character 敮 in my data. But query {code} select line from logs.raw_client_log where year=2016 and month=2 and day>27 and line rlike "敮" {code} returns nothing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
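A hedged reading of that error message, not something confirmed in the ticket: 0x6576656e is not a character that exists in the data. It is four ASCII bytes - 'e', 'v', 'e', 'n', for example from the word "event" - misread as a single UTF-32 code point by Jackson's encoding auto-detection, and 敮 (U+656E) is just the last two of those bytes, which is why the rlike query above finds nothing. A quick check in plain Scala:
{code}
// Interpret the reported "character" as bytes.
val bytes = Array[Byte](0x65, 0x76, 0x65, 0x6e)
println(new String(bytes, "US-ASCII"))       // even  -- plain ASCII misdetected as UTF-32
println(Integer.toHexString(0x6576656e))     // 6576656e -- the code point from the error
println(Integer.toHexString('敮'.toInt))     // 656e -- only the last two bytes
{code}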
[jira] [Closed] (SPARK-4403) Elastic allocation (spark.dynamicAllocation.enabled) results in task never being executed.
[ https://issues.apache.org/jira/browse/SPARK-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov closed SPARK-4403. --- Resolution: Invalid Elastic allocation (spark.dynamicAllocation.enabled) results in task never being executed. Key: SPARK-4403 URL: https://issues.apache.org/jira/browse/SPARK-4403 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.1 Reporter: Egor Pahomov Attachments: ipython_out I execute ipython notebook + pyspark with spark.dynamicAllocation.enabled = true. The task never ends. Code: {code} import sys from random import random from operator import add partitions = 10 n = 10 * partitions def f(_): x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 + y ** 2 < 1 else 0 count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add) print "Pi is roughly %f" % (4.0 * count / n) {code} {code} IPYTHON_ARGS="notebook --profile=ydf --port $IPYTHON_PORT --port-retries=0 --ip='*' --no-browser" pyspark \ --verbose \ --master yarn-client \ --conf spark.driver.port=$((RANDOM_PORT + 2)) \ --conf spark.broadcast.port=$((RANDOM_PORT + 3)) \ --conf spark.replClassServer.port=$((RANDOM_PORT + 4)) \ --conf spark.blockManager.port=$((RANDOM_PORT + 5)) \ --conf spark.executor.port=$((RANDOM_PORT + 6)) \ --conf spark.fileserver.port=$((RANDOM_PORT + 7)) \ --conf spark.shuffle.service.enabled=true \ --conf spark.dynamicAllocation.enabled=true \ --conf spark.dynamicAllocation.minExecutors=1 \ --conf spark.dynamicAllocation.maxExecutors=10 \ --conf spark.ui.port=$SPARK_UI_PORT {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4403) Elastic allocation (spark.dynamicAllocation.enabled) results in task never being executed.
[ https://issues.apache.org/jira/browse/SPARK-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov updated SPARK-4403: Attachment: ipython_out Elastic allocation (spark.dynamicAllocation.enabled) results in task never being executed. Key: SPARK-4403 URL: https://issues.apache.org/jira/browse/SPARK-4403 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.1 Reporter: Egor Pahomov Attachments: ipython_out I execute ipython notebook + pyspark with spark.dynamicAllocation.enabled = true. The task never ends. Code: {code} import sys from random import random from operator import add partitions = 10 n = 10 * partitions def f(_): x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 + y ** 2 < 1 else 0 count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add) print "Pi is roughly %f" % (4.0 * count / n) {code} {code} pyspark \ --verbose \ --master yarn-client \ --conf spark.driver.port=$((RANDOM_PORT + 2)) \ --conf spark.broadcast.port=$((RANDOM_PORT + 3)) \ --conf spark.replClassServer.port=$((RANDOM_PORT + 4)) \ --conf spark.blockManager.port=$((RANDOM_PORT + 5)) \ --conf spark.executor.port=$((RANDOM_PORT + 6)) \ --conf spark.fileserver.port=$((RANDOM_PORT + 7)) \ --conf spark.shuffle.service.enabled=true \ --conf spark.dynamicAllocation.enabled=true \ --conf spark.dynamicAllocation.minExecutors=1 \ --conf spark.dynamicAllocation.maxExecutors=10 \ --conf spark.ui.port=$SPARK_UI_PORT {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2157) Can't write tight firewall rules for Spark
[ https://issues.apache.org/jira/browse/SPARK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14035554#comment-14035554 ] Egor Pahomov commented on SPARK-2157: - Yep, I used patched version of spark. Can't write tight firewall rules for Spark -- Key: SPARK-2157 URL: https://issues.apache.org/jira/browse/SPARK-2157 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Priority: Critical In order to run Spark in places with strict firewall rules, you need to be able to specify every port that's used between all parts of the stack. Per the [network activity section of the docs|http://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security] most of the ports are configurable, but there are a few ports that aren't configurable. We need to make every port configurable to a particular port, so that we can run Spark in highly locked-down environments. -- This message was sent by Atlassian JIRA (v6.2#6252)