unsubscribe

2020-04-22 Thread akram azarm
-- M Akram Azarm, B.Eng. in Software Engineering (Reading), UOW, UK / IIT, LK. Contact: 077-502-0402

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-22 Thread Wenchen Fan
This looks like a bug where the path filter doesn't apply to Hive table reads. Can you open a JIRA ticket? On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati wrote: > Just wondering if anyone could help me out on this. > > Thank you! > > > > > *Regards,Dhrubajyoti Hati.* > > > On Wed, Apr 22, 2020
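A minimal PySpark sketch of a possible workaround until that bug is addressed: bypass the Hive reader and load the partition files directly, since file-based sources already skip names starting with a dot or an underscore, and the pathGlobFilter option (a Spark 3.0 file-source option) can keep only the expected data files. The format and path below are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the files under the table location directly instead of via the catalog;
    # keep only *.parquet so stray .tmp files are never listed as data.
    df = (spark.read
          .format("parquet")                    # assumed storage format
          .option("pathGlobFilter", "*.parquet")
          .load("/warehouse/db.db/my_table"))   # hypothetical table location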

Re: is RocksDB backend available in 3.0 preview?

2020-04-22 Thread Jungtaek Lim
Sorry, I should have been clearer. The discussion reached the conclusion that the RocksDB state store cannot be included in the Spark main codebase - it should start as an individual project, and can be adopted once that project is popular enough. (See the PR for more details.) That's why I guided to the

Re: pyspark working with a different Python version than the cluster

2020-04-22 Thread Tang Jinxin
Hi Copon, the Python worker resolves python3 to pick its interpreter, and on some nodes that may resolve to Python 3.4. Could you check what python3 returns on each node? Best wishes, Jinxin xiaoxingstack Email: xiaoxingst...@gmail.com (signature customized by NetEase Mail Master) On 04/23/2020 01:02, Odon Copon wrote: Hi, Something is happening to me that I don't quite
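A quick probe of Jinxin's suggestion, assuming a live SparkContext named sc: run a trivial job and report every distinct interpreter version the workers actually use.

    versions = (sc.parallelize(range(100), 10)
                  .map(lambda _: __import__("sys").version)
                  .distinct()
                  .collect())
    print(versions)  # more than one entry means the nodes disagree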

Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throws java.io.EOFException: Premature EOF: no length prefix available

2020-04-22 Thread Tang Jinxin
Hi maqy, the exception is caused by a closed connection; one possible reason is a timeout on the datanode side, given that we haven't found a problem on the Spark side before the exception. So we could try to find more clues in the datanode log. Best wishes, Jinxin xiaoxingstack Email: xiaoxingst...@gmail.com (signature customized by NetEase Mail Master)
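If the datanode log does confirm a socket timeout, one speculative remedy is to raise the HDFS client timeouts through Spark's spark.hadoop.* passthrough. The property names come from the HDFS configuration docs; the values are illustrative only.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.hadoop.dfs.client.socket-timeout", "120000")          # read timeout, ms
             .config("spark.hadoop.dfs.datanode.socket.write.timeout", "120000")  # write timeout, ms
             .getOrCreate())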

Re: Can I collect Dataset[Row] to driver without converting it to Array[Row]?

2020-04-22 Thread Tang Jinxin
Hi maqy, thanks for your question. After some consideration, I have a few ideas: first, try not to collect to the driver if it isn't necessary; instead, send the data from the executors (using foreachPartition). Second, if you aren't already using a high-performance serializer like Kryo, we could give it a try. As a summary, I
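A rough PySpark sketch of the first suggestion: each executor ships its own partitions, so nothing funnels through the driver. The host, port, and newline-delimited JSON protocol are hypothetical stand-ins for whatever the receiving side actually listens on. (The Kryo suggestion applies to the JVM side via spark.serializer and has no Python equivalent here.)

    def send_partition(rows):
        # imports inside the function so they resolve on the executors
        import json, socket
        with socket.create_connection(("tf-host.example.com", 9999)) as sock:  # assumed endpoint
            for row in rows:
                sock.sendall((json.dumps(row.asDict()) + "\n").encode("utf-8"))

    df.foreachPartition(send_partition)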

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-22 Thread Dhrubajyoti Hati
Just wondering if anyone could help me out on this. Thank you! *Regards,Dhrubajyoti Hati.* On Wed, Apr 22, 2020 at 7:15 PM Dhrubajyoti Hati wrote: > Hi, > > Is there any way to discard files starting with a dot (.) or ending with .tmp > in the hive partition while reading from Hive table

pyspark working with a different Python version than the cluster

2020-04-22 Thread Odon Copon
Hi, Something is happening to me that I don't quite understand. I ran pyspark on a machine that has Python 3.5, where I managed to run some commands, even though the Spark cluster is using Python 3.4. If I do the same with spark-submit I get the "Python in worker has different version 3.4 than that in
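That error means the driver and worker interpreters must match on minor version. A minimal sketch of one fix: pin both sides to an interpreter that exists on every node, before the session starts (the path below is an assumption). With spark-submit, the spark.pyspark.python and spark.pyspark.driver.python configs set the same thing.

    import os

    # Must be set before the SparkContext/SparkSession is created.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.4"         # assumed cluster-wide path
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.4"  # match the workers

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()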

Spark Adaptive configuration

2020-04-22 Thread Tzahi File
Hi, I saw that Spark has an option to adapt the join and shuffle configuration. For example: "spark.sql.adaptive.shuffle.targetPostShuffleInputSize". I wanted to know if you have any experience with such configuration and how it changed performance. Another question is whether along Spark SQL
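For reference, a hedged sketch of turning adaptive execution on with the Spark 2.x property named above (Spark 3.0 renames the shuffle knob); the 128 MB target is illustrative, not a recommendation.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.adaptive.enabled", "true")
             .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", str(128 * 1024 * 1024))
             .getOrCreate())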

Re: is RocksDB backend available in 3.0 preview?

2020-04-22 Thread kant kodali
Is it going to make it into 3.0? On Tue, Apr 21, 2020 at 9:24 PM Jungtaek Lim wrote: > Unfortunately, the short answer is no. Please refer to the last part of > the discussion on the PR https://github.com/apache/spark/pull/24922 > > Unless we get any native implementation of this, I guess this project

Re: Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throws java.io.EOFException: Premature EOF: no length prefix available

2020-04-22 Thread maqy
Hi Jinxin, the Spark web UI shows that all tasks completed successfully; this error appears in the shell: java.io.EOFException: Premature EOF: no length prefix available at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244) at

Re: Can I collect Dataset[Row] to driver without converting it to Array[Row]?

2020-04-22 Thread maqy
Hi Andrew, thank you for your reply. I am using the Scala API of Spark, and the TensorFlow machine is not in the Spark cluster. Is this JIRA / PR still valid in this situation? In addition, the current bottleneck of the application is that the amount of data transferred through the

Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throws java.io.EOFException: Premature EOF: no length prefix available

2020-04-22 Thread Tang Jinxin
Maybe the datanode stopped the data transfer due to a timeout. Could you please provide the exception stack? xiaoxingstack Email: xiaoxingst...@gmail.com (signature customized by NetEase Mail Master) On 2020-04-22 19:53, maqy wrote: Today I met the same problem using rdd.collect(); the format of the rdd is Tuple2[Int, Int]. And this problem will

Error while reading hive tables with tmp/hidden files inside partitions

2020-04-22 Thread Dhrubajyoti Hati
Hi, Is there any way to discard files starting with a dot (.) or ending with .tmp in the Hive partitions while reading from a Hive table using the spark.read.table method? I tried using PathFilters, but they didn't work. I am using spark-submit and passing my Python file (PySpark) containing the source
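One possible stopgap in PySpark, assuming the temporary files can still be parsed at all: tag each row with its source file and drop rows that came from dot-files or .tmp files. The table name is a placeholder, and if the bad files fail outright at read time this won't help.

    from pyspark.sql import functions as F

    df = (spark.read.table("db.table")
          .withColumn("src", F.input_file_name())
          .filter(~F.col("src").rlike(r"(/\.[^/]+$)|(\.tmp$)"))  # hidden or .tmp source files
          .drop("src"))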

Re: Can I collect Dataset[Row] to driver without converting it to Array[Row]?

2020-04-22 Thread Tang Jinxin
Maybe you could try something like foreachPartition inside foreachRDD, which avoids gathering everything to the driver and the extra cost that brings; see the sketch below. xiaoxingstack Email: xiaoxingst...@gmail.com (signature customized by NetEase Mail Master) On 04/22/2020 21:02, Andrew Melo wrote: Hi Maqy On Wed, Apr 22, 2020 at 3:24 AM maqy <454618...@qq.com> wrote: > > I
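A compact sketch of that pattern for a hypothetical DStream named stream: every partition is processed on its executor, and nothing is collected to the driver.

    def handle_partition(records):
        for rec in records:
            print(rec)  # stand-in for real per-record work on the executor

    stream.foreachRDD(lambda rdd: rdd.foreachPartition(handle_partition))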

Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

2020-04-22 Thread Andrew Melo
Hi Maqy On Wed, Apr 22, 2020 at 3:24 AM maqy <454618...@qq.com> wrote: > > I will traverse this Dataset to convert it to Arrow and send it to Tensorflow > through Socket. (I presume you're using the Python TensorFlow API; if you're not, please ignore.) There is a JIRA/PR ([1] [2]) which
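Not the JIRA/PR Andrew references (its numbers are elided above); just a hedged illustration of the Arrow-backed transfer that already exists in Spark 2.4 for the Python side, under its 2.x config name.

    # With Arrow enabled, toPandas() moves data as Arrow record batches
    # instead of pickled Rows, which is much cheaper for wide transfers.
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    pdf = df.toPandas()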

Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throws java.io.EOFException: Premature EOF: no length prefix available

2020-04-22 Thread maqy
Today I met the same problem using rdd.collect(); the format of the rdd is Tuple2[Int, Int]. And this problem appears when the amount of data reaches about 100 GB. I guess there may be something wrong with deserialization. Has anyone else encountered this problem? Best regards, maqy

Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-22 Thread ZHANG Wei
The performance issue might be caused by the Parquet table's partition count, only 3. The reader used that partition count to parallelize extraction. Refer to the log you provided: > spark.sql("select * from db.table limit 100").explain(false) > == Physical Plan == > CollectLimit 100 >
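A quick way to confirm the scan parallelism ZHANG Wei points at, with the table name as a placeholder:

    df = spark.sql("select * from db.table")
    print(df.rdd.getNumPartitions())  # 3 in the report above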

[Structured Streaming] Connecting to Kafka via a Custom Consumer / Producer

2020-04-22 Thread Patrick McGloin
Hi, The large international bank I work for has a custom Kafka implementation. The client libraries that are used to connect to Kafka have extra security steps. They implement the Kafka Consumer and Producer interfaces in this client library so once we use it to connect to Kafka, we can treat

Re: Can I collect Dataset[Row] to driver without converting it to Array[Row]?

2020-04-22 Thread maqy
I will traverse this Dataset to convert it to Arrow and send it to TensorFlow through a socket. I tried to use toLocalIterator() to traverse the dataset instead of collecting to the driver, but toLocalIterator() creates a lot of jobs and adds a lot of time overhead. Best regards, maqy
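A small sketch of the trade-off maqy describes: toLocalIterator() fetches one partition at a time, so driver memory stays flat but Spark schedules roughly one job per partition, while collect() is a single job that materializes every Row in the driver at once. feed_to_tensorflow is a hypothetical downstream consumer.

    for row in df.toLocalIterator():
        feed_to_tensorflow(row)   # low memory, one job per partition

    rows = df.collect()           # one job, full dataset resident in the driver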

Re: Can I collect Dataset[Row] to driver without converting it to Array[Row]?

2020-04-22 Thread Michael Artz
What would you do with it once you get it into the driver as a Dataset[Row]? Sent from my iPhone > On Apr 22, 2020, at 3:06 AM, maqy <454618...@qq.com> wrote: > >  > When the data is stored in the Dataset [Row] format, the memory usage is very > small. > When I use collect () to collect data to

Can I collect Dataset[Row] to driver without converting it to Array[Row]?

2020-04-22 Thread maqy
When the data is stored in the Dataset[Row] format, the memory usage is very small. When I use collect() to bring the data to the driver, each line of the dataset is converted to a Row and stored in an array, which brings a large memory overhead. So, can I collect Dataset[Row] to

Re: Using startingOffsets latest - no data from structured streaming kafka query

2020-04-22 Thread Ruijing Li
For some reason, after restarting the app and trying again, latest now works as expected. Not sure why it didn’t work before. On Tue, Apr 21, 2020 at 1:46 PM Ruijing Li wrote: > Yes, we did. But for some reason latest does not show them. The count is > always 0. > > On Sun, Apr 19, 2020 at 3:42
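For reference, the option under discussion, with placeholder broker and topic. Per the Kafka source docs, startingOffsets only applies when a query first starts; on restart with an existing checkpoint, the query resumes from the checkpointed offsets regardless of this setting, which may explain the confusing behavior.

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
          .option("subscribe", "some-topic")                 # placeholder
          .option("startingOffsets", "latest")
          .load())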

Re: Deadlock using Barrier Execution

2020-04-22 Thread wuyi
Hi csmith, Sorry to be late here. We only realized this bug recently. Here's the fix: https://github.com/apache/spark/pull/28257. And I believe we're going to backport it to 2.4.x. Best, Yi Wu -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/