Unsubscribe

2023-02-22 Thread Tang Jinxin
Unsubscribe


(No subject)

2021-05-06 Thread Tang Jinxin
unsubscribe
xiaoxingstack Email: xiaoxingst...@gmail.com Signature customized by NetEase Mail Master

Re: pyspark working with a different Python version than the cluster

2020-04-22 Thread Tang Jinxin
Hi Copon,
The worker-side Python is resolved with python3, which may point to Python 3.4 on some nodes. Could you check what python3 reports on each node?
Best wishes,
Jinxin
xiaoxingstack Email: xiaoxingst...@gmail.com Signature customized by NetEase Mail Master

On 04/23/2020 01:02, Odon Copon wrote:
Hi,
Something is happening to me that I don't quite understand. I ran pyspark on a machine that has Python 3.5 and managed to run some commands, even though the Spark cluster is using Python 3.4. If I do the same with spark-submit I get the "Python in worker has different version 3.4 than that in driver 3.5" error. Why does pyspark work then? Thanks. Regards
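
A minimal Scala sketch of the "check python3 on every node" suggestion above, assuming an existing SparkContext named sc and that python3 is on each executor's PATH (both assumptions, not from the thread). It runs a throwaway job that shells out to "python3 --version" on every worker host and prints one line per host:

import scala.sys.process._

val pythonVersions = sc
  .parallelize(1 to 200, 200)              // enough small partitions to land on every node
  .mapPartitions { _ =>
    val host = java.net.InetAddress.getLocalHost.getHostName
    // Python 3.4+ prints its version to stdout, so !! captures e.g. "Python 3.4.10"
    val version = Seq("python3", "--version").!!.trim
    Iterator((host, version))
  }
  .distinct()
  .collect()

pythonVersions.foreach { case (host, version) => println(s"$host -> $version") }

If driver and workers disagree (3.5 vs 3.4 here), aligning PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON on both sides is the usual way to make spark-submit and the interactive shell behave the same.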

Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

2020-04-22 Thread Tang Jinxin
Hi maqy,
The exception is caused by the connection being closed; one possible reason is a timeout on the datanode side, since we have not found a problem in Spark before the exception. So we could try to find more clues in the datanode log.
Best wishes,
Jinxin
xiaoxingstack Email: xiaoxingst...@gmail.com Signature customized by NetEase Mail Master

On 2020-04-22 23:40, maqy wrote:
Hi Jinxin,
The Spark web UI shows that all tasks completed successfully; this error appears in the shell:
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStream$ResponseProcessor.run(DFSOutputStream.java:733)
More information can be seen here:
https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e
I speculate that there is a problem with deserialization, because after the web UI shows that the collect() tasks are completed, the memory occupied by the "spark-submit" process keeps increasing. After a few minutes the memory usage stops increasing, and a few minutes after that the shell reports this error.
Best regards,
maqy

From: Tang Jinxin
Sent: 2020-04-22 23:16
To: maqy
Cc: user@spark.apache.org
Subject: Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

Maybe the datanode stopped the data transfer due to a timeout. Could you please provide the exception stack?
xiaoxingstack Email: xiaoxingst...@gmail.com Signature customized by NetEase Mail Master

On 2020-04-22 19:53, maqy wrote:
Today I met the same problem using rdd.collect(); the rdd holds Tuple2[Int, Int] records. The problem appears when the amount of data reaches about 100 GB. I guess there may be something wrong with deserialization. Has anyone else encountered this problem?
Best regards,
maqy

From: maqy1...@outlook.com
Sent: 2020-04-20 10:33
To: user@spark.apache.org
Subject: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

Hi all,
I get a Dataset[Row] through the following code:
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
After that I want to collect it to the driver:
val df_rows: Array[Row] = df.collect()
The Spark web UI shows that all tasks have run successfully, but the application does not stop. After more than ten minutes, an error is generated in the shell:
java.io.EOFException: Premature EOF: no length prefix available
Environment:
Spark 2.4.3
Hadoop 2.7.7
Total rows of data: about 800,000,000 (12 GB)
More detailed information can be seen here:
https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e
Does anyone know the reason?
Best regards,
maqy
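
A hedged Scala sketch of the two directions raised in this thread: the suspected datanode-side timeout and the sheer size of the collect(). The HDFS timeout keys, the 16g result-size limit, and the placeholder two-int schema are assumptions to experiment with, not a confirmed fix for this failure:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("collect-large-dataset")
  .config("spark.driver.maxResultSize", "16g")   // a ~12 GB collect() needs explicit headroom (assumed value)
  .getOrCreate()

// Standard Hadoop client settings, in milliseconds; the values are guesses to tune,
// and the datanode log should still be checked as suggested above.
spark.sparkContext.hadoopConfiguration.set("dfs.client.socket-timeout", "120000")
spark.sparkContext.hadoopConfiguration.set("dfs.datanode.socket.write.timeout", "120000")

// Placeholder schema with two int columns, matching the Tuple2[Int, Int] mentioned in the thread.
val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))

val df = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
val rows = df.collect()   // still materializes everything on the driver; the next thread discusses alternatives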

Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

2020-04-22 Thread Tang Jinxin
Hi maqy,
Thanks for your question. After some consideration, I have two suggestions: first, try not to collect to the driver if it is not necessary; instead, send the data from the executors (for example with foreachPartition). Second, if you are not already using a high-performance serializer such as Kryo, it is worth a try. In summary, I recommend the first point (design a more efficient path when the data is this large).
Best wishes,
littlestar
xiaoxingstack Email: xiaoxingst...@gmail.com Signature customized by NetEase Mail Master

On 2020-04-22 23:24, maqy wrote:
Hi Andrew,
Thank you for your reply. I am using the Scala API of Spark, and the TensorFlow machine is not in the Spark cluster. Is this JIRA / PR still valid in this situation? In addition, the current bottleneck of the application is that the amount of data transferred through the network (using collect()) is too large, and the deserialization seems to take some time.
Best wishes,
maqy

From: Andrew Melo
Sent: 2020-04-22 21:02
To: maqy
Cc: Michael Artz; user@spark.apache.org
Subject: Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

On Wed, Apr 22, 2020 at 3:24 AM maqy <454618...@qq.com> wrote:
> I will traverse this Dataset to convert it to Arrow and send it to Tensorflow through Socket.

(I presume you're using the python tensorflow API; if you're not, please ignore.)

There is a JIRA/PR ([1] [2]) which proposes to add native support for Arrow serialization to python. Under the hood, Spark is already serializing into Arrow format to transmit to python, it's just additionally doing an unconditional conversion to pandas once it reaches the python runner -- which is good if you're using pandas, not so great if pandas isn't what you operate on. The JIRA above would let you receive the arrow buffers (that already exist) directly.

Cheers,
Andrew
[1] https://issues.apache.org/jira/browse/SPARK-30153
[2] https://github.com/apache/spark/pull/26783

> I tried to use toLocalIterator() to traverse the dataset instead of collect to the driver, but toLocalIterator() will create a lot of jobs and will bring a lot of time consumption.
>
> Best regards,
> maqy
>
> From: Michael Artz
> Sent: 2020-04-22 16:09
> To: maqy
> Cc: user@spark.apache.org
> Subject: Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?
>
> What would you do with it once you get it into the driver in a Dataset[Row]?
> Sent from my iPhone
>
> On Apr 22, 2020, at 3:06 AM, maqy <454618...@qq.com> wrote:
>
> When the data is stored in the Dataset[Row] format, the memory usage is very small.
> When I use collect() to collect data to the driver, each line of the dataset will be converted to Row and stored in an array, which will bring great memory overhead.
> So, can I collect Dataset[Row] to driver and keep its data format?
>
> Best regards,
> maqy

To unsubscribe e-mail: user-unsubscr...@spark.apache.org
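
A minimal Scala sketch of the first suggestion above: send the data from the executors with foreachPartition instead of collect(). The tf-server host, port 9999, the stand-in DataFrame, and the one-CSV-line-per-row wire format are illustrative assumptions; a real job might stream Arrow batches instead, as discussed in SPARK-30153:

import java.io.PrintWriter
import java.net.Socket
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("send-to-tf").getOrCreate()
val df = spark.range(0, 1000).selectExpr("id", "id * 2 AS value")   // stand-in for the real Dataset[Row]

val tfHost = "tf-server"   // hypothetical TensorFlow endpoint outside the cluster
val tfPort = 9999          // hypothetical port

df.foreachPartition { (rows: Iterator[Row]) =>
  // Runs on the executor that owns the partition, so nothing is funnelled through the driver.
  val socket = new Socket(tfHost, tfPort)
  val out = new PrintWriter(socket.getOutputStream, true)
  try {
    rows.foreach(r => out.println(r.mkString(",")))   // naive CSV framing; replace with the real protocol
  } finally {
    out.close()
    socket.close()
  }
}

Because each partition opens its own connection on its executor, the full dataset never has to fit in driver memory, which is the point of the recommendation.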

Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

2020-04-22 Thread Tang Jinxin
Maybe the datanode stopped the data transfer due to a timeout. Could you please provide the exception stack?
xiaoxingstack Email: xiaoxingst...@gmail.com Signature customized by NetEase Mail Master

On 2020-04-22 19:53, maqy wrote:
Today I met the same problem using rdd.collect(); the rdd holds Tuple2[Int, Int] records. The problem appears when the amount of data reaches about 100 GB. I guess there may be something wrong with deserialization. Has anyone else encountered this problem?
Best regards,
maqy

From: maqy1...@outlook.com
Sent: 2020-04-20 10:33
To: user@spark.apache.org
Subject: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

Hi all,
I get a Dataset[Row] through the following code:
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
After that I want to collect it to the driver:
val df_rows: Array[Row] = df.collect()
The Spark web UI shows that all tasks have run successfully, but the application does not stop. After more than ten minutes, an error is generated in the shell:
java.io.EOFException: Premature EOF: no length prefix available
Environment:
Spark 2.4.3
Hadoop 2.7.7
Total rows of data: about 800,000,000 (12 GB)
More detailed information can be seen here:
https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e
Does anyone know the reason?
Best regards,
maqy

Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

2020-04-22 Thread Tang Jinxin
Maybe you could try something like foreachPartition (as one would inside foreachRDD), which avoids gathering everything on the driver and the extra cost that brings.
xiaoxingstack Email: xiaoxingst...@gmail.com Signature customized by NetEase Mail Master

On 04/22/2020 21:02, Andrew Melo wrote:
Hi Maqy

On Wed, Apr 22, 2020 at 3:24 AM maqy <454618...@qq.com> wrote:
> I will traverse this Dataset to convert it to Arrow and send it to Tensorflow through Socket.

(I presume you're using the python tensorflow API; if you're not, please ignore.)

There is a JIRA/PR ([1] [2]) which proposes to add native support for Arrow serialization to python. Under the hood, Spark is already serializing into Arrow format to transmit to python, it's just additionally doing an unconditional conversion to pandas once it reaches the python runner -- which is good if you're using pandas, not so great if pandas isn't what you operate on. The JIRA above would let you receive the arrow buffers (that already exist) directly.

Cheers,
Andrew
[1] https://issues.apache.org/jira/browse/SPARK-30153
[2] https://github.com/apache/spark/pull/26783

> I tried to use toLocalIterator() to traverse the dataset instead of collect to the driver, but toLocalIterator() will create a lot of jobs and will bring a lot of time consumption.
>
> Best regards,
> maqy
>
> From: Michael Artz
> Sent: 2020-04-22 16:09
> To: maqy
> Cc: user@spark.apache.org
> Subject: Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?
>
> What would you do with it once you get it into the driver in a Dataset[Row]?
> Sent from my iPhone
>
> On Apr 22, 2020, at 3:06 AM, maqy <454618...@qq.com> wrote:
>
> When the data is stored in the Dataset[Row] format, the memory usage is very small.
> When I use collect() to collect data to the driver, each line of the dataset will be converted to Row and stored in an array, which will bring great memory overhead.
> So, can I collect Dataset[Row] to driver and keep its data format?
>
> Best regards,
> maqy

To unsubscribe e-mail: user-unsubscr...@spark.apache.org
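
For completeness, a small Scala sketch of the toLocalIterator() trade-off maqy mentions above: it keeps only one partition's rows on the driver at a time, at the cost of roughly one job per partition, which is the slowdown maqy observed. The stand-in DataFrame and the Kryo setting (following the serializer suggestion earlier in this digest) are assumptions:

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder()
  .appName("tolocaliterator-tradeoff")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // optional, per the earlier suggestion
  .getOrCreate()

val df = spark.range(0, 1000000).selectExpr("id", "id * 2 AS value")   // stand-in for the real Dataset[Row]

// toLocalIterator() fetches partitions lazily, so only one partition is resident on the driver
// at a time, but advancing into each new partition schedules another job.
val it: java.util.Iterator[Row] = df.toLocalIterator()
var count = 0L
while (it.hasNext) {
  val row = it.next()
  count += 1   // replace with the real per-row work, e.g. forwarding `row` to TensorFlow
}
println(s"traversed $count rows")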