Re: PySpark error java.lang.IllegalArgumentException

2023-07-10 Thread elango vaidyanathan
Finally, I was able to solve this issue by setting this conf:
"spark.driver.extraJavaOptions=-Dorg.xerial.snappy.tempdir=/my_user/temp_folder"

Thanks all!
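
For reference, a minimal sketch of one way to apply that conf when building the
session from PySpark, reusing the paths from this thread. Per the Spark docs,
driver-side extra Java options are picked up most reliably when passed on the
spark-submit command line (--conf or --driver-java-options) or via
spark-defaults.conf, since in client mode the driver JVM may already be running
before builder settings are applied:

from pyspark.sql import SparkSession

# Point snappy's native-library temp directory at a writable location.
spark = (SparkSession.builder
         .appName("testPyspark")
         .config("spark.driver.extraJavaOptions",
                 "-Dorg.xerial.snappy.tempdir=/my_user/temp_folder")
         .getOrCreate())

df = spark.read.parquet("/data/202301/account_cycle")
df.show()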



On Sat, 8 Jul 2023 at 3:45 AM, Brian Huynh  wrote:

> Hi Khalid,
>
> Elango mentioned the file is working fine in another environment of ours with
> the same driver and executor memory.
>
> Brian
>
> On Jul 7, 2023, at 10:18 AM, Khalid Mammadov wrote:
>
> Perhaps that parquet file is corrupted, or a corrupted file got into that
> folder? To check, try to read that file with pandas or other tools to see if
> you can read it without Spark.

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Brian Huynh
Hi Khalid,

Elango mentioned the file is working fine in another environment of ours with the
same driver and executor memory.

Brian

On Jul 7, 2023, at 10:18 AM, Khalid Mammadov wrote:

Perhaps that parquet file is corrupted, or a corrupted file got into that folder?
To check, try to read that file with pandas or other tools to see if you can read
it without Spark.

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Khalid Mammadov
Perhaps that parquet file is corrupted, or a corrupted file got into that folder?
To check, try to read that file with pandas or other tools to see if you
can read it without Spark.
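
For example, a quick check along those lines (a sketch assuming pandas with a
parquet engine such as pyarrow is available; the path is the file the failing
task was scanning in the log at the end of this thread):

import pandas as pd

# If pandas/pyarrow also fails to read or decompress this file, the file itself
# is the likely culprit rather than Spark.
df = pd.read_parquet("/data/202301/account_cycle/account_cycle-202301-53.parquet")
print(df.shape)
print(df.head())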

On Wed, 5 Jul 2023, 07:25 elango vaidyanathan,  wrote:

>
> Hi team,
>
> Any updates on the issue below?

Re: PySpark error java.lang.IllegalArgumentException

2023-07-05 Thread elango vaidyanathan
Hi team,

Any updates on the issue below?

On Mon, 3 Jul 2023 at 6:18 PM, elango vaidyanathan 
wrote:

>
>
> Hi all,
>
> I am reading a parquet file like this and it gives
> java.lang.IllegalArgumentException.
> However, I can work with other parquet files (such as the NYC taxi parquet
> files) without any issue. I have copied the full error log as well. Can you
> please check and let me know how to fix this?

PySpark error java.lang.IllegalArgumentException

2023-07-03 Thread elango vaidyanathan
Hi all,

I am reading a parquet file like this and it gives
java.lang.IllegalArgumentException.
However, I can work with other parquet files (such as the NYC taxi parquet
files) without any issue. I have copied the full error log as well. Can you
please check and let me know how to fix this?

import pyspark

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("testPyspark")
         .config("spark.executor.memory", "20g")
         .config("spark.driver.memory", "50g")
         .getOrCreate())

df = spark.read.parquet("/data/202301/account_cycle")

df.printSchema()  # works fine

df.count()  # works fine

df.show()  # getting the error below

>>> df.show()

23/07/03 18:07:20 INFO FileSourceStrategy: Pushed Filters:

23/07/03 18:07:20 INFO FileSourceStrategy: Post-Scan Filters:

23/07/03 18:07:20 INFO FileSourceStrategy: Output Data Schema:
struct

23/07/03 18:07:20 INFO MemoryStore: Block broadcast_19 stored as values in
memory (estimated size 540.6 KiB, free 26.5 GiB)

23/07/03 18:07:20 INFO MemoryStore: Block broadcast_19_piece0 stored as
bytes in memory (estimated size 46.0 KiB, free 26.5 GiB)

23/07/03 18:07:20 INFO BlockManagerInfo: Added broadcast_19_piece0 in
memory on mynode:41055 (size: 46.0 KiB, free: 26.5 GiB)

23/07/03 18:07:20 INFO SparkContext: Created broadcast 19 from showString
at NativeMethodAccessorImpl.java:0

23/07/03 18:07:20 INFO FileSourceScanExec: Planning scan with bin packing,
max size: 134217728 bytes, open cost is considered as scanning 4194304
bytes.

23/07/03 18:07:20 INFO SparkContext: Starting job: showString at
NativeMethodAccessorImpl.java:0

23/07/03 18:07:20 INFO DAGScheduler: Got job 13 (showString at
NativeMethodAccessorImpl.java:0) with 1 output partitions

23/07/03 18:07:20 INFO DAGScheduler: Final stage: ResultStage 14
(showString at NativeMethodAccessorImpl.java:0)

23/07/03 18:07:20 INFO DAGScheduler: Parents of final stage: List()

23/07/03 18:07:20 INFO DAGScheduler: Missing parents: List()

23/07/03 18:07:20 INFO DAGScheduler: Submitting ResultStage 14
(MapPartitionsRDD[42] at showString at NativeMethodAccessorImpl.java:0),
which has no missing parents

23/07/03 18:07:20 INFO MemoryStore: Block broadcast_20 stored as values in
memory (estimated size 38.1 KiB, free 26.5 GiB)

23/07/03 18:07:20 INFO MemoryStore: Block broadcast_20_piece0 stored as
bytes in memory (estimated size 10.5 KiB, free 26.5 GiB)

23/07/03 18:07:20 INFO BlockManagerInfo: Added broadcast_20_piece0 in
memory on mynode:41055 (size: 10.5 KiB, free: 26.5 GiB)

23/07/03 18:07:20 INFO SparkContext: Created broadcast 20 from broadcast at
DAGScheduler.scala:1478

23/07/03 18:07:20 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 14 (MapPartitionsRDD[42] at showString at
NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions
Vector(0))

23/07/03 18:07:20 INFO TaskSchedulerImpl: Adding task set 14.0 with 1 tasks
resource profile 0

23/07/03 18:07:20 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID
48) (mynode, executor driver, partition 0, PROCESS_LOCAL, 4890 bytes)
taskResourceAssignments Map()

23/07/03 18:07:20 INFO Executor: Running task 0.0 in stage 14.0 (TID 48)

23/07/03 18:07:20 INFO FileScanRDD: Reading File path:
file:///data/202301/account_cycle/account_cycle-202301-53.parquet, range:
0-134217728, partition values: [empty row]

23/07/03 18:07:20 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID
48)

java.lang.IllegalArgumentException

at java.nio.Buffer.limit(Buffer.java:275)

at org.xerial.snappy.Snappy.uncompress(Snappy.java:553)

at
org.apache.parquet.hadoop.codec.SnappyDecompressor.decompress(SnappyDecompressor.java:71)

at
org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:51)

at java.io.DataInputStream.readFully(DataInputStream.java:195)

at java.io.DataInputStream.readFully(DataInputStream.java:169)

at
org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:286)

at
org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)

at
org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)

at
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.(PlainValuesDictionary.java:154)

at
org.apache.parquet.column.Encoding$1.initDictionary(Encoding.java:96)

at
org.apache.parquet.column.Encoding$5.initDictionary(Encoding.java:163)

at
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.(VectorizedColumnReader.java:114)

at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:352)

at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:293)

at

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
So I think now that my problem is Spark-related after all. It looks like my
bootstrap script installs SciPy just fine in a regular environment, but
somehow interaction with PySpark breaks it.
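
One way to see that interaction (a diagnostic sketch, assuming an active
SparkContext named sc on the cluster) is to compare the NumPy/SciPy versions the
driver imports with what the executors import:

import numpy, scipy
print("driver:", numpy.__version__, scipy.__version__)

def versions(_):
    # Runs in the executors' Python, which may resolve different packages
    # than the driver's Python does.
    import numpy, scipy
    yield (numpy.__version__, scipy.__version__)

print("executors:",
      sc.parallelize(range(4), 4).mapPartitions(versions).distinct().collect())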

On Fri, Jan 6, 2023 at 12:39 PM Bjørn Jørgensen 
wrote:

> Create a Dockerfile
>
> FROM fedora
>
> RUN sudo yum install -y python3-devel
> RUN sudo pip3 install -U Cython && \
> sudo pip3 install -U pybind11 && \
> sudo pip3 install -U pythran && \
> sudo pip3 install -U numpy && \
> sudo pip3 install -U scipy

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
Create a Dockerfile

FROM fedora

RUN sudo yum install -y python3-devel
RUN sudo pip3 install -U Cython && \
sudo pip3 install -U pybind11 && \
sudo pip3 install -U pythran && \
sudo pip3 install -U numpy && \
sudo pip3 install -U scipy






docker build --pull --rm -f "Dockerfile" -t fedoratest:latest "."

Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM fedora
latest: Pulling from library/fedora
Digest:
sha256:3487c98481d1bba7e769cf7bcecd6343c2d383fdd6bed34ec541b6b23ef07664
Status: Image is up to date for fedora:latest
 ---> 95b7a2603d3a
Step 2/3 : RUN sudo yum install -y python3-devel
 ---> Running in a7c648ae7014
Fedora 37 - x86_64  7.5 MB/s |  64 MB 00:08

Fedora 37 openh264 (From Cisco) - x86_64418  B/s | 2.5 kB 00:06

Fedora Modular 37 - x86_64  471 kB/s | 3.0 MB 00:06

Fedora 37 - x86_64 - Updates3.0 MB/s |  20 MB 00:06

Fedora Modular 37 - x86_64 - Updates179 kB/s | 1.1 MB 00:06

Last metadata expiration check: 0:00:01 ago on Fri Jan  6 17:37:59 2023.
Dependencies resolved.

 Package              Architecture  Version        Repository  Size
Installing:
 python3-devel        x86_64        3.11.1-1.fc37  updates     269 k
Upgrading:
 python3              x86_64        3.11.1-1.fc37  updates      27 k
 python3-libs         x86_64        3.11.1-1.fc37  updates     9.6 M
Installing dependencies:
 libpkgconf           x86_64        1.8.0-3.fc37   fedora       36 k
 pkgconf              x86_64        1.8.0-3.fc37   fedora       41 k
 pkgconf-m4           noarch        1.8.0-3.fc37   fedora       14 k
 pkgconf-pkg-config   x86_64        1.8.0-3.fc37   fedora       10 k
Installing weak dependencies:
 python3-pip          noarch        22.2.2-3.fc37  updates     3.1 M
 python3-setuptools   noarch        62.6.0-2.fc37  fedora      1.6 M

Transaction Summary

Install  7 Packages
Upgrade  2 Packages

Total download size: 15 M
Downloading Packages:
(1/9): pkgconf-m4-1.8.0-3.fc37.noarch.rpm   2.9 kB/s |  14 kB 00:05

(2/9): libpkgconf-1.8.0-3.fc37.x86_64.rpm   7.1 kB/s |  36 kB 00:05

(3/9): pkgconf-1.8.0-3.fc37.x86_64.rpm  8.2 kB/s |  41 kB 00:05

(4/9): pkgconf-pkg-config-1.8.0-3.fc37.x86_64.r 143 kB/s |  10 kB 00:00

(5/9): python3-devel-3.11.1-1.fc37.x86_64.rpm   458 kB/s | 269 kB 00:00

(6/9): python3-3.11.1-1.fc37.x86_64.rpm 442 kB/s |  27 kB 00:00

(7/9): python3-setuptools-62.6.0-2.fc37.noarch. 2.1 MB/s | 1.6 MB 00:00

(8/9): python3-pip-22.2.2-3.fc37.noarch.rpm 4.0 MB/s | 3.1 MB 00:00

(9/9): python3-libs-3.11.1-1.fc37.x86_64.rpm7.2 MB/s | 9.6 MB 00:01


Total   1.8 MB/s |  15 MB 00:08

Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing:
 1/1
  Upgrading: python3-libs-3.11.1-1.fc37.x86_64
1/11
  Upgrading: python3-3.11.1-1.fc37.x86_64
 2/11
  Installing   : python3-setuptools-62.6.0-2.fc37.noarch
3/11
  Installing   : python3-pip-22.2.2-3.fc37.noarch
 4/11
  Installing   : pkgconf-m4-1.8.0-3.fc37.noarch
 5/11
  Installing   : libpkgconf-1.8.0-3.fc37.x86_64
 6/11
  Installing   : pkgconf-1.8.0-3.fc37.x86_64
7/11
  Installing   : pkgconf-pkg-config-1.8.0-3.fc37.x86_64
 8/11
  Installing   : python3-devel-3.11.1-1.fc37.x86_64
 9/11
  Cleanup  : python3-3.11.0-1.fc37.x86_64
10/11
  Cleanup  : python3-libs-3.11.0-1.fc37.x86_64
 11/11
  Running scriptlet: python3-libs-3.11.0-1.fc37.x86_64
 11/11
  Verifying: libpkgconf-1.8.0-3.fc37.x86_64
 1/11
  Verifying: pkgconf-1.8.0-3.fc37.x86_64
2/11
  Verifying: pkgconf-m4-1.8.0-3.fc37.noarch
 3/11
  Verifying: pkgconf-pkg-config-1.8.0-3.fc37.x86_64
 4/11
  Verifying: python3-setuptools-62.6.0-2.fc37.noarch
5/11
  Verifying: python3-devel-3.11.1-1.fc37.x86_64
 6/11
  Verifying: python3-pip-22.2.2-3.fc37.noarch
 7/11
  Verifying: python3-3.11.1-1.fc37.x86_64
 8/11
  Verifying: python3-3.11.0-1.fc37.x86_64
 9/11
  Verifying: python3-libs-3.11.1-1.fc37.x86_64
 10/11
  Verifying: python3-libs-3.11.0-1.fc37.x86_64
 11/11

Upgraded:
  python3-3.11.1-1.fc37.x86_64 python3-libs-3.11.1-1.fc37.x86_64

Installed:
  libpkgconf-1.8.0-3.fc37.x86_64

  pkgconf-1.8.0-3.fc37.x86_64

  pkgconf-m4-1.8.0-3.fc37.noarch

  pkgconf-pkg-config-1.8.0-3.fc37.x86_64

  python3-devel-3.11.1-1.fc37.x86_64

  python3-pip-22.2.2-3.fc37.noarch

  

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Mich Talebzadeh
https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp







On Fri, 6 Jan 2023 at 17:13, Oliver Ruebenacker 
wrote:

> Thank you for the link. I already tried most of what was suggested there,
> but without success.


Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
Thank you for the link. I already tried most of what was suggested there,
but without success.

On Fri, Jan 6, 2023 at 11:35 AM Bjørn Jørgensen 
wrote:

>
>
>
> https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp

--
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute



Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp




On Fri, 6 Jan 2023 at 16:01, Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   I'm trying to install SciPy using a bootstrap script and then use it to
> calculate a new field in a dataframe, running on AWS EMR.


[PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
 Hello,

  I'm trying to install SciPy using a bootstrap script and then use it to
calculate a new field in a dataframe, running on AWS EMR.

  Although the SciPy website states that only NumPy is needed, when I tried
to install SciPy using pip, pip kept failing, complaining about missing
software, until I ended up with this bootstrap script:






sudo yum install -y python3-devel
sudo pip3 install -U Cython
sudo pip3 install -U pybind11
sudo pip3 install -U pythran
sudo pip3 install -U numpy
sudo pip3 install -U scipy

  At this point, the bootstrap seems to be successful, but then at this
line:

from scipy.stats import norm

  I get the following error:

ValueError: numpy.ndarray size changed, may indicate binary
incompatibility. Expected 88 from C header, got 80 from PyObject

  Any advice on how to proceed? Thanks!

 Best, Oliver

--
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute



Pyspark error when converting string to timestamp in map function

2018-08-17 Thread Keith Chapman
Hi all,

I'm trying to create a dataframe enforcing a schema so that I can write it
to a parquet file. The schema has timestamps and I get an error with
pyspark. The following is a snippet of code that exhibits the problem,

df = sqlctx.range(1000)
schema = StructType([StructField('a', TimestampType(), True)])
df1 = sqlctx.createDataFrame(df.rdd.map(row_gen_func), schema)

row_gen_func is a function that returns timestamp strings of the form
"2018-03-21 11:09:44".

When I run this with Spark 2.2 I get the following error,

raise TypeError("%s can not accept object %r in type %s" % (dataType, obj,
type(obj)))
TypeError: TimestampType can not accept object '2018-03-21 08:06:17' in
type 

Regards,
Keith.

http://keith-chapman.com
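
For reference, the check behind that error is that TimestampType accepts Python
datetime objects rather than formatted strings, so one workaround is to parse the
string inside the map. A minimal sketch, reusing the sqlctx from the snippet above
and a hypothetical stand-in for row_gen_func:

from datetime import datetime
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField('a', TimestampType(), True)])

def row_gen_func(row):
    # Hypothetical: return a datetime (not the raw string) for the TimestampType column.
    return (datetime.strptime("2018-03-21 11:09:44", "%Y-%m-%d %H:%M:%S"),)

df = sqlctx.range(1000)
df1 = sqlctx.createDataFrame(df.rdd.map(row_gen_func), schema)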


Re: Pandas UDF for PySpark error. Big Dataset

2018-05-29 Thread Bryan Cutler
Can you share some of the code used, or at least the pandas_udf plus the
stack trace? Also, does decreasing your dataset size fix the OOM?

On Mon, May 28, 2018, 4:22 PM Traku traku  wrote:

> Hi.
>
> I'm trying to use the new feature, but I can't use it with a big dataset
> (about 5 million rows).
>
> I tried increasing executor memory, driver memory, and the partition number,
> but none of these solved the problem.
>
> One of the executor tasks keeps increasing its shuffle memory until it fails.
>
> The error is generated by Arrow: unable to expand the buffer.
>
> Any idea?
>


Pandas UDF for PySpark error. Big Dataset

2018-05-28 Thread Traku traku
Hi.

I'm trying to use the new feature, but I can't use it with a big dataset
(about 5 million rows).

I tried increasing executor memory, driver memory, and the partition number,
but none of these solved the problem.

One of the executor tasks keeps increasing its shuffle memory until it fails.

The error is generated by Arrow: unable to expand the buffer.

Any idea?
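
Not an answer to the root cause, but one knob that is sometimes worth trying with
Arrow buffer errors is the Arrow batch size used for pandas UDFs. A minimal sketch,
assuming Spark 2.3+ with pyarrow installed (the UDF and sizes here are only
illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, col

spark = SparkSession.builder.getOrCreate()

# Smaller Arrow batches mean smaller per-batch buffers on the executors
# (the default is 10000 records per batch).
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 2000)

@pandas_udf("double")
def plus_one(v):
    # v arrives as a pandas Series, one Arrow batch at a time.
    return v + 1

df = spark.range(5000000).withColumn("x", col("id").cast("double"))
df.select(plus_one("x")).show(5)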


Re: Pyspark Error: Unable to read a hive table with transactional property set as 'True'

2018-03-02 Thread ayan guha
Hi

Couple of questions:

1. It seems the error is due to number format:
Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0003024_"
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
... 75 more
Why do you think it is due to ACID?

2. You should not be creating a HiveContext again in the REPL; there is no need
for that. The REPL already reports: SparkContext available as sc, HiveContext
available as sqlContext (see the short example after this list).

3. Have you tried the same with spark 2.x?
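
For reference, a minimal sketch of point 2, reusing the query from the log below
and the sqlContext that the pyspark 1.6 shell already provides (no new HiveContext
is constructed):

# sqlContext is already a HiveContext in this shell; use it directly.
sqlContext.sql("select count(*) from load_etl.trpt_geo_defect_prod_dec07_del_blank").show()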



On Sat, Mar 3, 2018 at 5:00 AM, Debabrata Ghosh 
wrote:

> Hi All,
>    Greetings! I needed some help reading a Hive table
> via Pyspark for which the transactional property is set to 'True' (in other
> words, the ACID property is enabled). Following is the entire stack trace and
> the description of the Hive table. Would you please be able to help me resolve
> the error:
>

Pyspark Error: Unable to read a hive table with transactional property set as 'True'

2018-03-02 Thread Debabrata Ghosh
Hi All,
   Greetings! I needed some help reading a Hive table
via Pyspark for which the transactional property is set to 'True' (in other
words, the ACID property is enabled). Following is the entire stack trace and
the description of the Hive table. Would you please be able to help me resolve
the error:

18/03/01 11:06:22 INFO BlockManagerMaster: Registered BlockManager
18/03/01 11:06:22 INFO EventLoggingListener: Logging events to
hdfs:///spark-history/local-1519923982155
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.3
  /_/

Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> hive_context.sql("select count(*) from
load_etl.trpt_geo_defect_prod_dec07_del_blank").show()
18/03/01 11:09:45 INFO HiveContext: Initializing execution hive, version
1.2.1
18/03/01 11:09:45 INFO ClientWrapper: Inspected Hadoop version:
2.7.3.2.6.0.3-8
18/03/01 11:09:45 INFO ClientWrapper: Loaded
org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
2.7.3.2.6.0.3-8
18/03/01 11:09:46 INFO HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
18/03/01 11:09:46 INFO ObjectStore: ObjectStore, initialize called
18/03/01 11:09:46 INFO Persistence: Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
18/03/01 11:09:46 INFO Persistence: Property datanucleus.cache.level2
unknown - will be ignored
18/03/01 11:09:50 INFO ObjectStore: Setting MetaStore object pin classes
with
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
18/03/01 11:09:50 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
18/03/01 11:09:50 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
"embedded-only" so does not have its own datastore table.
18/03/01 11:09:53 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
18/03/01 11:09:53 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
"embedded-only" so does not have its own datastore table.
18/03/01 11:09:54 INFO MetaStoreDirectSql: Using direct SQL, underlying DB
is DERBY
18/03/01 11:09:54 INFO ObjectStore: Initialized ObjectStore
18/03/01 11:09:54 WARN ObjectStore: Version information not found in
metastore. hive.metastore.schema.verification is not enabled so recording
the schema version 1.2.0
18/03/01 11:09:54 WARN ObjectStore: Failed to get database default,
returning NoSuchObjectException
18/03/01 11:09:54 INFO HiveMetaStore: Added admin role in metastore
18/03/01 11:09:54 INFO HiveMetaStore: Added public role in metastore
18/03/01 11:09:55 INFO HiveMetaStore: No user is added in admin role, since
config is empty
18/03/01 11:09:55 INFO HiveMetaStore: 0: get_all_databases
18/03/01 11:09:55 INFO audit: ugi=devu...@ip.com   ip=unknown-ip-addr
cmd=get_all_databases
18/03/01 11:09:55 INFO HiveMetaStore: 0: get_functions: db=default pat=*
18/03/01 11:09:55 INFO audit: ugi=devu...@ip.com   ip=unknown-ip-addr
cmd=get_functions: db=default pat=*
18/03/01 11:09:55 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as
"embedded-only" so does not have its own datastore table.
18/03/01 11:09:55 INFO SessionState: Created local directory:
/tmp/22ea9ac9-23d1-4247-9e02-ce45809cd9ae_resources
18/03/01 11:09:55 INFO SessionState: Created HDFS directory:
/tmp/hive/hdetldev/22ea9ac9-23d1-4247-9e02-ce45809cd9ae
18/03/01 11:09:55 INFO SessionState: Created local directory:
/tmp/hdetldev/22ea9ac9-23d1-4247-9e02-ce45809cd9ae
18/03/01 11:09:55 INFO SessionState: Created HDFS directory:
/tmp/hive/hdetldev/22ea9ac9-23d1-4247-9e02-ce45809cd9ae/_tmp_space.db
18/03/01 11:09:55 INFO HiveContext: default warehouse location is
/user/hive/warehouse
18/03/01 11:09:55 INFO HiveContext: Initializing HiveMetastoreConnection
version 1.2.1 using Spark classes.
18/03/01 11:09:55 INFO ClientWrapper: Inspected Hadoop version:
2.7.3.2.6.0.3-8
18/03/01 11:09:55 INFO ClientWrapper: Loaded
org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
2.7.3.2.6.0.3-8
18/03/01 11:09:56 INFO metastore: Trying to connect to metastore with URI
thrift://ip.com:9083
18/03/01 11:09:56 INFO metastore: Connected to metastore.
18/03/01 11:09:56 INFO SessionState: Created local directory:
/tmp/24379bb3-8ddf-4716-b68d-07ac0f92d9f1_resources
18/03/01 11:09:56 INFO SessionState: Created HDFS directory:
/tmp/hive/hdetldev/24379bb3-8ddf-4716-b68d-07ac0f92d9f1
18/03/01 11:09:56 INFO SessionState: Created local directory:
/tmp/hdetldev/24379bb3-8ddf-4716-b68d-07ac0f92d9f1
18/03/01 

Fwd: pyspark: Error when training a GMM with an initial GaussianMixtureModel

2015-11-25 Thread Guillaume Maze
Hi all,
We're trying to train a Gaussian Mixture Model (GMM) with a specified
initial model.
The 1.5.1 docs say we should use a GaussianMixtureModel object as input
for the "initialModel" parameter of the GaussianMixture.train method.
Before creating our own initial model (the plan is to use a KMeans
result, for instance), we simply wanted to test this scenario.
So we try to initialize a 2nd training using the GaussianMixtureModel
from the output of a 1st training.
But this trivial scenario throws an error.
Could you please help us determine what's going on here?
Thanks a lot
guillaume

PS: we run (py)spark 1.5.1 with hadoop 2.6

Below is the trivial scenario code and the error:

 SOURCE CODE
from pyspark.mllib.clustering import GaussianMixture
from numpy import array
import sys
import os
import pyspark

### Local default options
K = 2                  # "k" (int) Set the number of Gaussians in the mixture model. Default: 2
convergenceTol = 1e-3  # "convergenceTol" (double) Set the largest change in log-likelihood at which convergence is considered to have occurred.
maxIterations = 100    # "maxIterations" (int) Set the maximum number of iterations to run. Default: 100
seed = None            # "seed" (long) Set the random seed
initialModel = None

### Load and parse the sample data
# Data from the dummy set here: data/mllib/gmm_data.txt
data = sc.textFile("gmm_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.strip().split(' ')]))
print type(parsedData)
print type(parsedData.first())

### 1st training: Build the GMM
gmm = GaussianMixture.train(parsedData, K, convergenceTol, maxIterations,
                            seed, initialModel)

# output parameters of model
for i in range(2):
    print ("weight = ", gmm.weights[i], "mu = ", gmm.gaussians[i].mu,
           "sigma = ", gmm.gaussians[i].sigma.toArray())

### 2nd training: Re-build a GMM using an initial model
initialModel = gmm
print type(initialModel)
gmm = GaussianMixture.train(parsedData, K, convergenceTol, maxIterations,
                            seed, initialModel)


 OUTPUT WITH ERROR:


('weight = ', 0.51945003367044018, 'mu = ', DenseVector([-0.1045,
0.0429]), 'sigma = ', array([[ 4.90706817, -2.00676881],
   [-2.00676881,  1.01143891]]))
('weight = ', 0.48054996632955982, 'mu = ', DenseVector([0.0722,
0.0167]), 'sigma = ', array([[ 4.77975653,  1.87624558],
   [ 1.87624558,  0.91467242]]))


---
Py4JJavaError Traceback (most recent call last)
 in ()
 33 initialModel = gmm
 34 print type(initialModel)
---> 35 gmm = GaussianMixture.train(parsedData, K, convergenceTol,
maxIterations, seed, initialModel) #

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib/clustering.pyc
in train(cls, rdd, k, convergenceTol, maxIterations, seed,
initialModel)
306 java_model =
callMLlibFunc("trainGaussianMixtureModel",
rdd.map(_convert_to_vector),
307k, convergenceTol,
maxIterations, seed,
--> 308initialModelWeights,
initialModelMu, initialModelSigma)
309 return GaussianMixtureModel(java_model)
310

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib/common.pyc
in callMLlibFunc(name, *args)
128 sc = SparkContext._active_spark_context
129 api = getattr(sc._jvm.PythonMLLibAPI(), name)
--> 130 return callJavaFunc(sc, api, *args)
131
132

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib/common.pyc
in callJavaFunc(sc, func, *args)
120 def callJavaFunc(sc, func, *args):
121 """ Call Java Function """
--> 122 args = [_py2java(sc, a) for a in args]
123 return _java2py(sc, func(*args))
124

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib/common.pyc
in _py2java(sc, obj)
 86 else:
 87 data = bytearray(PickleSerializer().dumps(obj))
---> 88 obj = sc._jvm.SerDe.loads(data)
 89 return obj
 90

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/utils.pyc in
deco(*a, **kw)
 34 def deco(*a, **kw):
 35 try:
---> 36 return f(*a, **kw)
 37 except py4j.protocol.Py4JJavaError as e:
 38 s = e.java_exception.toString()

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 

Re: Pyspark: "Error: No main class set in JAR; please specify one with --class"

2015-10-01 Thread Marcelo Vanzin
How are you running the actual application?

I find it slightly odd that you're setting PYSPARK_SUBMIT_ARGS
directly; that's supposed to be an internal env variable used by
Spark. You'd normally pass those parameters in the spark-submit (or
pyspark) command line.
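
For illustration, here is a minimal PySpark sketch of the same idea, reusing the
two jar paths quoted below. The key detail is that the jar list is a single
comma-separated string with no space after the comma: a space most likely makes
spark-submit treat the second path as the application jar, which is one way to
end up with "No main class set in JAR". Treat this as a sketch rather than a
verified fix; jars that must be on the driver classpath generally still need to
be supplied on the spark-submit/pyspark command line via --jars.

from pyspark import SparkConf, SparkContext

# One comma-separated value, no spaces between the paths.
jars = ",".join([
    "/home/me/jars/csv-serde-1.1.2.jar",
    "/home/me/jars/json-serde-1.3-jar-with-dependencies.jar",
])

conf = (SparkConf()
        .setAppName("multi-serde-example")
        .set("spark.jars", jars))  # comma-separated list of jars to ship to executors
sc = SparkContext(conf=conf)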

On Thu, Oct 1, 2015 at 8:56 AM, YaoPau <jonrgr...@gmail.com> wrote:
> I'm trying to add multiple SerDe jars to my pyspark session.
>
> I got the first one working by changing my PYSPARK_SUBMIT_ARGS to:
>
> "--master yarn-client --executor-cores 5 --num-executors 5 --driver-memory
> 3g --executor-memory 5g --jars /home/me/jars/csv-serde-1.1.2.jar"
>
> But when I tried to add a second, using:
>
> "--master yarn-client --executor-cores 5 --num-executors 5 --driver-memory
> 3g --executor-memory 5g --jars /home/me/jars/csv-serde-1.1.2.jar,
> /home/me/jars/json-serde-1.3-jar-with-dependencies.jar"
>
> I got the error "Error: No main class set in JAR; please specify one with
> --class".
>
> How do I specify the class for just the second JAR?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-No-main-class-set-in-JAR-please-specify-one-with-class-tp24900.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Pyspark: "Error: No main class set in JAR; please specify one with --class"

2015-10-01 Thread Ted Yu
In your second command, have you tried changing the comma to a colon?

Cheers

On Thu, Oct 1, 2015 at 8:56 AM, YaoPau <jonrgr...@gmail.com> wrote:

> I'm trying to add multiple SerDe jars to my pyspark session.
>
> I got the first one working by changing my PYSPARK_SUBMIT_ARGS to:
>
> "--master yarn-client --executor-cores 5 --num-executors 5 --driver-memory
> 3g --executor-memory 5g --jars /home/me/jars/csv-serde-1.1.2.jar"
>
> But when I tried to add a second, using:
>
> "--master yarn-client --executor-cores 5 --num-executors 5 --driver-memory
> 3g --executor-memory 5g --jars /home/me/jars/csv-serde-1.1.2.jar,
> /home/me/jars/json-serde-1.3-jar-with-dependencies.jar"
>
> I got the error "Error: No main class set in JAR; please specify one with
> --class".
>
> How do I specify the class for just the second JAR?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-No-main-class-set-in-JAR-please-specify-one-with-class-tp24900.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Pyspark: "Error: No main class set in JAR; please specify one with --class"

2015-10-01 Thread YaoPau
I'm trying to add multiple SerDe jars to my pyspark session.

I got the first one working by changing my PYSPARK_SUBMIT_ARGS to:

"--master yarn-client --executor-cores 5 --num-executors 5 --driver-memory
3g --executor-memory 5g --jars /home/me/jars/csv-serde-1.1.2.jar"

But when I tried to add a second, using: 

"--master yarn-client --executor-cores 5 --num-executors 5 --driver-memory
3g --executor-memory 5g --jars /home/me/jars/csv-serde-1.1.2.jar,
/home/me/jars/json-serde-1.3-jar-with-dependencies.jar"

I got the error "Error: No main class set in JAR; please specify one with
--class".

How do I specify the class for just the second JAR?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-No-main-class-set-in-JAR-please-specify-one-with-class-tp24900.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



pyspark error with zip

2015-03-31 Thread Charles Hayden

The following program fails in the zip step.

x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
z = x.distinct()
print x.zip(y).collect()


The error that is produced depends on whether multiple partitions have been 
specified or not.

I understand that

the two RDDs [must] have the same number of partitions and the same number of 
elements in each partition.

What is the best way to work around this restriction?

I have been performing the operation with the following code, but I am hoping 
to find something more efficient.

def safe_zip(left, right):
    # pair each element with its position, key by the position, then join on it
    ix_left = left.zipWithIndex().map(lambda row: (row[1], row[0]))
    ix_right = right.zipWithIndex().map(lambda row: (row[1], row[0]))
    return ix_left.join(ix_right).sortByKey().values()
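
A quick usage sketch with the RDDs from the example above (the printed pairing
depends on the order in which distinct() emits its values):

x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
z = x.distinct()                      # same values as y, but partitioned differently
print safe_zip(z, y).collect()        # pairs elements by position, e.g. [(1, 1), (2, 2), (3, 3)]

The index join costs an extra shuffle, which is the price of not depending on
identical partitioning.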




Re: Pyspark Error

2014-11-18 Thread Shannon Quinn
My best guess would be a networking issue--it looks like the Python 
socket library isn't able to connect to whatever hostname you're 
providing Spark in the configuration.


On 11/18/14 9:10 AM, amin mohebbi wrote:

Hi there,

*I have already downloaded Pre-built spark-1.1.0, I want to run 
pyspark by try typing ./bin/pyspark but I got the following error:*

*scala shell is up and working fine*

hduser@master:~/Downloads/spark-1.1.0$ ./bin/spark-shell
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; 
support was removed in 8.0
Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties

.
.
14/11/18 04:33:13 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@master:34937/user/HeartbeatReceiver

14/11/18 04:33:13 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala hduser@master:~/Downloads/spark-1.1.0$


*But python shell does not work:*

hduser@master:~/Downloads/spark-1.1.0$
hduser@master:~/Downloads/spark-1.1.0$
hduser@master:~/Downloads/spark-1.1.0$ ./bin/pyspark
Python 2.7.3 (default, Feb 27 2014, 20:00:17)
[GCC 4.6.3] on linux2
Type help, copyright, credits or license for more information.
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; 
support was removed in 8.0
Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties

14/11/18 04:36:06 INFO SecurityManager: Changing view acls to: hduser,
14/11/18 04:36:06 INFO SecurityManager: Changing modify acls to: hduser,
14/11/18 04:36:06 INFO SecurityManager: SecurityManager: 
authentication disabled; ui acls disabled; users with view 
permissions: Set(hduser, ); users with modify permissions: Set(hduser, )

14/11/18 04:36:06 INFO Slf4jLogger: Slf4jLogger started
14/11/18 04:36:06 INFO Remoting: Starting remoting
14/11/18 04:36:06 INFO Remoting: Remoting started; listening on 
addresses :[akka.tcp://sparkDriver@master:52317]
14/11/18 04:36:06 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkDriver@master:52317]
14/11/18 04:36:06 INFO Utils: Successfully started service 
'sparkDriver' on port 52317.

14/11/18 04:36:06 INFO SparkEnv: Registering MapOutputTracker
14/11/18 04:36:06 INFO SparkEnv: Registering BlockManagerMaster
14/11/18 04:36:06 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20141118043606-c346
14/11/18 04:36:07 INFO Utils: Successfully started service 'Connection 
manager for block manager' on port 47507.
14/11/18 04:36:07 INFO ConnectionManager: Bound socket to port 47507 
with id = ConnectionManagerId(master,47507)
14/11/18 04:36:07 INFO MemoryStore: MemoryStore started with capacity 
267.3 MB

14/11/18 04:36:07 INFO BlockManagerMaster: Trying to register BlockManager
14/11/18 04:36:07 INFO BlockManagerMasterActor: Registering block 
manager master:47507 with 267.3 MB RAM

14/11/18 04:36:07 INFO BlockManagerMaster: Registered BlockManager
14/11/18 04:36:07 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-8b29544a-c74b-4a3e-88e0-13801c8dcc65

14/11/18 04:36:07 INFO HttpServer: Starting HTTP Server
14/11/18 04:36:07 INFO Utils: Successfully started service 'HTTP file 
server' on port 40029.
14/11/18 04:36:12 INFO Utils: Successfully started service 'SparkUI' 
on port 4040.
14/11/18 04:36:12 INFO SparkUI: Started SparkUI at http://master:4040 
http://master:4040/
14/11/18 04:36:12 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@master:52317/user/HeartbeatReceiver
14/11/18 04:36:12 INFO SparkUI: Stopped Spark web UI at 
http://master:4040 http://master:4040/

14/11/18 04:36:12 INFO DAGScheduler: Stopping DAGScheduler
14/11/18 04:36:13 INFO MapOutputTrackerMasterActor: 
MapOutputTrackerActor stopped!

14/11/18 04:36:13 INFO ConnectionManager: Selector thread was interrupted!
14/11/18 04:36:13 INFO ConnectionManager: ConnectionManager stopped
14/11/18 04:36:13 INFO MemoryStore: MemoryStore cleared
14/11/18 04:36:13 INFO BlockManager: BlockManager stopped
14/11/18 04:36:13 INFO BlockManagerMaster: BlockManagerMaster stopped
14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: 
Shutting down remote daemon.

14/11/18 04:36:13 INFO SparkContext: Successfully stopped SparkContext
14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: 
Remote daemon shut down; proceeding with flushing remote transports.

14/11/18 04:36:13 INFO Remoting: Remoting shut down
14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: 
Remoting shut down.

Traceback (most recent call last):
  File /home/hduser/Downloads/spark-1.1.0/python/pyspark/shell.py, 
line 44, in module

sc = SparkContext(appName=PySparkShell, pyFiles=add_files)
  File /home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py, 
line 107, in __init__

conf)
  File /home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py, 
line 159, in _do_init

self._accumulatorServer = accumulators._start_update_server()
  File 

Re: Pyspark Error

2014-11-18 Thread Davies Liu
It seems that `localhost` cannot be resolved on your machine; I have
filed https://issues.apache.org/jira/browse/SPARK-4475 to track it.
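
As a quick check of that diagnosis (a sketch only, not a fix), one can test from
plain Python whether `localhost` resolves on the machine, since the traceback
stops while accumulators._start_update_server() is starting a local socket server:

import socket

# socket.gaierror here would mean `localhost` cannot be resolved on this host.
print socket.gethostbyname("localhost")            # normally 127.0.0.1
print socket.gethostbyname(socket.gethostname())   # the machine's own hostname

A missing "127.0.0.1 localhost" entry in /etc/hosts is the usual culprit.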

On Tue, Nov 18, 2014 at 6:10 AM, amin mohebbi
aminn_...@yahoo.com.invalid wrote:
 Hi there,

 I have already downloaded Pre-built spark-1.1.0, I want to run pyspark by
 try typing ./bin/pyspark but I got the following error:








 scala shell is up and working fine

 hduser@master:~/Downloads/spark-1.1.0$ ./bin/spark-shell
 Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m;
 support was removed in 8.0
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 .
 .
 14/11/18 04:33:13 INFO AkkaUtils: Connecting to HeartbeatReceiver:
 akka.tcp://sparkDriver@master:34937/user/HeartbeatReceiver
 14/11/18 04:33:13 INFO SparkILoop: Created spark context..
 Spark context available as sc.

 scala hduser@master:~/Downloads/spark-1.1.0$



 But python shell does not work:

 hduser@master:~/Downloads/spark-1.1.0$
 hduser@master:~/Downloads/spark-1.1.0$
 hduser@master:~/Downloads/spark-1.1.0$ ./bin/pyspark
 Python 2.7.3 (default, Feb 27 2014, 20:00:17)
 [GCC 4.6.3] on linux2
 Type help, copyright, credits or license for more information.
 Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m;
 support was removed in 8.0
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 14/11/18 04:36:06 INFO SecurityManager: Changing view acls to: hduser,
 14/11/18 04:36:06 INFO SecurityManager: Changing modify acls to: hduser,
 14/11/18 04:36:06 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(hduser, );
 users with modify permissions: Set(hduser, )
 14/11/18 04:36:06 INFO Slf4jLogger: Slf4jLogger started
 14/11/18 04:36:06 INFO Remoting: Starting remoting
 14/11/18 04:36:06 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@master:52317]
 14/11/18 04:36:06 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://sparkDriver@master:52317]
 14/11/18 04:36:06 INFO Utils: Successfully started service 'sparkDriver' on
 port 52317.
 14/11/18 04:36:06 INFO SparkEnv: Registering MapOutputTracker
 14/11/18 04:36:06 INFO SparkEnv: Registering BlockManagerMaster
 14/11/18 04:36:06 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20141118043606-c346
 14/11/18 04:36:07 INFO Utils: Successfully started service 'Connection
 manager for block manager' on port 47507.
 14/11/18 04:36:07 INFO ConnectionManager: Bound socket to port 47507 with id
 = ConnectionManagerId(master,47507)
 14/11/18 04:36:07 INFO MemoryStore: MemoryStore started with capacity 267.3
 MB
 14/11/18 04:36:07 INFO BlockManagerMaster: Trying to register BlockManager
 14/11/18 04:36:07 INFO BlockManagerMasterActor: Registering block manager
 master:47507 with 267.3 MB RAM
 14/11/18 04:36:07 INFO BlockManagerMaster: Registered BlockManager
 14/11/18 04:36:07 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-8b29544a-c74b-4a3e-88e0-13801c8dcc65
 14/11/18 04:36:07 INFO HttpServer: Starting HTTP Server
 14/11/18 04:36:07 INFO Utils: Successfully started service 'HTTP file
 server' on port 40029.
 14/11/18 04:36:12 INFO Utils: Successfully started service 'SparkUI' on port
 4040.
 14/11/18 04:36:12 INFO SparkUI: Started SparkUI at http://master:4040
 14/11/18 04:36:12 INFO AkkaUtils: Connecting to HeartbeatReceiver:
 akka.tcp://sparkDriver@master:52317/user/HeartbeatReceiver
 14/11/18 04:36:12 INFO SparkUI: Stopped Spark web UI at http://master:4040
 14/11/18 04:36:12 INFO DAGScheduler: Stopping DAGScheduler
 14/11/18 04:36:13 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor
 stopped!
 14/11/18 04:36:13 INFO ConnectionManager: Selector thread was interrupted!
 14/11/18 04:36:13 INFO ConnectionManager: ConnectionManager stopped
 14/11/18 04:36:13 INFO MemoryStore: MemoryStore cleared
 14/11/18 04:36:13 INFO BlockManager: BlockManager stopped
 14/11/18 04:36:13 INFO BlockManagerMaster: BlockManagerMaster stopped
 14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
 down remote daemon.
 14/11/18 04:36:13 INFO SparkContext: Successfully stopped SparkContext
 14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: Remote
 daemon shut down; proceeding with flushing remote transports.
 14/11/18 04:36:13 INFO Remoting: Remoting shut down
 14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
 shut down.
 Traceback (most recent call last):
   File /home/hduser/Downloads/spark-1.1.0/python/pyspark/shell.py, line
 44, in module
 sc = SparkContext(appName=PySparkShell, pyFiles=add_files)
   File /home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py, line
 107, in __init__
 conf)
   File /home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py, line
 159, in _do_init
 self._accumulatorServer = accumulators._start_update_server()
   File 

Re: Pyspark Error when broadcast numpy array

2014-11-12 Thread bliuab
Dear Liu:

I have tested this issue under Spark-1.1.0. The problem is solved under
this newer version.


On Wed, Nov 12, 2014 at 3:18 PM, Bo Liu bli...@cse.ust.hk wrote:

 Dear Liu:

 Thank you for your replay. I will set up an experimental environment for
 spark-1.1 and test it.

 On Wed, Nov 12, 2014 at 2:30 PM, Davies Liu-2 [via Apache Spark User List]
 ml-node+s1001560n1868...@n3.nabble.com wrote:

 Yes, your broadcast should be about 300M, much smaller than 2G, I
 didn't read your post carefully.

 The broadcast in Python had been improved much since 1.1, I think it
 will work in 1.1 or upcoming 1.2 release, could you upgrade to 1.1?

 Davies

 On Tue, Nov 11, 2014 at 8:37 PM, bliuab [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18684i=0 wrote:

  Dear Liu:
 
  Thank you very much for your help. I will update that patch. By the
 way, as
  I have succeed to broadcast an array of size(30M) the log said that
 such
  array takes around 230MB memory. As a result, I think the numpy array
 that
  leads to error is much smaller than 2G.
 
  On Wed, Nov 12, 2014 at 12:29 PM, Davies Liu-2 [via Apache Spark User
 List]
  [hidden email] wrote:
 
  This PR fix the problem: https://github.com/apache/spark/pull/2659
 
  cc @josh
 
  Davies
 
  On Tue, Nov 11, 2014 at 7:47 PM, bliuab [hidden email] wrote:
 
   In spark-1.0.2, I have come across an error when I try to broadcast
 a
   quite
   large numpy array(with 35M dimension). The error information except
 the
   java.lang.NegativeArraySizeException error and details is listed
 below.
   Moreover, when broadcast a relatively smaller numpy array(30M
   dimension),
   everything works fine. And 30M dimension numpy array takes 230M
 memory
   which, in my opinion, not very large.
   As far as I have surveyed, it seems related with py4j. However, I
 have
   no
   idea how to fix  this. I would be appreciated if I can get some
 hint.
   
   py4j.protocol.Py4JError: An error occurred while calling
 o23.broadcast.
   Trace:
   java.lang.NegativeArraySizeException
   at py4j.Base64.decode(Base64.java:292)
   at py4j.Protocol.getBytes(Protocol.java:167)
   at py4j.Protocol.getObject(Protocol.java:276)
   at
   py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
   at py4j.commands.CallCommand.execute(CallCommand.java:77)
   at py4j.GatewayConnection.run(GatewayConnection.java:207)
   -
   And the test code is a follows:
   conf =
  
   SparkConf().setAppName('brodyliu_LR').setMaster('spark://
 10.231.131.87:5051')
   conf.set('spark.executor.memory', '4000m')
   conf.set('spark.akka.timeout', '10')
   conf.set('spark.ui.port','8081')
   conf.set('spark.cores.max','150')
   #conf.set('spark.rdd.compress', 'True')
   conf.set('spark.default.parallelism', '300')
   #configure the spark environment
   sc = SparkContext(conf=conf, batchSize=1)
  
   vec = np.random.rand(3500)
   a = sc.broadcast(vec)
  
  
  
  
  
  
   --
   View this message in context:
  
 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662.html
   Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
  
  
 -
   To unsubscribe, e-mail: [hidden email]
   For additional commands, e-mail: [hidden email]
  
 
  -
  To unsubscribe, e-mail: [hidden email]
  For additional commands, e-mail: [hidden email]
 
 
 
  
  If you reply to this email, your message will be added to the
 discussion
  below:
 
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662p18673.html
  To unsubscribe from Pyspark Error when broadcast numpy array, click
 here.
  NAML
 
 
 
 
  --
  My Homepage: www.cse.ust.hk/~bliuab
  MPhil student in Hong Kong University of Science and Technology.
  Clear Water Bay, Kowloon, Hong Kong.
  Profile at LinkedIn.
 
  
  View this message in context: Re: Pyspark Error when broadcast numpy
 array
 
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18684i=1
 For additional commands, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18684i=2



 --
  If you reply to this email, your message will be added to the
 discussion below:

 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662p18684.html
  To unsubscribe from Pyspark Error when broadcast numpy array, click here
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=18662code

Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
In spark-1.0.2, I have come across an error when I try to broadcast a fairly
large numpy array (35M dimensions). The error, a
java.lang.NegativeArraySizeException, is listed with details below.
Moreover, when I broadcast a somewhat smaller numpy array (30M dimensions),
everything works fine, and a 30M-dimension numpy array takes about 230M of
memory, which, in my opinion, is not very large.
As far as I can tell, it seems related to py4j. However, I have no
idea how to fix this. I would appreciate any hints.

py4j.protocol.Py4JError: An error occurred while calling o23.broadcast.
Trace:
java.lang.NegativeArraySizeException
at py4j.Base64.decode(Base64.java:292)
at py4j.Protocol.getBytes(Protocol.java:167)
at py4j.Protocol.getObject(Protocol.java:276)
at
py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
at py4j.commands.CallCommand.execute(CallCommand.java:77)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
-
And the test code is a follows:
conf = SparkConf().setAppName('brodyliu_LR').setMaster('spark://10.231.131.87:5051')
conf.set('spark.executor.memory', '4000m')
conf.set('spark.akka.timeout', '10')
conf.set('spark.ui.port', '8081')
conf.set('spark.cores.max', '150')
#conf.set('spark.rdd.compress', 'True')
conf.set('spark.default.parallelism', '300')
#configure the spark environment
sc = SparkContext(conf=conf, batchSize=1)

vec = np.random.rand(3500)
a = sc.broadcast(vec)






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread Davies Liu
This PR fix the problem: https://github.com/apache/spark/pull/2659

cc @josh

Davies

On Tue, Nov 11, 2014 at 7:47 PM, bliuab bli...@cse.ust.hk wrote:
 In spark-1.0.2, I have come across an error when I try to broadcast a quite
 large numpy array(with 35M dimension). The error information except the
 java.lang.NegativeArraySizeException error and details is listed below.
 Moreover, when broadcast a relatively smaller numpy array(30M dimension),
 everything works fine. And 30M dimension numpy array takes 230M memory
 which, in my opinion, not very large.
 As far as I have surveyed, it seems related with py4j. However, I have no
 idea how to fix  this. I would be appreciated if I can get some hint.
 
 py4j.protocol.Py4JError: An error occurred while calling o23.broadcast.
 Trace:
 java.lang.NegativeArraySizeException
 at py4j.Base64.decode(Base64.java:292)
 at py4j.Protocol.getBytes(Protocol.java:167)
 at py4j.Protocol.getObject(Protocol.java:276)
 at
 py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
 at py4j.commands.CallCommand.execute(CallCommand.java:77)
 at py4j.GatewayConnection.run(GatewayConnection.java:207)
 -
 And the test code is a follows:
 conf =
 SparkConf().setAppName('brodyliu_LR').setMaster('spark://10.231.131.87:5051')
 conf.set('spark.executor.memory', '4000m')
 conf.set('spark.akka.timeout', '10')
 conf.set('spark.ui.port','8081')
 conf.set('spark.cores.max','150')
 #conf.set('spark.rdd.compress', 'True')
 conf.set('spark.default.parallelism', '300')
 #configure the spark environment
 sc = SparkContext(conf=conf, batchSize=1)

 vec = np.random.rand(3500)
 a = sc.broadcast(vec)






 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
Dear Liu:

Thank you very much for your help. I will update that patch. By the way, as
I have succeed to broadcast an array of size(30M) the log said that such
array takes around 230MB memory. As a result, I think the numpy array that
leads to error is much smaller than 2G.

On Wed, Nov 12, 2014 at 12:29 PM, Davies Liu-2 [via Apache Spark User List]
ml-node+s1001560n18673...@n3.nabble.com wrote:

 This PR fix the problem: https://github.com/apache/spark/pull/2659

 cc @josh

 Davies

 On Tue, Nov 11, 2014 at 7:47 PM, bliuab [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18673i=0 wrote:

  In spark-1.0.2, I have come across an error when I try to broadcast a
 quite
  large numpy array(with 35M dimension). The error information except the
  java.lang.NegativeArraySizeException error and details is listed below.
  Moreover, when broadcast a relatively smaller numpy array(30M
 dimension),
  everything works fine. And 30M dimension numpy array takes 230M memory
  which, in my opinion, not very large.
  As far as I have surveyed, it seems related with py4j. However, I have
 no
  idea how to fix  this. I would be appreciated if I can get some hint.
  
  py4j.protocol.Py4JError: An error occurred while calling o23.broadcast.
  Trace:
  java.lang.NegativeArraySizeException
  at py4j.Base64.decode(Base64.java:292)
  at py4j.Protocol.getBytes(Protocol.java:167)
  at py4j.Protocol.getObject(Protocol.java:276)
  at
  py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
  at py4j.commands.CallCommand.execute(CallCommand.java:77)
  at py4j.GatewayConnection.run(GatewayConnection.java:207)
  -
  And the test code is a follows:
  conf =
  SparkConf().setAppName('brodyliu_LR').setMaster('spark://
 10.231.131.87:5051')
  conf.set('spark.executor.memory', '4000m')
  conf.set('spark.akka.timeout', '10')
  conf.set('spark.ui.port','8081')
  conf.set('spark.cores.max','150')
  #conf.set('spark.rdd.compress', 'True')
  conf.set('spark.default.parallelism', '300')
  #configure the spark environment
  sc = SparkContext(conf=conf, batchSize=1)
 
  vec = np.random.rand(3500)
  a = sc.broadcast(vec)
 
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18673i=1
  For additional commands, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18673i=2
 

 -
 To unsubscribe, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18673i=3
 For additional commands, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18673i=4



 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662p18673.html
  To unsubscribe from Pyspark Error when broadcast numpy array, click here
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=18662code=YmxpdWFiQGNzZS51c3QuaGt8MTg2NjJ8NTUwMDMxMjYz
 .
 NAML
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespacebreadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml




-- 
My Homepage: www.cse.ust.hk/~bliuab
MPhil student in Hong Kong University of Science and Technology.
Clear Water Bay, Kowloon, Hong Kong.
Profile at LinkedIn http://www.linkedin.com/pub/liu-bo/55/52b/10b.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662p18674.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread Davies Liu
Yes, your broadcast should be about 300M, much smaller than 2G, I
didn't read your post carefully.

The broadcast in Python had been improved much since 1.1, I think it
will work in 1.1 or upcoming 1.2 release, could you upgrade to 1.1?

Davies
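
For reference, a rough way to gauge the size of such a broadcast before sending
it (a sketch only; the array size below is the 35M figure from the original
post, and the pickled size is only an approximation of what actually crosses
the py4j boundary):

import numpy as np
from pyspark.serializers import PickleSerializer

vec = np.random.rand(35000000)                            # ~35M float64 values
print "raw numpy size: %.1f MB" % (vec.nbytes / 1e6)      # 8 bytes each, roughly 280 MB

data = PickleSerializer().dumps(vec)                      # pickled payload sent to the JVM
print "pickled size:   %.1f MB" % (len(data) / 1e6)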

On Tue, Nov 11, 2014 at 8:37 PM, bliuab bli...@cse.ust.hk wrote:
 Dear Liu:

 Thank you very much for your help. I will update that patch. By the way, as
 I have succeed to broadcast an array of size(30M) the log said that such
 array takes around 230MB memory. As a result, I think the numpy array that
 leads to error is much smaller than 2G.

 On Wed, Nov 12, 2014 at 12:29 PM, Davies Liu-2 [via Apache Spark User List]
 [hidden email] wrote:

 This PR fix the problem: https://github.com/apache/spark/pull/2659

 cc @josh

 Davies

 On Tue, Nov 11, 2014 at 7:47 PM, bliuab [hidden email] wrote:

  In spark-1.0.2, I have come across an error when I try to broadcast a
  quite
  large numpy array(with 35M dimension). The error information except the
  java.lang.NegativeArraySizeException error and details is listed below.
  Moreover, when broadcast a relatively smaller numpy array(30M
  dimension),
  everything works fine. And 30M dimension numpy array takes 230M memory
  which, in my opinion, not very large.
  As far as I have surveyed, it seems related with py4j. However, I have
  no
  idea how to fix  this. I would be appreciated if I can get some hint.
  
  py4j.protocol.Py4JError: An error occurred while calling o23.broadcast.
  Trace:
  java.lang.NegativeArraySizeException
  at py4j.Base64.decode(Base64.java:292)
  at py4j.Protocol.getBytes(Protocol.java:167)
  at py4j.Protocol.getObject(Protocol.java:276)
  at
  py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
  at py4j.commands.CallCommand.execute(CallCommand.java:77)
  at py4j.GatewayConnection.run(GatewayConnection.java:207)
  -
  And the test code is a follows:
  conf =
 
  SparkConf().setAppName('brodyliu_LR').setMaster('spark://10.231.131.87:5051')
  conf.set('spark.executor.memory', '4000m')
  conf.set('spark.akka.timeout', '10')
  conf.set('spark.ui.port','8081')
  conf.set('spark.cores.max','150')
  #conf.set('spark.rdd.compress', 'True')
  conf.set('spark.default.parallelism', '300')
  #configure the spark environment
  sc = SparkContext(conf=conf, batchSize=1)
 
  vec = np.random.rand(3500)
  a = sc.broadcast(vec)
 
 
 
 
 
 
  --
  View this message in context:
  http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: [hidden email]
  For additional commands, e-mail: [hidden email]
 

 -
 To unsubscribe, e-mail: [hidden email]
 For additional commands, e-mail: [hidden email]



 
 If you reply to this email, your message will be added to the discussion
 below:

 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662p18673.html
 To unsubscribe from Pyspark Error when broadcast numpy array, click here.
 NAML




 --
 My Homepage: www.cse.ust.hk/~bliuab
 MPhil student in Hong Kong University of Science and Technology.
 Clear Water Bay, Kowloon, Hong Kong.
 Profile at LinkedIn.

 
 View this message in context: Re: Pyspark Error when broadcast numpy array

 Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
Dear Liu:

Thank you for your replay. I will set up an experimental environment for
spark-1.1 and test it.

On Wed, Nov 12, 2014 at 2:30 PM, Davies Liu-2 [via Apache Spark User List] 
ml-node+s1001560n1868...@n3.nabble.com wrote:

 Yes, your broadcast should be about 300M, much smaller than 2G, I
 didn't read your post carefully.

 The broadcast in Python had been improved much since 1.1, I think it
 will work in 1.1 or upcoming 1.2 release, could you upgrade to 1.1?

 Davies

 On Tue, Nov 11, 2014 at 8:37 PM, bliuab [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18684i=0 wrote:

  Dear Liu:
 
  Thank you very much for your help. I will update that patch. By the way,
 as
  I have succeed to broadcast an array of size(30M) the log said that such
  array takes around 230MB memory. As a result, I think the numpy array
 that
  leads to error is much smaller than 2G.
 
  On Wed, Nov 12, 2014 at 12:29 PM, Davies Liu-2 [via Apache Spark User
 List]
  [hidden email] wrote:
 
  This PR fix the problem: https://github.com/apache/spark/pull/2659
 
  cc @josh
 
  Davies
 
  On Tue, Nov 11, 2014 at 7:47 PM, bliuab [hidden email] wrote:
 
   In spark-1.0.2, I have come across an error when I try to broadcast a
   quite
   large numpy array(with 35M dimension). The error information except
 the
   java.lang.NegativeArraySizeException error and details is listed
 below.
   Moreover, when broadcast a relatively smaller numpy array(30M
   dimension),
   everything works fine. And 30M dimension numpy array takes 230M
 memory
   which, in my opinion, not very large.
   As far as I have surveyed, it seems related with py4j. However, I
 have
   no
   idea how to fix  this. I would be appreciated if I can get some hint.
   
   py4j.protocol.Py4JError: An error occurred while calling
 o23.broadcast.
   Trace:
   java.lang.NegativeArraySizeException
   at py4j.Base64.decode(Base64.java:292)
   at py4j.Protocol.getBytes(Protocol.java:167)
   at py4j.Protocol.getObject(Protocol.java:276)
   at
   py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
   at py4j.commands.CallCommand.execute(CallCommand.java:77)
   at py4j.GatewayConnection.run(GatewayConnection.java:207)
   -
   And the test code is a follows:
   conf =
  
   SparkConf().setAppName('brodyliu_LR').setMaster('spark://
 10.231.131.87:5051')
   conf.set('spark.executor.memory', '4000m')
   conf.set('spark.akka.timeout', '10')
   conf.set('spark.ui.port','8081')
   conf.set('spark.cores.max','150')
   #conf.set('spark.rdd.compress', 'True')
   conf.set('spark.default.parallelism', '300')
   #configure the spark environment
   sc = SparkContext(conf=conf, batchSize=1)
  
   vec = np.random.rand(3500)
   a = sc.broadcast(vec)
  
  
  
  
  
  
   --
   View this message in context:
  
 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662.html
   Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
  
   -
   To unsubscribe, e-mail: [hidden email]
   For additional commands, e-mail: [hidden email]
  
 
  -
  To unsubscribe, e-mail: [hidden email]
  For additional commands, e-mail: [hidden email]
 
 
 
  
  If you reply to this email, your message will be added to the
 discussion
  below:
 
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662p18673.html
  To unsubscribe from Pyspark Error when broadcast numpy array, click
 here.
  NAML
 
 
 
 
  --
  My Homepage: www.cse.ust.hk/~bliuab
  MPhil student in Hong Kong University of Science and Technology.
  Clear Water Bay, Kowloon, Hong Kong.
  Profile at LinkedIn.
 
  
  View this message in context: Re: Pyspark Error when broadcast numpy
 array
 
  Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18684i=1
 For additional commands, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18684i=2



 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662p18684.html
  To unsubscribe from Pyspark Error when broadcast numpy array, click here
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=18662code=YmxpdWFiQGNzZS51c3QuaGt8MTg2NjJ8NTUwMDMxMjYz
 .
 NAML
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase

PySpark Error on Windows with sc.wholeTextFiles

2014-10-16 Thread Griffiths, Michael (NYC-RPM)
Hi,

I'm running into an error on Windows (x64, 8.1) running Spark 1.1.0 (pre-built 
for Hadoop 2.4: spark-1.1.0-bin-hadoop2.4.tgz, 
http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.4.tgz) 
with Java SE Version 8 Update 20 (build 1.8.0_20-b26); just getting started 
with Spark.

When running sc.wholeTextFiles() on a directory, I can run the command but not 
do anything with the resulting RDD - specifically, I get an error in 
py4j.protocol.Py4JJavaError; the error is unspecified, though the location is 
included. I've attached the traceback below.

In this situation, I'm trying to load all files from a folder on the local 
filesystem, located at D:\testdata. The folder contains one file, which can be 
loaded successfully with sc.textFile("d:/testdata/filename") - no problems at 
all - so I do not believe the file is throwing the error.

Is there any advice on what I should look at further to isolate or fix the 
error? Am I doing something obviously wrong?

Thanks,
Michael


Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
  /_/

Using Python version 2.7.7 (default, Jun 11 2014 10:40:02)
SparkContext available as sc.
 file = sc.textFile(d:/testdata/cbcc5b470ec06f212990c68c8f76e887b884)
 file.count()
732
 file.first()
u'!DOCTYPE html'
 data = sc.wholeTextFiles('d:/testdata')
 data.first()
Traceback (most recent call last):
  File stdin, line 1, in module
  File D:\spark\python\pyspark\rdd.py, line 1167, in first
return self.take(1)[0]
  File D:\spark\python\pyspark\rdd.py, line 1126, in take
totalParts = self._jrdd.partitions().size()
  File D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py, line 
538, in __call__
  File D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py, line 300, 
in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions.
: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534)
at 
org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)
   at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
at 
org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54)
at 
org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at 
org.apache.spark.api.java.JavaPairRDD.partitions(JavaPairRDD.scala:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)

 data.count()
Traceback (most recent call last):
  File stdin, line 1, in module
  File D:\spark\python\pyspark\rdd.py, line 847, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File D:\spark\python\pyspark\rdd.py, line 838, in sum
return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File D:\spark\python\pyspark\rdd.py, line 759, in reduce
vals = self.mapPartitions(func).collect()
  File 

Re: PySpark Error on Windows with sc.wholeTextFiles

2014-10-16 Thread Davies Liu
It's a bug, could you file a JIRA for this? Thanks!

Davies
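
One more thing that may be worth checking (an assumption based on the stack
trace, not something confirmed in this thread): the NullPointerException is
raised while org.apache.hadoop.util.Shell launches an external process, and on
Windows that code path normally relies on winutils.exe being available under
HADOOP_HOME. A small sketch to verify that prerequisite:

import os

# Hypothetical pre-flight check: wholeTextFiles lists file permissions through
# Hadoop's Shell, which on Windows shells out to %HADOOP_HOME%\bin\winutils.exe.
hadoop_home = os.environ.get("HADOOP_HOME")
if not hadoop_home:
    print "HADOOP_HOME is not set"
else:
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print winutils, "exists" if os.path.exists(winutils) else "is missing"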

On Thu, Oct 16, 2014 at 8:28 AM, Griffiths, Michael (NYC-RPM)
michael.griffi...@reprisemedia.com wrote:
 Hi,



 I'm running into an error on Windows (x64, 8.1) running Spark 1.1.0
 (pre-built for Hadoop 2.4: spark-1.1.0-bin-hadoop2.4.tgz) with Java SE
 Version 8 Update 20 (build 1.8.0_20-b26); just getting started with Spark.



 When running sc.wholeTextFiles() on a directory, I can run the command but
 not do anything with the resulting RDD – specifically, I get an error in
 py4j.protocol.Py4JJavaError; the error is unspecified, though the location
 is included. I’ve attached the traceback below.



 In this situation, I'm trying to load all files from a folder on the local
 filesystem, located at D:\testdata. The folder contains one file, which can
 be loaded successfully with sc.textFile("d:/testdata/filename") - no
 problems at all - so I do not believe the file is throwing the error.



 Is there any advice on what I should look at further to isolate or fix the
 error? Am I doing something obviously wrong?



 Thanks,

 Michael





 Welcome to

     __

  / __/__  ___ _/ /__

 _\ \/ _ \/ _ `/ __/  '_/

/__ / .__/\_,_/_/ /_/\_\   version 1.1.0

   /_/



 Using Python version 2.7.7 (default, Jun 11 2014 10:40:02)

 SparkContext available as sc.

 file =
 sc.textFile(d:/testdata/cbcc5b470ec06f212990c68c8f76e887b884)

 file.count()

 732

 file.first()

 u'!DOCTYPE html'

 data = sc.wholeTextFiles('d:/testdata')

 data.first()

 Traceback (most recent call last):

   File stdin, line 1, in module

   File D:\spark\python\pyspark\rdd.py, line 1167, in first

 return self.take(1)[0]

   File D:\spark\python\pyspark\rdd.py, line 1126, in take

 totalParts = self._jrdd.partitions().size()

   File D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py, line
 538, in __call__

   File D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py, line
 300, in get_return_value

 py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions.

 : java.lang.NullPointerException

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)

 at org.apache.hadoop.util.Shell.run(Shell.java:418)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534)

 at
 org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)

at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679)

 at
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)

 at
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)

 at
 org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54)

 at
 org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219)

 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)

 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)

 at
 org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)

 at
 org.apache.spark.api.java.JavaPairRDD.partitions(JavaPairRDD.scala:44)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

 at java.lang.reflect.Method.invoke(Unknown Source)

 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

 at
 py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

 at py4j.Gateway.invoke(Gateway.java:259)

 at
 py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

 at py4j.commands.CallCommand.execute(CallCommand.java:79)

 at py4j.GatewayConnection.run(GatewayConnection.java:207)

 at java.lang.Thread.run(Unknown Source)



 data.count()

 Traceback (most recent call last):

   File stdin, line 1, in module

   File D:\spark\python\pyspark\rdd.py, line 847, in count

 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()

   File