Re: FPGrowth does not handle large result sets

2016-01-12 Thread Ritu Raj Tiwari
I have been giving it 8-12G

-Raj
Sent from my iPhone

> On Jan 12, 2016, at 6:50 PM, Sabarish Sasidharan 
>  wrote:
> 
> How much RAM are you giving to the driver? 17000 items being collected 
> shouldn't fail unless your driver memory is too low.
> 
> Regards
> Sab
> 
>> On 13-Jan-2016 6:14 am, "Ritu Raj Tiwari"  
>> wrote:
>> Folks:
>> We are running into a problem where FPGrowth seems to choke on data sets 
>> that we think are not too large. We have about 200,000 transactions. Each 
>> transaction is composed of, on average, 50 items. There are about 17,000 
>> unique items (SKUs) that might show up in any transaction.
>> 
>> When running locally with 12G ram given to the PySpark process, the FPGrowth 
>> code fails with out of memory error for minSupport of 0.001. The failure 
>> occurs when we try to enumerate and save the frequent itemsets. Looking at 
>> the FPGrowth code 
>> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala),
>>  it seems this is because the genFreqItems() method tries to collect() all 
>> items. Is there a way the code could be rewritten so it does not try to 
>> collect and therefore store all frequent item sets in memory?
>> 
>> Thanks for any insights.
>> 
>> -Raj
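
For reference, one way to sidestep holding every frequent itemset on the driver is to treat the model's freqItemsets as an RDD and write it out directly instead of collecting it. The original poster is on PySpark; the sketch below is the Scala equivalent of that idea, and the transactions RDD, the minimum support value and the output path are placeholders:

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// transactions: one basket of SKU strings per element (placeholder input)
def saveFrequentItemsets(transactions: RDD[Array[String]], outputPath: String): Unit = {
  val model = new FPGrowth()
    .setMinSupport(0.001)
    .setNumPartitions(200)   // spreads the FP-tree work; tune for the cluster
    .run(transactions)

  // freqItemsets is an RDD[FreqItemset[String]], so it can be written out
  // without ever materializing the full result on the driver.
  model.freqItemsets
    .map(fi => fi.items.mkString("[", ",", "]") + " -> " + fi.freq)
    .saveAsTextFile(outputPath)
}

This does not change the internal collect() of the distinct frequent items in genFreqItems(), but as noted above that set is only ~17,000 entries; the collect worth avoiding is the one over the itemsets themselves.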


Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
Hi Richard,

> Would it be possible to access the session API from within ROSE,
> to get for example the images that are generated by R / openCPU

Technically it would be possible although there would be some potentially 
significant runtime costs per task in doing so, primarily those related to 
extracting image data from the R session, serializing and then moving that data 
across the cluster for each and every image.

From a design perspective ROSE was intended to be used within Spark-scale 
applications where R object data was seen as the primary task output: an output 
in a format that could be rapidly serialized and easily processed. Are there 
real-world use cases where Spark-scale applications capable of generating 10k, 
100k, or even millions of image files would actually need to capture and store 
images? If so, how, practically speaking, would these images ever be used? I'm 
just not sure. Maybe you could describe your own use case to provide some 
insights?

> and the logging to stdout that is logged by R?

If you are referring to the R console output (generated within the R session 
during the execution of an OCPUTask) then this data could certainly 
(optionally) be captured and returned on an OCPUResult. Again, can you provide 
any details for how you might use this console output in a real world 
application?

As an aside, for simple standalone Spark applications that will only ever run 
on a single host (no cluster) you could consider using an alternative library 
called fluent-r. This library is also available under my GitHub repo, [see 
here](https://github.com/onetapbeyond/fluent-r). The fluent-r library already 
has support for the retrieval of R objects, R console output and R graphics 
device image/plots. However, it is not as lightweight as ROSE and it is not 
designed to work in a clustered environment. ROSE on the other hand is designed 
for scale.

David

"All that is gold does not glitter, Not all those who wander are lost."



 Original Message 
Subject: Re: ROSE: Spark + R on the JVM.
Local Time: January 12 2016 6:56 pm
UTC Time: January 12 2016 11:56 pm
From: rsiebel...@gmail.com
To: m...@vijaykiran.com
CC: 
cjno...@gmail.com,themarchoffo...@protonmail.com,user@spark.apache.org,d...@spark.apache.org



Hi,

this looks great and seems to be very usable.
Would it be possible to access the session API from within ROSE, to get for 
example the images that are generated by R / openCPU and the logging to stdout 
that is logged by R?

thanks in advance,
Richard



On Tue, Jan 12, 2016 at 10:16 PM, Vijay Kiran  wrote:

I think it would be this: https://github.com/onetapbeyond/opencpu-spark-executor

> On 12 Jan 2016, at 18:32, Corey Nolet  wrote:
>


> David,
>
> Thank you very much for announcing this! It looks like it could be very 
> useful. Would you mind providing a link to the github?
>
> On Tue, Jan 12, 2016 at 10:03 AM, David  
> wrote:
> Hi all,
>
> I'd like to share news of the recent release of a new Spark package, ROSE.
>
> ROSE is a Scala library offering access to the full scientific computing 
> power of the R programming language to Apache Spark batch and streaming 
> applications on the JVM. Where Apache SparkR lets data scientists use Spark 
> from R, ROSE is designed to let Scala and Java developers use R from Spark.
>
> The project is available and documented on GitHub and I would encourage you 
> to take a look. Any feedback, questions etc very welcome.
>
> David
>
> "All that is gold does not glitter, Not all those who wander are lost."
>




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: PCA OutOfMemoryError

2016-01-12 Thread Bharath Ravi Kumar
Any suggestion/opinion?
On 12-Jan-2016 2:06 pm, "Bharath Ravi Kumar"  wrote:

> We're running PCA (selecting 100 principal components) on a dataset that
> has ~29K columns and is 70G in size stored in ~600 parts on HDFS. The
> matrix in question is mostly sparse with tens of columns populated in most
> rows, but a few rows with thousands of columns populated. We're running
> spark on mesos with driver memory set to 40G and executor memory set to
> 80G. We're however encountering an out of memory error (included at the end
> of the message) regardless of the number of rdd partitions or the degree of
> task parallelism being set. I noticed a warning at the beginning of the PCA
> computation stage: " WARN
> org.apache.spark.mllib.linalg.distributed.RowMatrix: 29604 columns will
> require at least 7011 megabyte  of memory!"
> I don't understand which memory this refers to. Is this the executor
> memory?  The driver memory? Any other?
> The stacktrace appears to indicate that a large array is probably being
> passed along with the task. Could this array have been passed as a
> broadcast variable instead ? Any suggestions / workarounds other than
> re-implementing the algorithm?
>
> Thanks,
> Bharath
>
> 
>
> Exception in thread "main" java.lang.OutOfMemoryError: Requested array
> size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at
> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
> at
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
> at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
> at
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
> at
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
> at
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2030)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
> at
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1100)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1091)
> at
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:124)
> at
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:350)
> at
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:386)
> at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:46)
>
>
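
A rough reading of the numbers, assuming the Spark 1.5 RowMatrix code path shown in the stack trace (computePrincipalComponents -> computeCovariance -> computeGramianMatrix):

  dense Gramian:          29,604 x 29,604 x 8 bytes      ~= 7,011 MB (the figure in the warning)
  packed upper triangle:  29,604 x 29,605 / 2 x 8 bytes  ~= 3.5 GB (the treeAggregate buffer)
  JVM limit:              a single byte[] tops out near 2 GB, so serializing that buffer in one piece cannot succeed

If that reading is right, the warning refers to memory needed on the driver and on each executor to hold the dense Gramian, but the crash itself happens while the packed aggregation buffer is being serialized with the job (the ByteArrayOutputStream.grow frames in the trace), so it hits the array-size limit no matter how much driver or executor memory is configured. Broadcasting that buffer, as the mail suggests, or an approach that avoids forming the dense Gramian at all, would be the direction to look; with ~29K columns the Gramian-based PCA in MLlib is near its practical limit.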


Re: failure to parallelize an RDD

2016-01-12 Thread Ted Yu
Which release of Spark are you using ?

Can you turn on DEBUG logging to see if there is more clue ?

Thanks

On Tue, Jan 12, 2016 at 6:37 PM, AlexG  wrote:

> I transpose a matrix (colChunkOfA) stored as a 200-by-54843210 array of
> rows in Array[Array[Float]] format into another matrix (rowChunk) also
> stored row-wise as a 54843210-by-200 Array[Array[Float]] using the
> following
> code:
>
> val rowChunk = new Array[Tuple2[Int,Array[Float]]](numCols)
> val colIndices = (0 until colChunkOfA.length).toArray
>
> (0 until numCols).foreach( rowIdx => {
>   rowChunk(rowIdx) = Tuple2(rowIdx, colIndices.map(colChunkOfA(_)(rowIdx)))
> })
>
> This succeeds, but the following code which attempts to turn rowChunk into
> an RDD fails silently: spark-submit just ends, and none of the executor
> logs
> indicate any errors occurring.
>
> val parallelRowChunkRDD = sc.parallelize(rowChunk).cache
> parallelRowChunkRDD.count
>
> What is the culprit here?
>
> Here is the log output starting from the count instruction:
>
> 16/01/13 02:23:38 INFO SparkContext: Starting job: count at
> transposeAvroToAvroChunks.scala:129
> 16/01/13 02:23:38 INFO DAGScheduler: Got job 3 (count at
> transposeAvroToAvroChunks.scala:129) with 928 output partitions
> 16/01/13 02:23:38 INFO DAGScheduler: Final stage: ResultStage 3(count at
> transposeAvroToAvroChunks.scala:129)
> 16/01/13 02:23:38 INFO DAGScheduler: Parents of final stage: List()
> 16/01/13 02:23:38 INFO DAGScheduler: Missing parents: List()
> 16/01/13 02:23:38 INFO DAGScheduler: Submitting ResultStage 3
> (ParallelCollectionRDD[2448] at parallelize at
> transposeAvroToAvroChunks.scala:128), which has no missing parents
> 16/01/13 02:23:38 INFO MemoryStore: ensureFreeSpace(1048) called with
> curMem=50917367, maxMem=127452201615
> 16/01/13 02:23:38 INFO MemoryStore: Block broadcast_615 stored as values in
> memory (estimated size 1048.0 B, free 118.7 GB)
> 16/01/13 02:23:38 INFO MemoryStore: ensureFreeSpace(740) called with
> curMem=50918415, maxMem=127452201615
> 16/01/13 02:23:38 INFO MemoryStore: Block broadcast_615_piece0 stored as
> bytes in memory (estimated size 740.0 B, free 118.7 GB)
> 16/01/13 02:23:38 INFO BlockManagerInfo: Added broadcast_615_piece0 in
> memory on 172.31.36.112:36581 (size: 740.0 B, free: 118.7 GB)
> 16/01/13 02:23:38 INFO SparkContext: Created broadcast 615 from broadcast
> at
> DAGScheduler.scala:861
> 16/01/13 02:23:38 INFO DAGScheduler: Submitting 928 missing tasks from
> ResultStage 3 (ParallelCollectionRDD[2448] at parallelize at
> transposeAvroToAvroChunks.scala:128)
> 16/01/13 02:23:38 INFO TaskSchedulerImpl: Adding task set 3.0 with 928
> tasks
> 16/01/13 02:23:39 WARN TaskSetManager: Stage 3 contains a task of very
> large
> size (47027 KB). The maximum recommended task size is 100 KB.
> 16/01/13 02:23:39 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
> 1219, 172.31.34.184, PROCESS_LOCAL, 48156290 bytes)
> ...
> 16/01/13 02:27:13 INFO TaskSetManager: Starting task 927.0 in stage 3.0
> (TID
> 2146, 172.31.42.67, PROCESS_LOCAL, 48224789 bytes)
> 16/01/13 02:27:17 INFO BlockManagerInfo: Removed broadcast_419_piece0 on
> 172.31.36.112:36581 in memory (size: 17.4 KB, free: 118.7 GB)
> 16/01/13 02:27:21 INFO BlockManagerInfo: Removed broadcast_419_piece0 on
> 172.31.35.157:51059 in memory (size: 17.4 KB, free: 10.4 GB)
> 16/01/13 02:27:21 INFO BlockManagerInfo: Removed broadcast_419_piece0 on
> 172.31.47.118:34888 in memory (size: 17.4 KB, free: 10.4 GB)
> 16/01/13 02:27:22 INFO BlockManagerInfo: Removed broadcast_419_piece0 on
> 172.31.38.42:48582 in memory (size: 17.4 KB, free: 10.4 GB)
> 16/01/13 02:27:38 INFO BlockManagerInfo: Added broadcast_615_piece0 in
> memory on 172.31.41.68:59281 (size: 740.0 B, free: 10.4 GB)
> 16/01/13 02:27:55 INFO BlockManagerInfo: Added broadcast_615_piece0 in
> memory on 172.31.47.118:59575 (size: 740.0 B, free: 10.4 GB)
> 16/01/13 02:28:47 INFO BlockManagerInfo: Added broadcast_615_piece0 in
> memory on 172.31.40.24:55643 (size: 740.0 B, free: 10.4 GB)
> 16/01/13 02:28:49 INFO BlockManagerInfo: Added broadcast_615_piece0 in
> memory on 172.31.47.118:53671 (size: 740.0 B, free: 10.4 GB)
>
> This is the end of the log, so it looks like all 928 tasks got started, but
> presumably somewhere in running, they ran into an error. Nothing shows up
> in
> the executor logs.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/failure-to-parallelize-an-RDD-tp25950.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RE: spark job failure - akka error Association with remote system has failed

2016-01-12 Thread vivek.meghanathan
I have used master_ip as an IP address and the Spark conf also has the IP address, but 
the following log shows a hostname. (The Spark UI shows the master details by IP.)


16/01/13 12:31:38 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkDriver@masternode1:36537] has failed, address is now 
gated for [5000] ms. Reason is: [Disassociated].

From: Vivek Meghanathan (WT01 - NEP)
Sent: 13 January 2016 12:18
To: user@spark.apache.org
Subject: spark job failure - akka error Association with remote system has 
failed

Hi All,
I am running Spark 1.3.0 in standalone cluster mode. We have rebooted the cluster 
servers (system reboot). After that, the Spark jobs are failing with the following 
error (each fails within 7-8 seconds). 2 of the jobs are running fine. All the jobs 
used to be stable before the system reboot. We have not changed any default 
configurations other than spark-env.sh, slaves and log4j.properties.

Warning in the master log:

16/01/13 11:58:16 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkDriver@masternode1:41419] has failed, address is now 
gated for [5000] ms. Reason is: [Disassociated].

Regards,
Vivek M
The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments. WARNING: Computer viruses can be transmitted via email. The 
recipient should check this email and any attachments for the presence of 
viruses. The company accepts no liability for any damage caused by any virus 
transmitted by this email. www.wipro.com
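
If the IP/hostname mismatch visible in the akka address is the cause, one thing to try (a guess, not verified against this cluster) is to pin the addresses explicitly: export SPARK_LOCAL_IP in spark-env.sh on each node, and/or set spark.driver.host on the job so executors dial back to an IP rather than a hostname that may resolve differently after the reboot. A sketch with placeholder addresses:

import org.apache.spark.{SparkConf, SparkContext}

// 10.0.0.1 / 10.0.0.2 are placeholders for the real master and driver IPs
val conf = new SparkConf()
  .setAppName("my-job")
  .setMaster("spark://10.0.0.1:7077")     // master by IP, matching what the UI shows
  .set("spark.driver.host", "10.0.0.2")   // address executors use to reach the driver

val sc = new SparkContext(conf)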


RE: Dedup

2016-01-12 Thread gpmacalalad
sowen wrote
> Arrays are not immutable and do not have the equals semantics you want to
> use them as a key.  Use a Scala immutable List.
> On Oct 9, 2014 12:32 PM, "Ge, Yao (Y.)" 

> yge@

>  wrote:
> 
>> Yes. I was using String array as arguments in the reduceByKey. I think
>> String array is actually immutable and simply returning the first
>> argument
>> without cloning one should work. I will look into mapPartitions as we can
>> have up to 40% duplicates. Will follow up on this if necessary. Thanks
>> very
>> much Sean!
>>
>> -Yao
>>
>> -Original Message-
>> From: Sean Owen [mailto:

> sowen@

> ]
>> Sent: Thursday, October 09, 2014 3:04 AM
>> To: Ge, Yao (Y.)
>> Cc: 

> user@.apache

>> Subject: Re: Dedup
>>
>> I think the question is about copying the argument. If it's an immutable
>> value like String, yes just return the first argument and ignore the
>> second. If you're dealing with a notoriously mutable value like a Hadoop
>> Writable, you need to copy the value you return.
>>
>> This works fine although you will spend a fair bit of time marshaling all
>> of those duplicates together just to discard all but one.
>>
>> If there are lots of duplicates, It would take a bit more work, but would
>> be faster, to do something like this: mapPartitions and retain one input
>> value each unique dedup criteria, and then output those pairs, and then
>> reduceByKey the result.
>>
>> On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) 

> yge@

>  wrote:
>> > I need to do deduplication processing in Spark. The current plan is to
>> > generate a tuple where key is the dedup criteria and value is the
>> > original input. I am thinking to use reduceByKey to discard duplicate
>> > values. If I do that, can I simply return the first argument or should
>> > I return a copy of the first argument. Is there are better way to do
>> dedup in Spark?
>> >
>> >
>> >
>> > -Yao
>>

Hi, I'm a bit new to Scala/Spark. We are doing data deduplication. So far I
can handle exact matches for 3M lines of data, but I'm in a dilemma on fuzzy
matching using cosine and Jaro-Winkler. My biggest problem is how to optimize
my method to get matches of 90% similarity or above. I am planning to group
first before matching, but this may result in missing out some important
matches. Can someone help me? Much appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Dedup-tp15967p25951.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
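
Sean's suggestion from the quoted thread (keep one record per key inside each partition, then reduceByKey across partitions) could look roughly like this for the exact-match pass; Record and dedupKey are stand-ins for whatever record type and dedup criteria are actually used:

import org.apache.spark.rdd.RDD

case class Record(id: String, text: String)

// stand-in: however the dedup criteria is derived from a record
def dedupKey(r: Record): String = r.text.trim.toLowerCase

def dedupExact(records: RDD[Record]): RDD[Record] =
  records
    .mapPartitions { iter =>
      // keep only the first record per key within this partition before shuffling
      val seen = scala.collection.mutable.LinkedHashMap.empty[String, Record]
      iter.foreach(r => seen.getOrElseUpdate(dedupKey(r), r))
      seen.iterator
    }
    .reduceByKey((first, _) => first)   // across partitions, keep one survivor per key
    .values

For the fuzzy (cosine / Jaro-Winkler) pass, grouping first is essentially blocking: compare only within blocks keyed by something cheap such as a token prefix, a phonetic code or a length bucket. Using two or three different blocking keys and unioning the candidate pairs is the usual way to limit the matches lost to any single imperfect grouping.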



failure to parallelize an RDD

2016-01-12 Thread AlexG
I transpose a matrix (colChunkOfA) stored as a 200-by-54843210 array of
rows in Array[Array[Float]] format into another matrix (rowChunk) also
stored row-wise as a 54843210-by-200 Array[Array[Float]] using the following
code:

val rowChunk = new Array[Tuple2[Int,Array[Float]]](numCols)
val colIndices = (0 until colChunkOfA.length).toArray

(0 until numCols).foreach( rowIdx => { 
  rowChunk(rowIdx) = Tuple2(rowIdx, colIndices.map(colChunkOfA(_)(rowIdx)))
})   

This succeeds, but the following code which attempts to turn rowChunk into
an RDD fails silently: spark-submit just ends, and none of the executor logs
indicate any errors occurring. 

val parallelRowChunkRDD = sc.parallelize(rowChunk).cache
parallelRowChunkRDD.count

What is the culprit here?

Here is the log output starting from the count instruction:

16/01/13 02:23:38 INFO SparkContext: Starting job: count at
transposeAvroToAvroChunks.scala:129
16/01/13 02:23:38 INFO DAGScheduler: Got job 3 (count at
transposeAvroToAvroChunks.scala:129) with 928 output partitions
16/01/13 02:23:38 INFO DAGScheduler: Final stage: ResultStage 3(count at
transposeAvroToAvroChunks.scala:129)
16/01/13 02:23:38 INFO DAGScheduler: Parents of final stage: List()
16/01/13 02:23:38 INFO DAGScheduler: Missing parents: List()
16/01/13 02:23:38 INFO DAGScheduler: Submitting ResultStage 3
(ParallelCollectionRDD[2448] at parallelize at
transposeAvroToAvroChunks.scala:128), which has no missing parents
16/01/13 02:23:38 INFO MemoryStore: ensureFreeSpace(1048) called with
curMem=50917367, maxMem=127452201615
16/01/13 02:23:38 INFO MemoryStore: Block broadcast_615 stored as values in
memory (estimated size 1048.0 B, free 118.7 GB)
16/01/13 02:23:38 INFO MemoryStore: ensureFreeSpace(740) called with
curMem=50918415, maxMem=127452201615
16/01/13 02:23:38 INFO MemoryStore: Block broadcast_615_piece0 stored as
bytes in memory (estimated size 740.0 B, free 118.7 GB)
16/01/13 02:23:38 INFO BlockManagerInfo: Added broadcast_615_piece0 in
memory on 172.31.36.112:36581 (size: 740.0 B, free: 118.7 GB)
16/01/13 02:23:38 INFO SparkContext: Created broadcast 615 from broadcast at
DAGScheduler.scala:861
16/01/13 02:23:38 INFO DAGScheduler: Submitting 928 missing tasks from
ResultStage 3 (ParallelCollectionRDD[2448] at parallelize at
transposeAvroToAvroChunks.scala:128)
16/01/13 02:23:38 INFO TaskSchedulerImpl: Adding task set 3.0 with 928 tasks
16/01/13 02:23:39 WARN TaskSetManager: Stage 3 contains a task of very large
size (47027 KB). The maximum recommended task size is 100 KB.
16/01/13 02:23:39 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
1219, 172.31.34.184, PROCESS_LOCAL, 48156290 bytes)
...
16/01/13 02:27:13 INFO TaskSetManager: Starting task 927.0 in stage 3.0 (TID
2146, 172.31.42.67, PROCESS_LOCAL, 48224789 bytes)
16/01/13 02:27:17 INFO BlockManagerInfo: Removed broadcast_419_piece0 on
172.31.36.112:36581 in memory (size: 17.4 KB, free: 118.7 GB)
16/01/13 02:27:21 INFO BlockManagerInfo: Removed broadcast_419_piece0 on
172.31.35.157:51059 in memory (size: 17.4 KB, free: 10.4 GB)
16/01/13 02:27:21 INFO BlockManagerInfo: Removed broadcast_419_piece0 on
172.31.47.118:34888 in memory (size: 17.4 KB, free: 10.4 GB)
16/01/13 02:27:22 INFO BlockManagerInfo: Removed broadcast_419_piece0 on
172.31.38.42:48582 in memory (size: 17.4 KB, free: 10.4 GB)
16/01/13 02:27:38 INFO BlockManagerInfo: Added broadcast_615_piece0 in
memory on 172.31.41.68:59281 (size: 740.0 B, free: 10.4 GB)
16/01/13 02:27:55 INFO BlockManagerInfo: Added broadcast_615_piece0 in
memory on 172.31.47.118:59575 (size: 740.0 B, free: 10.4 GB)
16/01/13 02:28:47 INFO BlockManagerInfo: Added broadcast_615_piece0 in
memory on 172.31.40.24:55643 (size: 740.0 B, free: 10.4 GB)
16/01/13 02:28:49 INFO BlockManagerInfo: Added broadcast_615_piece0 in
memory on 172.31.47.118:53671 (size: 740.0 B, free: 10.4 GB)
 
This is the end of the log, so it looks like all 928 tasks got started, but
presumably somewhere in running, they ran into an error. Nothing shows up in
the executor logs.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/failure-to-parallelize-an-RDD-tp25950.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
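
A rough reading of the log: sc.parallelize embeds the local collection in the task descriptions, so 928 tasks of ~47 MB each add up to roughly 54,843,210 rows x 200 floats x 4 bytes ~= 44 GB that the driver must serialize and push through the scheduler. That matches the TaskSetManager warning, and when the driver stalls or dies under that load nothing shows up in the executor logs because no executor ever fails. A sketch of an alternative that never materializes the matrix on the driver; it assumes the 200-row column chunk can be produced as an RDD straight from the Avro source rather than as a local Array[Array[Float]]:

import org.apache.spark.rdd.RDD

// Distributed transpose. colChunk holds (originalRowIndex, rowValues) pairs,
// i.e. the 200-by-54,843,210 chunk as an RDD instead of a driver-local array.
def transposeChunk(colChunk: RDD[(Int, Array[Float])],
                   outputPartitions: Int): RDD[(Int, Array[Float])] = {
  colChunk
    .flatMap { case (origRow, values) =>
      // one record per matrix cell, keyed by output row (the original column)
      values.iterator.zipWithIndex.map { case (v, origCol) => (origCol, (origRow, v)) }
    }
    .groupByKey(outputPartitions)
    .mapValues(cells => cells.toArray.sortBy(_._1).map(_._2))
}

The flatMap emits one record per matrix cell (about 11 billion of them), so this is shuffle-heavy, but the work stays on the executors instead of being funneled through the driver.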



hiveContext.sql() - Join query fails silently

2016-01-12 Thread Jins George

Hi All,

I have used hiveContext.sql() to join a temporary table created from a 
DataFrame with Parquet tables created in Hive.


The join query runs fine for a few hours and then suddenly fails to do the 
join. Once the issue happens, the DataFrame returned from 
hiveContext.sql() is empty. If I restart the job, things start working 
again.


If anyone has faced this type of issue, please suggest

I am using Spark 1.5.1, single node in local mode, and Hive 1.1.0.

Thanks,
Jins George


ml.classification.NaiveBayesModel how to reshape theta

2016-01-12 Thread Andy Davidson
I am trying to debug my trained model by exploring theta.
Theta is a Matrix. The Javadoc for Matrix says that it is in column-major
format.

I have trained a NaiveBayesModel. Is the number of classes equal to the number
of rows?

int numRows = nbModel.numClasses();

int numColumns = nbModel.numFeatures();



Kind regards



Andy
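
For what it's worth, a small Scala sketch of walking theta, under the assumption that the spark.ml NaiveBayesModel lays it out as numClasses rows by numFeatures columns; Matrix.apply(i, j) takes care of the column-major storage, so no manual reshaping should be needed:

// nbModel: org.apache.spark.ml.classification.NaiveBayesModel, assumed already trained
val theta = nbModel.theta            // numClasses x numFeatures, log of the conditional probabilities
val numRows = nbModel.numClasses     // so yes, rows should correspond to classes
val numCols = nbModel.numFeatures

for (c <- 0 until numRows) {
  val row = (0 until numCols).map(f => theta(c, f))   // theta(c, f) is apply(row, col)
  println(s"class $c: first 5 theta values = ${row.take(5).mkString(", ")}")
}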




Re:

2016-01-12 Thread Sabarish Sasidharan
You could generate as many duplicates as you need, each with a tag/sequence. Then use a
custom partitioner that uses that tag/sequence in addition to the key to do
the partitioning.

Regards
Sab
On 12-Jan-2016 12:21 am, "Daniel Imberman" 
wrote:

> Hi all,
>
> I'm looking for a way to efficiently partition an RDD, but allow the same
> data to exists on multiple partitions.
>
>
> Lets say I have a key-value RDD with keys {1,2,3,4}
>
> I want to be able to repartition the RDD so that the partitions look
> like
>
> p1 = {1,2}
> p2 = {2,3}
> p3 = {3,4}
>
> Locality is important in this situation as I would be doing internal
> comparisons of values.
>
> Does anyone have any thoughts as to how I could go about doing this?
>
> Thank you
>
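
A rough sketch of Sab's suggestion: duplicate each record once per bucket it should appear in, key it by a (bucket tag, original key) pair, and partition on the tag. The overlap scheme below (key k goes to buckets k-1 and k, giving partitions like {1,2}, {2,3}, {3,4}) is only an illustration, and sc is assumed to be an existing SparkContext:

import org.apache.spark.Partitioner

// route each record to the partition named by its bucket tag
class TagPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = key match {
    case (tag: Int, _) => ((tag % parts) + parts) % parts
  }
}

val data = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))

val overlapped = data
  .flatMap { case (k, v) => Seq(((k - 1, k), v), ((k, k), v)) }   // two copies, two buckets
  .partitionBy(new TagPartitioner(5))

// the internal comparisons can then run per partition, e.g. with overlapped.mapPartitions(...)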


spark job failure - akka error Association with remote system has failed

2016-01-12 Thread vivek.meghanathan
Hi All,
I am running Spark 1.3.0 in standalone cluster mode. We have rebooted the cluster 
servers (system reboot). After that, the Spark jobs are failing with the following 
error (each fails within 7-8 seconds). 2 of the jobs are running fine. All the jobs 
used to be stable before the system reboot. We have not changed any default 
configurations other than spark-env.sh, slaves and log4j.properties.

Warning in the master log:

16/01/13 11:58:16 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkDriver@masternode1:41419] has failed, address is now 
gated for [5000] ms. Reason is: [Disassociated].

Regards,
Vivek M
The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments. WARNING: Computer viruses can be transmitted via email. The 
recipient should check this email and any attachments for the presence of 
viruses. The company accepts no liability for any damage caused by any virus 
transmitted by this email. www.wipro.com


Serializing DataSets

2016-01-12 Thread Simon Hafner
What's the proper way to write DataSets to disk? Convert them to a
DataFrame and use the writers there?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
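
Assuming Spark 1.6, where Dataset does not yet expose a writer of its own, going through a DataFrame is the usual route. A minimal sketch from a shell session (Person and the path are placeholders, sqlContext is the usual shell-provided SQLContext):

import sqlContext.implicits._

case class Person(name: String, age: Int)

val ds = Seq(Person("ann", 30), Person("bob", 25)).toDS()

// write via the DataFrame writer, then read back and recover the Dataset with as[T]
ds.toDF().write.parquet("/tmp/people.parquet")
val ds2 = sqlContext.read.parquet("/tmp/people.parquet").as[Person]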



Re: failure to parallelize an RDD

2016-01-12 Thread Alex Gittens
I'm using Spark 1.5.1.

When I turned on DEBUG, I didn't see anything that looked useful. Other than
the INFO output, there is a ton of RPC-message-related logging, plus this bit:

16/01/13 05:53:43 DEBUG ClosureCleaner: +++ Cleaning closure 
(org.apache.spark.rdd.RDD$$anonfun$count$1) +++
16/01/13 05:53:43 DEBUG ClosureCleaner:  + declared fields: 1
16/01/13 05:53:43 DEBUG ClosureCleaner:  public static final long
org.apache.spark.rdd.RDD$$anonfun$count$1.serialVersionUID
16/01/13 05:53:43 DEBUG ClosureCleaner:  + declared methods: 2
16/01/13 05:53:43 DEBUG ClosureCleaner:  public final java.lang.Object
org.apache.spark.rdd.RDD$$anonfun$count$1.apply(java.lang.Object)
16/01/13 05:53:43 DEBUG ClosureCleaner:  public final long
org.apache.spark.rdd.RDD$$anonfun$count$1.apply(scala.collection.Iterator)
16/01/13 05:53:43 DEBUG ClosureCleaner:  + inner classes: 0
16/01/13 05:53:43 DEBUG ClosureCleaner:  + outer classes: 0
16/01/13 05:53:43 DEBUG ClosureCleaner:  + outer objects: 0
16/01/13 05:53:43 DEBUG ClosureCleaner:  + populating accessed fields
because this is the starting closure
16/01/13 05:53:43 DEBUG ClosureCleaner:  + fields accessed by starting
closure: 0
16/01/13 05:53:43 DEBUG ClosureCleaner:  + there are no enclosing objects!
16/01/13 05:53:43 DEBUG ClosureCleaner:  +++ closure 
(org.apache.spark.rdd.RDD$$anonfun$count$1) is now cleaned +++
16/01/13 05:53:43 DEBUG ClosureCleaner: +++ Cleaning closure 
(org.apache.spark.SparkContext$$anonfun$runJob$5) +++
16/01/13 05:53:43 DEBUG ClosureCleaner:  + declared fields: 2
16/01/13 05:53:43 DEBUG ClosureCleaner:  public static final long
org.apache.spark.SparkContext$$anonfun$runJob$5.serialVersionUID
16/01/13 05:53:43 DEBUG ClosureCleaner:  private final scala.Function1
org.apache.spark.SparkContext$$anonfun$runJob$5.cleanedFunc$1
16/01/13 05:53:43 DEBUG ClosureCleaner:  + declared methods: 2
16/01/13 05:53:43 DEBUG ClosureCleaner:  public final java.lang.Object
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(java.lang.Object,java.lang.Object)
16/01/13 05:53:43 DEBUG ClosureCleaner:  public final java.lang.Object
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(org.apache.spark.TaskContext,scala.collection.Iterator)
16/01/13 05:53:43 DEBUG ClosureCleaner:  + inner classes: 0
16/01/13 05:53:43 DEBUG ClosureCleaner:  + outer classes: 0
16/01/13 05:53:43 DEBUG ClosureCleaner:  + outer objects: 0
16/01/13 05:53:43 DEBUG ClosureCleaner:  + populating accessed fields
because this is the starting closure
16/01/13 05:53:43 DEBUG ClosureCleaner:  + fields accessed by starting
closure: 0
16/01/13 05:53:43 DEBUG ClosureCleaner:  + there are no enclosing objects!
16/01/13 05:53:43 DEBUG ClosureCleaner:  +++ closure 
(org.apache.spark.SparkContext$$anonfun$runJob$5) is now cleaned +++
16/01/13 05:53:43 INFO SparkContext: Starting job: count at
transposeAvroToAvroChunks.scala:129
16/01/13 05:53:43 INFO DAGScheduler: Got job 3 (count at
transposeAvroToAvroChunks.scala:129) with 928 output partitions
16/01/13 05:53:43 INFO DAGScheduler: Final stage: ResultStage 3(count at
transposeAvroToAvroChunks.scala:129)


On Tue, Jan 12, 2016 at 6:41 PM, Ted Yu  wrote:

> Which release of Spark are you using ?
>
> Can you turn on DEBUG logging to see if there is more clue ?
>
> Thanks
>
> On Tue, Jan 12, 2016 at 6:37 PM, AlexG  wrote:
>
>> I transpose a matrix (colChunkOfA) stored as a 200-by-54843210 array of
>> rows in Array[Array[Float]] format into another matrix (rowChunk) also
>> stored row-wise as a 54843210-by-200 Array[Array[Float]] using the
>> following
>> code:
>>
>> val rowChunk = new Array[Tuple2[Int,Array[Float]]](numCols)
>> val colIndices = (0 until colChunkOfA.length).toArray
>>
>> (0 until numCols).foreach( rowIdx => {
>>   rowChunk(rowIdx) = Tuple2(rowIdx,
>> colIndices.map(colChunkOfA(_)(rowIdx)))
>> })
>>
>> This succeeds, but the following code which attempts to turn rowChunk into
>> an RDD fails silently: spark-submit just ends, and none of the executor
>> logs
>> indicate any errors occurring.
>>
>> val parallelRowChunkRDD = sc.parallelize(rowChunk).cache
>> parallelRowChunkRDD.count
>>
>> What is the culprit here?
>>
>> Here is the log output starting from the count instruction:
>>
>> 16/01/13 02:23:38 INFO SparkContext: Starting job: count at
>> transposeAvroToAvroChunks.scala:129
>> 16/01/13 02:23:38 INFO DAGScheduler: Got job 3 (count at
>> transposeAvroToAvroChunks.scala:129) with 928 output partitions
>> 16/01/13 02:23:38 INFO DAGScheduler: Final stage: ResultStage 3(count at
>> transposeAvroToAvroChunks.scala:129)
>> 16/01/13 02:23:38 INFO DAGScheduler: Parents of final stage: List()
>> 16/01/13 02:23:38 INFO DAGScheduler: Missing parents: List()
>> 16/01/13 02:23:38 INFO DAGScheduler: Submitting ResultStage 3
>> (ParallelCollectionRDD[2448] at parallelize at
>> transposeAvroToAvroChunks.scala:128), which has no missing parents
>> 16/01/13 02:23:38 INFO MemoryStore: 

Re: JMXSink for YARN deployment

2016-01-12 Thread Kyle Lin
Hello there


I run both the driver and the master on the same node, so I got a "Port already in
use" exception.

Is there any way to set a different port for each component?

Kyle


2015-12-05 5:54 GMT+08:00 spearson23 :

> Run "spark-submit --help" to see all available options.
>
> To get JMX to work you need to:
>
> spark-submit --driver-java-options "-Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.port=JMX_PORT" --conf
> spark.metrics.conf=metrics.properties --class 'CLASS_NAME' --master
> yarn-cluster --files /PATH/TO/metrics.properties /PATH/TO/JAR.FILE
>
>
> This will run JMX on the driver node on "JMX_PORT". Note that the driver
> node and the YARN master node are not the same; you'll have to look where
> Spark put the driver and then connect there.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/JMXSink-for-YARN-deployment-tp13958p25572.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: sparkR ORC support.

2016-01-12 Thread Sandeep Khurana
The code is very simple, pasted below.
hive-site.xml is in spark conf already. I still see this error

Error in writeJobj(con, object) : invalid jobj 3

after running the script below.


script
===
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")


.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

sc <<- sparkR.init()
sc <<- sparkRHive.init()
hivecontext <<- sparkRHive.init(sc)
df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")
#View(df)


On Wed, Jan 6, 2016 at 11:08 PM, Felix Cheung 
wrote:

> Yes, as Yanbo suggested, it looks like there is something wrong with the
> sqlContext.
>
> Could you forward us your code please?
>
>
>
>
>
> On Wed, Jan 6, 2016 at 5:52 AM -0800, "Yanbo Liang" 
> wrote:
>
> You should ensure your sqlContext is HiveContext.
>
> sc <- sparkR.init()
>
> sqlContext <- sparkRHive.init(sc)
>
>
> 2016-01-06 20:35 GMT+08:00 Sandeep Khurana :
>
> Felix
>
> I tried the option suggested by you.  It gave below error.  I am going to
> try the option suggested by Prem .
>
> Error in writeJobj(con, object) : invalid jobj 1
> 8
> stop("invalid jobj ", value$id)
> 7
> writeJobj(con, object)
> 6
> writeObject(con, a)
> 5
> writeArgs(rc, args)
> 4
> invokeJava(isStatic = TRUE, className, methodName, ...)
> 3
> callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
> source, options)
> 2
> read.df(sqlContext, filepath, "orc") at
> spark_api.R#108
>
> On Wed, Jan 6, 2016 at 10:30 AM, Felix Cheung 
> wrote:
>
> Firstly I don't have ORC data to verify but this should work:
>
> df <- loadDF(sqlContext, "data/path", "orc")
>
> Secondly, could you check if sparkR.stop() was called? sparkRHive.init()
> should be called after sparkR.init() - please check if there is any error
> message there.
>
> _
> From: Prem Sure 
> Sent: Tuesday, January 5, 2016 8:12 AM
> Subject: Re: sparkR ORC support.
> To: Sandeep Khurana 
> Cc: spark users , Deepak Sharma <
> deepakmc...@gmail.com>
>
>
>
> Yes Sandeep, also copy hive-site.xml too to spark conf directory.
>
>
> On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana 
> wrote:
>
> Also, do I need to setup hive in spark as per the link
> http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark
> ?
>
> We might need to copy hdfs-site.xml file to spark conf directory ?
>
> On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
> wrote:
>
> Deepak
>
> Tried this. Getting this error now
>
> Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :   unused 
> argument ("")
>
>
> On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 
> wrote:
>
> Hi Sandeep
> can you try this ?
>
> results <- sql(hivecontext, "FROM test SELECT id","")
>
> Thanks
> Deepak
>
>
> On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
> wrote:
>
> Thanks Deepak.
>
> I tried this as well. I created a hivecontext   with  "hivecontext <<-
> sparkRHive.init(sc) "  .
>
> When I tried to read hive table from this ,
>
> results <- sql(hivecontext, "FROM test SELECT id")
>
> I get below error,
>
> Error in callJMethod(sqlContext, "sql", sqlQuery) :   Invalid jobj 2. If 
> SparkR was restarted, Spark operations need to be re-executed.
>
>
> Not sure what is causing this? Any leads or ideas? I am using rstudio.
>
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
> wrote:
>
> Hi Sandeep
> I am not sure if ORC can be read directly in R.
> But there can be a workaround. First create a hive table on top of the ORC files
> and then access the hive table in R.
>
> Thanks
> Deepak
>
> On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana 
> wrote:
>
> Hello
>
> I need to read an ORC files in hdfs in R using spark. I am not able to
> find a package to do that.
>
> Can anyone help with documentation or example for this purpose?
>
> --
> Architect
> Infoworks.io 
> http://Infoworks.io
>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
>
>
>
> --
> Architect
> Infoworks.io 
> http://Infoworks.io
>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
>
>
>
> --
> Architect
> Infoworks.io 
> http://Infoworks.io
>
>
>
>
> --
> Architect
> Infoworks.io 
> http://Infoworks.io
>
>
>
>
>
>
>
> --
> Architect
> Infoworks.io
> http://Infoworks.io
>
>
>


-- 
Architect
Infoworks.io
http://Infoworks.io


Re: sparkR ORC support.

2016-01-12 Thread Sandeep Khurana
The complete stacktrace is below. Could it be something with Java versions?


stop("invalid jobj ", value$id)
8
writeJobj(con, object)
7
writeObject(con, a)
6
writeArgs(rc, args)
5
invokeJava(isStatic = TRUE, className, methodName, ...)
4
callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
source, options)
3
read.df(sqlContext, path, source, schema, ...)
2
loadDF(hivecontext, filepath, "orc")

On Tue, Jan 12, 2016 at 2:41 PM, Sandeep Khurana 
wrote:

> Running this gave
>
> 16/01/12 04:06:54 INFO BlockManagerMaster: Registered BlockManager
> Error in writeJobj(con, object) : invalid jobj 3
>
>
> How does it know which hive schema to connect to?
>
>
>
> On Tue, Jan 12, 2016 at 2:34 PM, Felix Cheung 
> wrote:
>
>> It looks like you have overwritten sc. Could you try this:
>>
>>
>> Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
>>
>> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
>> library(SparkR)
>>
>> sc <- sparkR.init()
>> hivecontext <- sparkRHive.init(sc)
>> df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")
>>
>>
>>
>> --
>> Date: Tue, 12 Jan 2016 14:28:58 +0530
>> Subject: Re: sparkR ORC support.
>> From: sand...@infoworks.io
>> To: felixcheun...@hotmail.com
>> CC: yblia...@gmail.com; user@spark.apache.org; premsure...@gmail.com;
>> deepakmc...@gmail.com
>>
>>
>> The code is very simple, pasted below .
>> hive-site.xml is in spark conf already. I still see this error
>>
>> Error in writeJobj(con, object) : invalid jobj 3
>>
>> after running the script  below
>>
>>
>> script
>> ===
>> Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
>>
>>
>> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
>> library(SparkR)
>>
>> sc <<- sparkR.init()
>> sc <<- sparkRHive.init()
>> hivecontext <<- sparkRHive.init(sc)
>> df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")
>> #View(df)
>>
>>
>> On Wed, Jan 6, 2016 at 11:08 PM, Felix Cheung 
>> wrote:
>>
>> Yes, as Yanbo suggested, it looks like there is something wrong with the
>> sqlContext.
>>
>> Could you forward us your code please?
>>
>>
>>
>>
>>
>> On Wed, Jan 6, 2016 at 5:52 AM -0800, "Yanbo Liang" 
>> wrote:
>>
>> You should ensure your sqlContext is HiveContext.
>>
>> sc <- sparkR.init()
>>
>> sqlContext <- sparkRHive.init(sc)
>>
>>
>> 2016-01-06 20:35 GMT+08:00 Sandeep Khurana :
>>
>> Felix
>>
>> I tried the option suggested by you.  It gave below error.  I am going to
>> try the option suggested by Prem .
>>
>> Error in writeJobj(con, object) : invalid jobj 1
>> 8
>> stop("invalid jobj ", value$id)
>> 7
>> writeJobj(con, object)
>> 6
>> writeObject(con, a)
>> 5
>> writeArgs(rc, args)
>> 4
>> invokeJava(isStatic = TRUE, className, methodName, ...)
>> 3
>> callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
>> source, options)
>> 2
>> read.df(sqlContext, filepath, "orc") at
>> spark_api.R#108
>>
>> On Wed, Jan 6, 2016 at 10:30 AM, Felix Cheung 
>> wrote:
>>
>> Firstly I don't have ORC data to verify but this should work:
>>
>> df <- loadDF(sqlContext, "data/path", "orc")
>>
>> Secondly, could you check if sparkR.stop() was called? sparkRHive.init()
>> should be called after sparkR.init() - please check if there is any error
>> message there.
>>
>> _
>> From: Prem Sure 
>> Sent: Tuesday, January 5, 2016 8:12 AM
>> Subject: Re: sparkR ORC support.
>> To: Sandeep Khurana 
>> Cc: spark users , Deepak Sharma <
>> deepakmc...@gmail.com>
>>
>>
>>
>> Yes Sandeep, also copy hive-site.xml too to spark conf directory.
>>
>>
>> On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana 
>> wrote:
>>
>> Also, do I need to setup hive in spark as per the link
>> http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark
>> ?
>>
>> We might need to copy hdfs-site.xml file to spark conf directory ?
>>
>> On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
>> wrote:
>>
>> Deepak
>>
>> Tried this. Getting this error now
>>
>> Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :   
>> unused argument ("")
>>
>>
>> On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 
>> wrote:
>>
>> Hi Sandeep
>> can you try this ?
>>
>> results <- sql(hivecontext, "FROM test SELECT id","")
>>
>> Thanks
>> Deepak
>>
>>
>> On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
>> wrote:
>>
>> Thanks Deepak.
>>
>> I tried this as well. I created a hivecontext   with  "hivecontext <<-
>> sparkRHive.init(sc) "  .
>>
>> When I tried to read hive table from this ,
>>
>> results <- sql(hivecontext, "FROM test SELECT id")
>>
>> I get below error,
>>
>> Error in callJMethod(sqlContext, "sql", sqlQuery) :   Invalid jobj 2. If 
>> 

model deployment in spark

2016-01-12 Thread Chandan Verma
Does spark support model deployment through web/rest api. Is there any other 
method for deployment of a predictive model in spark apart from PMML.



Regards,

Chandan




===
DISCLAIMER:
The information contained in this message (including any attachments) is 
confidential and may be privileged. If you have received it by mistake please 
notify the sender by return e-mail and permanently delete this message and any 
attachments from your system. Any dissemination, use, review, distribution, 
printing or copying of this message in whole or in part is strictly prohibited. 
Please note that e-mails are susceptible to change. CitiusTech shall not be 
liable for the improper or incomplete transmission of the information contained 
in this communication nor for any delay in its receipt or damage to your 
system. CitiusTech does not guarantee that the integrity of this communication 
has been maintained or that this communication is free of viruses, 
interceptions or interferences.



Re: JMXSink for YARN deployment

2016-01-12 Thread Kyle Lin
Hello guys

I got a solution.

I set -Dcom.sun.management.jmxremote.port=0 to let the system assign an unused
port.

Kyle

2016-01-12 16:54 GMT+08:00 Kyle Lin :

> Hello there
>
>
> I run both driver and master on the same node, so I got "Port already in
> use" exception.
>
> Is there any solution to set different port for each component?
>
> Kyle
>
>
> 2015-12-05 5:54 GMT+08:00 spearson23 :
>
>> Run "spark-submit --help" to see all available options.
>>
>> To get JMX to work you need to:
>>
>> spark-submit --driver-java-options "-Dcom.sun.management.jmxremote
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.port=JMX_PORT" --conf
>> spark.metrics.conf=metrics.properties --class 'CLASS_NAME' --master
>> yarn-cluster --files /PATH/TO/metrics.properties /PATH/TO/JAR.FILE
>>
>>
>> This will run JMX on the driver node on "JMX_PORT". Note that the driver
>> node and the YARN master node are not the same; you'll have to look where
>> Spark put the driver and then connect there.
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/JMXSink-for-YARN-deployment-tp13958p25572.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
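
For reference, the metrics.properties shipped with --files in the quoted command also needs a JMX sink entry before anything is exposed over JMX. The stock conf/metrics.properties.template documents the syntax; a minimal version (the sink name "jmx" is arbitrary) would be something like:

# enable the JMX sink for all instances (master, worker, driver, executor)
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink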


Using lz4 in Kafka seems to be broken by jpountz dependency upgrade in Spark 1.5.x+

2016-01-12 Thread Stefan Schadwinkel
Hi all,

we'd like to upgrade one of our Spark jobs from 1.4.1 to 1.5.2 (we run
Spark on Amazon EMR).

The job consumes and pushes lz4 compressed data from/to Kafka.

When upgrading to 1.5.2 everything works fine, except we get the following
exception:

java.lang.NoSuchMethodError: net.jpountz.util.Utils.checkRange([BII)V
at
org.apache.kafka.common.message.KafkaLZ4BlockOutputStream.write(KafkaLZ4BlockOutputStream.java:179)
at java.io.DataOutputStream.writeLong(DataOutputStream.java:224)
at org.apache.kafka.common.record.Compressor.putLong(Compressor.java:132)
at
org.apache.kafka.common.record.MemoryRecords.append(MemoryRecords.java:85)
at
org.apache.kafka.clients.producer.internals.RecordBatch.tryAppend(RecordBatch.java:63)
at
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(RecordAccumulator.java:171)
at
org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:338)


Some digging yields:

- The net.jpountz.lz4/lz4 dependency was upgraded from 1.2.0 to 1.3.0 in
Spark 1.5+ to fix some issues with IBM JDK:
https://issues.apache.org/jira/browse/SPARK-7063

- The net.jpountz.lz4 1.3.0 version refactored net.jpountz.util.Utils to
net.jpountz.util.SafeUtils, thus yielding the above inconsistency:
https://github.com/jpountz/lz4-java/blob/1.3.0/src/java/net/jpountz/util/SafeUtils.java

- Kafka on github up to 0.9.0.0 uses jpountz 1.2.0 (
https://github.com/apache/kafka/blob/0.9.0.0/build.gradle), however, the
latest trunk upgrades to 1.3.0.


Patching Kafka to use SafeUtils is easy, as is creating a JAR for our projects
that includes the correct dependencies; the option of compiling Spark
1.5.x with jpountz 1.2.0 should also work, but I haven't tried it yet.

The main problem is that Spark 1.5.x+ and all Kafka 0.8 releases are
incompatible in regard to lz4 compression and we would like to avoid
provisioning EMR with a custom Spark through bootstrapping due to the
operational overhead.

One could try to play with the classpath and a JAR file with compatible
dependencies, but I was wondering whether anyone else uses Kafka with lz4 and
Spark and has run into the same issue?

Maybe there's also an easier way to reconcile the situation?

BTW: There's a similar issue regarding Druid as well, but no reconciliation
beyond patching Kafka was discussed:
https://groups.google.com/forum/#!topic/druid-user/ZW_Clovf42k

Any input would be highly appreciated.


Best regards,
Stefan


-- 

*Dr. Stefan Schadwinkel*
Senior Big Data Developer
stefan.schadwin...@smaato.com




Smaato Inc.
San Francisco – New York - Hamburg - Singapore
www.smaato.com





Germany:
Valentinskamp 70, Emporio, 19th Floor

20355 Hamburg


T  +49 (40) 3480 949 0
F  +49 (40) 492 19 055



The information contained in this communication may be CONFIDENTIAL and is
intended only for the use of the recipient(s) named above. If you are not
the intended recipient, you are hereby notified that any dissemination,
distribution, or copying of this communication, or any of its contents, is
strictly prohibited. If you have received this communication in error,
please notify the sender and delete/destroy the original message and any
copy of it from your computer or paper files.
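
One approach that might reconcile it without patching Kafka or rebuilding Spark (untested here, and it assumes the job is built as a fat jar with sbt-assembly): bundle lz4 1.2.0 in the application assembly and shade the net.jpountz packages, so the Kafka client inside the assembly keeps using the relocated 1.2.0 classes while Spark's own 1.3.0 stays untouched on the main classpath. In build.sbt:

libraryDependencies += "net.jpountz.lz4" % "lz4" % "1.2.0"

assemblyShadeRules in assembly := Seq(
  // rewrite references to net.jpountz.* inside the fat jar only
  ShadeRule.rename("net.jpountz.**" -> "shadedjpountz.@1").inAll
)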


Re: Fwd: how to submit multiple jar files when using spark-submit script in shell?

2016-01-12 Thread Jim Lohse
Thanks for your answer; you are correct, it's just a different approach 
than the one I am asking about :)


Building an uber- or assembly-jar goes against the idea of placing the 
jars on all workers. Uber-jars increase network traffic, while using local:/ 
in the classpath reduces it.


Eventually, depending on uber-jars can lead to various problems.

Really the question is narrowly geared toward understanding what arguments 
can set up the classpath using the --jars argument. Using an uber-jar is 
a workaround, true, but with downsides.


Thanks!

On 01/12/2016 12:06 AM, UMESH CHAUDHARY wrote:



Could you build a fat jar by including all your dependencies along 
with your application? See here and here.

Also:

"So this application-jar can point to a directory and will be 
expanded? Or needs to be a path to a single specific jar?"

This will be the path to a single specific JAR.

On Tue, Jan 12, 2016 at 12:04 PM, jiml > wrote:


Question is: Looking for all the ways to specify a set of jars
using --jars
on spark-submit

I know this is old but I am about to submit a proposed docs change on
--jars, and I had an issue with --jars today

When this user submitted the following command line, is that a
proper way to
reference a jar?

hdfs://master:8000/srcdata/kmeans  (is that a directory? or a jar that
doesn't end with .jar? I have not gotten into the machine learning
libs yet
to recognize this)

I know the docs say, "Path to a bundled jar including your
application and
all dependencies. The URL must be globally visible inside of your
cluster,
for instance, an hdfs:// path or a file:// path that is present on all
nodes."

So this application-jar can point to a directory and will be
expanded? Or needs to be a path to a single specific jar?

I ask because when I was testing --jars today, we had to
explicitly provide
a path to each jar:

/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData
--jars=local:/usr/local/spark/jars/groovy-all-2.3.3.jar,local:/usr/local/spark/jars/guava-14.0.1.jar,local:/usr/local/spark/jars/jopt-simple-4.6.jar,local:/usr/local/spark/jars/jpsgcs-core-1.0.8-2.jar,local:/usr/local/spark/jars/jpsgcs-pipe-1.0.6-7.jar
/usr/local/spark/jars/thold-0.0.1-1.jar

(The only way I figured out to use the commas was a StackOverflow answer
that led me to look beyond the docs to the command line: spark-submit --help
results in:

 --jars JARS             Comma-separated list of local jars to include
                         on the driver and executor classpaths.


And it seems that we do not need to put the main jar in the --jars argument.
I have not yet tested whether other classes in the application-jar
(/usr/local/spark/jars/thold-0.0.1-1.jar) are shipped to workers, or whether I
need to put the application-jar in the --jars path to get classes other than
the one named by --class to be seen.

Thanks for any ideas




--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/how-to-submit-multiple-jar-files-when-using-spark-submit-script-in-shell-tp16662p25942.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org








Re: [KafkaRDD]: rdd.cache() does not seem to work

2016-01-12 Thread Понькин Алексей
Hi Charles,

I have created a very simplified job - https://github.com/ponkin/KafkaSnapshot - to 
illustrate the problem:
https://github.com/ponkin/KafkaSnapshot/blob/master/src/main/scala/ru/ponkin/KafkaSnapshot.scala

In short, maybe the persist method is working, but not the way I expected.
I thought that Spark would fetch all the data from the Kafka topic once and cache it in 
memory; instead the RDD is recomputed every time I call the saveAsObjectFile method.

-- 
Yandex.Mail - reliable mail
http://mail.yandex.ru/neo2/collect/?exp=1=1


12.01.2016, 10:56, "charles li" :
> cache is just persist with the default storage level, and it is lazy [ not cached 
> yet ] until the first time it is computed.
>
> ​
>
> On Tue, Jan 12, 2016 at 5:13 AM, ponkin  wrote:
>> Hi,
>>
>> Here is my use case :
>> I have a Kafka topic. The job is fairly simple - it reads the topic and saves data 
>> to several HDFS paths.
>> I create rdd with the following code
>>  val r =  
>> KafkaUtils.createRDD[Array[Byte],Array[Byte],DefaultDecoder,DefaultDecoder](context,kafkaParams,range)
>> Then I am trying to cache that rdd with
>>  r.cache()
>> and then save this rdd to several hdfs locations.
>> But it seems that KafkaRDD is fetching data from kafka broker every time I 
>> call saveAsNewAPIHadoopFile.
>>
>> How can I cache data from Kafka in memory?
>>
>> P.S. When I do a repartition it seems to work properly (reads Kafka only 
>> once), but Spark stores the shuffled data locally.
>> Is it possible to keep data in memory?
>>
>> 
>> View this message in context: [KafkaRDD]: rdd.cache() does not seem to work
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
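
A small sketch of the expected pattern, assuming the topic actually fits in the cluster's storage memory: persist with a level that can spill to disk, force materialization once with an action, and only then run the repeated saves. If the Storage tab of the UI shows the RDD only partially cached, the missing partitions are refetched from Kafka on every later action, which would look exactly like the behaviour described above. kafkaRDD stands for the KafkaUtils.createRDD value (r in the quoted code) and the paths are placeholders:

import org.apache.spark.storage.StorageLevel

val cached = kafkaRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)
cached.count()                               // materialize (and cache) exactly once

cached.saveAsObjectFile("/tmp/snapshot-a")   // now served from the cache, not from Kafka
cached.saveAsObjectFile("/tmp/snapshot-b")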



Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-12 Thread Umesh Kacha
Hi Prabhu, thanks for the response. I did the same; the problem is that when I get the
process id using jps or ps -ef, I don't get the user in the very first column. I
see a number in place of the user name, so I can't run jstack on it because of a
permission issue. It gives something like the following:

728852   3553   9833   0   04:30   ?  00:00:00   /bin/bash blabal
On Jan 12, 2016 12:02, "Prabhu Joseph"  wrote:

> Umesh,
>
>   Running task is a thread within the executor process. We need to take
> stack trace for the executor process. The executor will be running in any
> NodeManager machine as a container.
>
>   YARN RM UI running jobs will have the host details where executor is
> running. Login to that NodeManager machine and jps -l will list all java
> processes, jstack -l  will give the stack trace.
>
>
> Thanks,
> Prabhu Joseph
>
> On Mon, Jan 11, 2016 at 7:56 PM, Umesh Kacha 
> wrote:
>
>> Hi Prabhu, thanks for the response. How do I find the pid of a slow-running
>> task? The task is running on a YARN cluster node. When I try to see the pid of a
>> running task using my user, I see some 7-8 digit number instead of the user
>> running the process. Any idea why Spark creates this number instead of
>> displaying the user?
>> On Jan 3, 2016 6:06 AM, "Prabhu Joseph" 
>> wrote:
>>
>>> The attached image just has thread states, and WAITING threads need not
>>> be the issue. We need to take thread stack traces and identify at which
>>> area of code, threads are spending lot of time.
>>>
>>> Use jstack -l  or kill -3 , where pid is the process id of the
>>> executor process. Take jstack stack trace for every 2 seconds and total 1
>>> minute. This will help to identify the code where threads are spending lot
>>> of time and then try to tune.
>>>
>>> Thanks,
>>> Prabhu Joseph
>>>
>>>
>>>
>>> On Sat, Jan 2, 2016 at 1:28 PM, Umesh Kacha 
>>> wrote:
>>>
 Hi thanks I did that and I have attached thread dump images. That was
 the intention of my question asking for help to identify which waiting
 thread is culprit.

 Regards,
 Umesh

 On Sat, Jan 2, 2016 at 8:38 AM, Prabhu Joseph <
 prabhujose.ga...@gmail.com> wrote:

> Take thread dump of Executor process several times in a short time
> period and check what each thread is doing at different times, which will
> help to identify the expensive sections in user code.
>
> Thanks,
> Prabhu Joseph
>
> On Sat, Jan 2, 2016 at 3:28 AM, unk1102  wrote:
>
>> Sorry please see attached waiting thread log
>>
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25851/Screen_Shot_2016-01-02_at_2.jpg
>> >
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25851/Screen_Shot_2016-01-02_at_2.jpg
>> >
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-find-cause-waiting-threads-etc-of-hanging-job-for-7-hours-tp25850p25851.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>


PCA OutOfMemoryError

2016-01-12 Thread Bharath Ravi Kumar
We're running PCA (selecting 100 principal components) on a dataset that
has ~29K columns and is 70G in size stored in ~600 parts on HDFS. The
matrix in question is mostly sparse with tens of columns populate in most
rows, but a few rows with thousands of columns populated. We're running
spark on mesos with driver memory set to 40G and executor memory set to
80G. We're however encountering an out of memory error (included at the end
of the message) regardless of the number of rdd partitions or the degree of
task parallelism being set. I noticed a warning at the beginning of the PCA
computation stage: " WARN
org.apache.spark.mllib.linalg.distributed.RowMatrix: 29604 columns will
require at least 7011 megabyte  of memory!"
I don't understand which memory this refers to. Is this the executor
memory?  The driver memory? Any other?
The stacktrace appears to indicate that a large array is probably being
passed along with the task. Could this array have been passed as a
broadcast variable instead ? Any suggestions / workarounds other than
re-implementing the algorithm?

Thanks,
Bharath



Exception in thread "main" java.lang.OutOfMemoryError: Requested array size
exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:2271)
at
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2030)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
at
org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1100)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1091)
at
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:124)
at
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:350)
at
org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:386)
at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:46)
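
For context, a minimal sketch of the call pattern that reaches computeGramianMatrix (the HDFS path, the sparse input format and the parser below are hypothetical, not the original job):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Hypothetical input: each line holds space-separated "index:value" pairs of one sparse row.
val rows = sc.textFile("hdfs:///data/matrix").map { line =>
  val entries = line.split(" ").map(_.split(":")).map(a => (a(0).toInt, a(1).toDouble))
  Vectors.sparse(29604, entries)
}
val mat = new RowMatrix(rows)
// computePrincipalComponents first aggregates a dense Gramian of the full 29604 x 29604 matrix,
// which the aggregation tasks and the driver must hold; that is what the "7011 megabyte" warning refers to.
val pc = mat.computePrincipalComponents(100)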


Does spark restart the executors if its nodemanager crashes?

2016-01-12 Thread Bing Jiang
hi, guys.
We have set up the dynamic allocation resource on spark-yarn. Now we use
spark 1.5.
One executor tries to fetch data from another nodemanager's shuffle
service, and the nodemanager crashes, which makes the executor stall in that
state until the crashed nodemanager has been launched again.

I just want to know whether Spark will resubmit the completed tasks if the
later tasks that are executing cannot find their output?

Thanks for any explanation.

-- 
Bing Jiang


RE: sparkR ORC support.

2016-01-12 Thread Felix Cheung
It looks like you have overwritten sc. Could you try this:
 
 
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init()
hivecontext <- sparkRHive.init(sc)
df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")

 
Date: Tue, 12 Jan 2016 14:28:58 +0530
Subject: Re: sparkR ORC support.
From: sand...@infoworks.io
To: felixcheun...@hotmail.com
CC: yblia...@gmail.com; user@spark.apache.org; premsure...@gmail.com; 
deepakmc...@gmail.com

The code is very simple, pasted below. hive-site.xml is in spark conf already. I still see this error

Error in writeJobj(con, object) : invalid jobj 3

after running the script below

script
===
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")

.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

sc <<- sparkR.init()
sc <<- sparkRHive.init()
hivecontext <<- sparkRHive.init(sc)
df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")
#View(df)

On Wed, Jan 6, 2016 at 11:08 PM, Felix Cheung  wrote:





Yes, as Yanbo suggested, it looks like there is something wrong with the 
sqlContext.



Could you forward us your code please?













On Wed, Jan 6, 2016 at 5:52 AM -0800, "Yanbo Liang" 
 wrote:





You should ensure your sqlContext is HiveContext.

sc <- sparkR.init()
sqlContext <- sparkRHive.init(sc)




2016-01-06 20:35 GMT+08:00 Sandeep Khurana 
:


Felix



I tried the option suggested by you.  It gave below error.  I am going to try 
the option suggested by Prem .





Error in writeJobj(con, object) : invalid jobj 1




8

stop("invalid jobj ", value$id)




7

writeJobj(con, object)




6

writeObject(con, a)




5

writeArgs(rc, args)




4

invokeJava(isStatic = TRUE, className, methodName, ...)




3

callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext, 
source, options)




2

read.df(sqlContext, filepath, "orc") at
spark_api.R#108








On Wed, Jan 6, 2016 at 10:30 AM, Felix Cheung 
 wrote:



Firstly I don't have ORC data to verify but this should work:



df <- loadDF(sqlContext, "data/path", "orc")



Secondly, could you check if sparkR.stop() was called? sparkRHive.init() should 
be called after sparkR.init() - please check if there is any error message 
there.






_

From: Prem Sure 

Sent: Tuesday, January 5, 2016 8:12 AM

Subject: Re: sparkR ORC support.

To: Sandeep Khurana 

Cc: spark users , Deepak Sharma 







Yes Sandeep, also copy hive-site.xml too to spark conf directory. 






On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana 
 wrote:



Also, do I need to setup hive in spark as per the link  
http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark ?



We might need to copy hdfs-site.xml file to spark conf directory ? 





On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
 wrote:



Deepak



Tried this. Getting this error now 

Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :   unused 
argument ("")






On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 
 wrote:




Hi Sandeep 
can you try this ? 




results <- sql(hivecontext, "FROM test SELECT id","") 



Thanks 

Deepak 









On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
 wrote:



Thanks Deepak.



I tried this as well. I created a hivecontext   with  "hivecontext <<- 
sparkRHive.init(sc) "  .




When I tried to read hive table from this ,  



results <- sql(hivecontext, "FROM test SELECT id") 



I get below error,  




Error in callJMethod(sqlContext, "sql", sqlQuery) :   Invalid jobj 2. If SparkR 
was restarted, Spark operations need to be re-executed.


Not sure what is causing this? Any leads or ideas? I am using rstudio. 








On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
 wrote:




Hi Sandeep 
I am not sure if ORC can be read directly in R. 
But there can be a workaround .First create hive table on top of ORC files and 
then access hive table in R.




Thanks 
Deepak 





On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana 
 wrote:



Hello



I need to read ORC files in HDFS in R using Spark. I am not able to find a 
package to do that. 




Can anyone help with documentation or example for this purpose? 




-- 




Architect 


Infoworks.io 


http://Infoworks.io 















-- 


Thanks 

Deepak 

www.bigdatabig.com 

www.keosha.net 











-- 




Architect 


Infoworks.io 


http://Infoworks.io 














-- 


Thanks 

Deepak 

www.bigdatabig.com 

www.keosha.net 













-- 




Architect 


Infoworks.io 


http://Infoworks.io 















-- 




Architect 


Infoworks.io 



Re: sparkR ORC support.

2016-01-12 Thread Sandeep Khurana
Running this gave

16/01/12 04:06:54 INFO BlockManagerMaster: Registered BlockManager
Error in writeJobj(con, object) : invalid jobj 3


How does it know which hive schema to connect to?



On Tue, Jan 12, 2016 at 2:34 PM, Felix Cheung 
wrote:

> It looks like you have overwritten sc. Could you try this:
>
>
> Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
>
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
>
> sc <- sparkR.init()
> hivecontext <- sparkRHive.init(sc)
> df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")
>
>
>
> --
> Date: Tue, 12 Jan 2016 14:28:58 +0530
> Subject: Re: sparkR ORC support.
> From: sand...@infoworks.io
> To: felixcheun...@hotmail.com
> CC: yblia...@gmail.com; user@spark.apache.org; premsure...@gmail.com;
> deepakmc...@gmail.com
>
>
> The code is very simple, pasted below .
> hive-site.xml is in spark conf already. I still see this error
>
> Error in writeJobj(con, object) : invalid jobj 3
>
> after running the script  below
>
>
> script
> ===
> Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
>
>
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
>
> sc <<- sparkR.init()
> sc <<- sparkRHive.init()
> hivecontext <<- sparkRHive.init(sc)
> df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")
> #View(df)
>
>
> On Wed, Jan 6, 2016 at 11:08 PM, Felix Cheung 
> wrote:
>
> Yes, as Yanbo suggested, it looks like there is something wrong with the
> sqlContext.
>
> Could you forward us your code please?
>
>
>
>
>
> On Wed, Jan 6, 2016 at 5:52 AM -0800, "Yanbo Liang" 
> wrote:
>
> You should ensure your sqlContext is HiveContext.
>
> sc <- sparkR.init()
>
> sqlContext <- sparkRHive.init(sc)
>
>
> 2016-01-06 20:35 GMT+08:00 Sandeep Khurana :
>
> Felix
>
> I tried the option suggested by you.  It gave below error.  I am going to
> try the option suggested by Prem .
>
> Error in writeJobj(con, object) : invalid jobj 1
> 8
> stop("invalid jobj ", value$id)
> 7
> writeJobj(con, object)
> 6
> writeObject(con, a)
> 5
> writeArgs(rc, args)
> 4
> invokeJava(isStatic = TRUE, className, methodName, ...)
> 3
> callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
> source, options)
> 2
> read.df(sqlContext, filepath, "orc") at
> spark_api.R#108
>
> On Wed, Jan 6, 2016 at 10:30 AM, Felix Cheung 
> wrote:
>
> Firstly I don't have ORC data to verify but this should work:
>
> df <- loadDF(sqlContext, "data/path", "orc")
>
> Secondly, could you check if sparkR.stop() was called? sparkRHive.init()
> should be called after sparkR.init() - please check if there is any error
> message there.
>
> _
> From: Prem Sure 
> Sent: Tuesday, January 5, 2016 8:12 AM
> Subject: Re: sparkR ORC support.
> To: Sandeep Khurana 
> Cc: spark users , Deepak Sharma <
> deepakmc...@gmail.com>
>
>
>
> Yes Sandeep, also copy hive-site.xml too to spark conf directory.
>
>
> On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana 
> wrote:
>
> Also, do I need to setup hive in spark as per the link
> http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark
> ?
>
> We might need to copy hdfs-site.xml file to spark conf directory ?
>
> On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
> wrote:
>
> Deepak
>
> Tried this. Getting this error now
>
> Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :   unused 
> argument ("")
>
>
> On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 
> wrote:
>
> Hi Sandeep
> can you try this ?
>
> results <- sql(hivecontext, "FROM test SELECT id","")
>
> Thanks
> Deepak
>
>
> On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
> wrote:
>
> Thanks Deepak.
>
> I tried this as well. I created a hivecontext   with  "hivecontext <<-
> sparkRHive.init(sc) "  .
>
> When I tried to read hive table from this ,
>
> results <- sql(hivecontext, "FROM test SELECT id")
>
> I get below error,
>
> Error in callJMethod(sqlContext, "sql", sqlQuery) :   Invalid jobj 2. If 
> SparkR was restarted, Spark operations need to be re-executed.
>
>
> Not sure what is causing this? Any leads or ideas? I am using rstudio.
>
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
> wrote:
>
> Hi Sandeep
> I am not sure if ORC can be read directly in R.
> But there can be a workaround .First create hive table on top of ORC files
> and then access hive table in R.
>
> Thanks
> Deepak
>
> On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana 
> wrote:
>
> Hello
>
> I need to read ORC files in HDFS in R using Spark. I am not able to
> find a package to do that.
>
> Can anyone help with documentation or example 

Lost tasks due to OutOfMemoryError (GC overhead limit exceeded)

2016-01-12 Thread Barak Yaish
Hello,

I have a 5-node cluster which hosts both HDFS datanodes and Spark workers.
Each node has 8 cpu and 16G memory. Spark version is 1.5.2, spark-env.sh is
as follow:

export SPARK_MASTER_IP=10.52.39.92

export SPARK_WORKER_INSTANCES=4

export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=4g

And more settings done in the application code:

sparkConf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.kryo.registrator",InternalKryoRegistrator.class.getName());
sparkConf.set("spark.kryo.registrationRequired","true");
sparkConf.set("spark.kryoserializer.buffer.max.mb","512");
sparkConf.set("spark.default.parallelism","300");
sparkConf.set("spark.rpc.askTimeout","500");

I'm trying to load data from hdfs and running some sqls on it (mostly
groupby) using DataFrames. The logs keep saying that tasks are lost due to
OutOfMemoryError (GC overhead limit exceeded).

Can you advise what the recommended settings (memory, cores, partitions,
etc.) are for the given hardware?

Thanks!


Re: Job History Logs for spark jobs submitted on YARN

2016-01-12 Thread Arkadiusz Bicz
Hi,

You can checkout http://spark.apache.org/docs/latest/monitoring.html,
you can monitor hdfs, memory usage per job and executor and driver. I
have connected it to Graphite for storage and Grafana for
visualization. I have also connected to collectd which provides me all
server nodes metrics like disc, memory and cpu utilization.

On Tue, Jan 12, 2016 at 10:50 AM, laxmanvemula  wrote:
> I observe that YARN jobs history logs are created in /user/history/done
> (*.jhist files) for all the mapreduce jobs like hive, pig etc. But for spark
> jobs submitted in yarn-cluster mode, the logs are not being created.
>
> I would like to see resource utilization by spark jobs. Is there any other
> place where I can find the resource utilization by spark jobs (CPU, Memory
> etc). Or is there any configuration to be set so that the job history logs
> are created just like other mapreduce jobs.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Job-History-Logs-for-spark-jobs-submitted-on-YARN-tp25946.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Jörn Franke
Ignite can also cache RDDs. 

> On 12 Jan 2016, at 13:06, Dmitry Goldenberg  wrote:
> 
> Jorn, you said Ignite or ... ? What was the second choice you were thinking 
> of? It seems that got omitted.
> 
>> On Jan 12, 2016, at 2:44 AM, Jörn Franke  wrote:
>> 
>> You can look at ignite as a HDFS cache or for  storing rdds. 
>> 
>>> On 11 Jan 2016, at 21:14, Dmitry Goldenberg  
>>> wrote:
>>> 
>>> We have a bunch of Spark jobs deployed and a few large resource files such 
>>> as e.g. a dictionary for lookups or a statistical model.
>>> 
>>> Right now, these are deployed as part of the Spark jobs which will 
>>> eventually make the mongo-jars too bloated for deployments.
>>> 
>>> What are some of the best practices to consider for maintaining and sharing 
>>> large resource files like these?
>>> 
>>> Thanks.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Use TCP client for id lookup

2016-01-12 Thread Kristoffer Sjögren
Hi

I'm trying to understand how to lookup certain id fields of RDDs to an
external mapping table. The table is accessed through a two-way binary
tcp client where an id is provided and entry returned. Entries cannot
be listed/scanned.

What's the simplest way of managing the tcp client and its connections
towards the external system? I suppose I cannot use the tcp client in
a mapToPair() call?

Cheers,
-Kristoffer
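
One common pattern, sketched below in Scala (LookupClient, idsRdd, the host and port are all hypothetical stand-ins for the real client and data), is mapPartitions: open one connection per partition, reuse it for every record in that partition, and close it when the partition is done. The Java equivalent would be mapPartitionsToPair instead of mapToPair.

// Hypothetical wrapper around the real binary TCP protocol.
class LookupClient(host: String, port: Int) {
  def lookup(id: String): String = ???   // blocking request/response against the external system
  def close(): Unit = ???
}

val enriched = idsRdd.mapPartitions { ids =>
  val client = new LookupClient("lookup-host", 4000)   // one connection per partition, created on the executor
  val results = ids.map(id => (id, client.lookup(id))).toList  // buffers the partition so the client can be closed afterwards
  client.close()
  results.iterator
}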

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Mllib Word2Vec vector representations are very high in value

2016-01-12 Thread Nick Pentreath
The similarities returned are not in fact true cosine similarities as they
are not properly normalized - this will be fixed in this PR:
https://github.com/apache/spark/pull/10152
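
Until that fix lands, a small workaround sketch (using the model from the quoted example) is to normalize the transformed vector to unit length yourself:

import org.apache.spark.mllib.linalg.Vectors

val raw = model.transform("SOMETHING").toArray
val norm = math.sqrt(raw.map(x => x * x).sum)
// Guard against a zero vector before dividing; the result then lies on the unit sphere.
val unitVec = if (norm > 0) Vectors.dense(raw.map(_ / norm)) else Vectors.dense(raw)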

On Tue, Dec 15, 2015 at 2:54 AM, jxieeducation 
wrote:

> Hi,
>
> For Word2Vec in Mllib, when I use a large number of partitions (e.g. 256),
> my vectors turn out to be very large. I am looking for a representation
> that
> is between (-1, 1) like all other Word2Vec implementations (e.g. Gensim,
> google's Word2Vec).
>
> E.g.
>
> scala> var m = model.transform("SOMETHING")
>
> m: org.apache.spark.mllib.linalg.Vector =
>
> [1.61478590965271,-13.385428428649902,-19.518991470336914,12.05420970916748,-6.141440391540527...]
>
>
>
> Thanks so much!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-Word2Vec-vector-representations-are-very-high-in-value-tp25702.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: sparkR ORC support.

2016-01-12 Thread Sandeep Khurana
It worked for some time. Then I did sparkR.stop() and re-ran again, only to get
the same error. Any idea why it ran fine before? (While it was running fine, it
kept warning about reusing the existing spark-context and that I should
restart.) There is one more R script which instantiated Spark; I ran that
again too.


On Tue, Jan 12, 2016 at 3:05 PM, Sandeep Khurana 
wrote:

> The complete stacktrace is below. Could it be something with Java versions?
>
>
> stop("invalid jobj ", value$id)
> 8
> writeJobj(con, object)
> 7
> writeObject(con, a)
> 6
> writeArgs(rc, args)
> 5
> invokeJava(isStatic = TRUE, className, methodName, ...)
> 4
> callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
> source, options)
> 3
> read.df(sqlContext, path, source, schema, ...)
> 2
> loadDF(hivecontext, filepath, "orc")
>
> On Tue, Jan 12, 2016 at 2:41 PM, Sandeep Khurana 
> wrote:
>
>> Running this gave
>>
>> 16/01/12 04:06:54 INFO BlockManagerMaster: Registered BlockManager
>> Error in writeJobj(con, object) : invalid jobj 3
>>
>>
>> How does it know which hive schema to connect to?
>>
>>
>>
>> On Tue, Jan 12, 2016 at 2:34 PM, Felix Cheung 
>> wrote:
>>
>>> It looks like you have overwritten sc. Could you try this:
>>>
>>>
>>> Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
>>>
>>> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"),
>>> .libPaths()))
>>> library(SparkR)
>>>
>>> sc <- sparkR.init()
>>> hivecontext <- sparkRHive.init(sc)
>>> df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")
>>>
>>>
>>>
>>> --
>>> Date: Tue, 12 Jan 2016 14:28:58 +0530
>>> Subject: Re: sparkR ORC support.
>>> From: sand...@infoworks.io
>>> To: felixcheun...@hotmail.com
>>> CC: yblia...@gmail.com; user@spark.apache.org; premsure...@gmail.com;
>>> deepakmc...@gmail.com
>>>
>>>
>>> The code is very simple, pasted below .
>>> hive-site.xml is in spark conf already. I still see this error
>>>
>>> Error in writeJobj(con, object) : invalid jobj 3
>>>
>>> after running the script  below
>>>
>>>
>>> script
>>> ===
>>> Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
>>>
>>>
>>> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"),
>>> .libPaths()))
>>> library(SparkR)
>>>
>>> sc <<- sparkR.init()
>>> sc <<- sparkRHive.init()
>>> hivecontext <<- sparkRHive.init(sc)
>>> df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")
>>> #View(df)
>>>
>>>
>>> On Wed, Jan 6, 2016 at 11:08 PM, Felix Cheung >> > wrote:
>>>
>>> Yes, as Yanbo suggested, it looks like there is something wrong with the
>>> sqlContext.
>>>
>>> Could you forward us your code please?
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jan 6, 2016 at 5:52 AM -0800, "Yanbo Liang" 
>>> wrote:
>>>
>>> You should ensure your sqlContext is HiveContext.
>>>
>>> sc <- sparkR.init()
>>>
>>> sqlContext <- sparkRHive.init(sc)
>>>
>>>
>>> 2016-01-06 20:35 GMT+08:00 Sandeep Khurana :
>>>
>>> Felix
>>>
>>> I tried the option suggested by you.  It gave below error.  I am going
>>> to try the option suggested by Prem .
>>>
>>> Error in writeJobj(con, object) : invalid jobj 1
>>> 8
>>> stop("invalid jobj ", value$id)
>>> 7
>>> writeJobj(con, object)
>>> 6
>>> writeObject(con, a)
>>> 5
>>> writeArgs(rc, args)
>>> 4
>>> invokeJava(isStatic = TRUE, className, methodName, ...)
>>> 3
>>> callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
>>> source, options)
>>> 2
>>> read.df(sqlContext, filepath, "orc") at
>>> spark_api.R#108
>>>
>>> On Wed, Jan 6, 2016 at 10:30 AM, Felix Cheung >> > wrote:
>>>
>>> Firstly I don't have ORC data to verify but this should work:
>>>
>>> df <- loadDF(sqlContext, "data/path", "orc")
>>>
>>> Secondly, could you check if sparkR.stop() was called? sparkRHive.init()
>>> should be called after sparkR.init() - please check if there is any error
>>> message there.
>>>
>>> _
>>> From: Prem Sure 
>>> Sent: Tuesday, January 5, 2016 8:12 AM
>>> Subject: Re: sparkR ORC support.
>>> To: Sandeep Khurana 
>>> Cc: spark users , Deepak Sharma <
>>> deepakmc...@gmail.com>
>>>
>>>
>>>
>>> Yes Sandeep, also copy hive-site.xml too to spark conf directory.
>>>
>>>
>>> On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana 
>>> wrote:
>>>
>>> Also, do I need to setup hive in spark as per the link
>>> http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark
>>> ?
>>>
>>> We might need to copy hdfs-site.xml file to spark conf directory ?
>>>
>>> On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
>>> wrote:
>>>
>>> Deepak
>>>
>>> Tried this. Getting this error now
>>>
>>> Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :   
>>> unused argument ("")
>>>
>>>
>>> On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Muthu Jayakumar
Thanks Michael. Let me test it with a recent master code branch.

Also, do I have to create a new case class for every mapping step? I
cannot use Tuple as I have ~130 columns to process. Earlier I had used a
Seq[Any] (actually Array[Any] to optimize on serialization) but processed
it using RDD (by building the Schema at runtime). Now I am attempting to
replace this using Dataset.

>the problem is that at compile time we don't know if its an inner or outer
join.
May I suggest to have different methods for different kind of joins
(similar to RDD api)? This way the typesafety is enforced.

Here is the error message.

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task not serializable: java.io.NotSerializableException:
scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1
Serialization stack: - object not serializable (class:
scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1,
value: package lang) - field (class: scala.reflect.internal.Types$ThisType,
name: sym, type: class scala.reflect.internal.Symbols$Symbol) - object
(class scala.reflect.internal.Types$UniqueThisType, java.lang.type) - field
(class: scala.reflect.internal.Types$TypeRef, name: pre, type: class
scala.reflect.internal.Types$Type) - object (class
scala.reflect.internal.Types$ClassNoArgsTypeRef, String) - field (class:
scala.reflect.internal.Types$TypeRef, name: normalized, type: class
scala.reflect.internal.Types$Type) - object (class
scala.reflect.internal.Types$AliasNoArgsTypeRef, String) - field (class:
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1,
type: class scala.reflect.api.Types$TypeApi) - object (class
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, ) - field (class:
org.apache.spark.sql.catalyst.expressions.MapObjects, name: function, type:
interface scala.Function1) - object (class
org.apache.spark.sql.catalyst.expressions.MapObjects,
mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- field
(class: "scala.collection.immutable.Map", name: "map"),- root class:
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType)) -
field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name:
targetObject, type: class
org.apache.spark.sql.catalyst.expressions.Expression) - object (class
org.apache.spark.sql.catalyst.expressions.Invoke,
invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
field (class: "scala.collection.immutable.Map", name: "map"),- root class:
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
[Ljava.lang.Object;))) - writeObject data (class:
scala.collection.immutable.List$SerializationProxy) - object (class
scala.collection.immutable.List$SerializationProxy,
scala.collection.immutable.List$SerializationProxy@7e78c3cf) - writeReplace
data (class: scala.collection.immutable.List$SerializationProxy) - object
(class scala.collection.immutable.$colon$colon,
List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
field (class: "scala.collection.immutable.Map", name: "map"),- root class:
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
[Ljava.lang.Object;)),
invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
field (class: "scala.collection.immutable.Map", name: "map"),- root class:
"collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
[Ljava.lang.Object; - field (class:
org.apache.spark.sql.catalyst.expressions.StaticInvoke, name: arguments,
type: interface scala.collection.Seq) - object (class
org.apache.spark.sql.catalyst.expressions.StaticInvoke, staticinvoke(class
org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface
scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
field (class: "scala.collection.immutable.Map", name: "map"),- root class:
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
[Ljava.lang.Object;)),invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
field (class: "scala.collection.immutable.Map", name: "map"),- root class:
"collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
[Ljava.lang.Object;)),true)) - writeObject data (class:
scala.collection.immutable.List$SerializationProxy) - object (class
scala.collection.immutable.List$SerializationProxy,
scala.collection.immutable.List$SerializationProxy@377795c5) - writeReplace
data (class: scala.collection.immutable.List$SerializationProxy) - object
(class scala.collection.immutable.$colon$colon, List(staticinvoke(class
org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface
scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
field (class: 

Re: How to view the RDD data based on Partition

2016-01-12 Thread Prem Sure
Try mapPartitionsWithIndex. Below is an example I used earlier; the myfunc
logic can be further modified as per your need.
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
  iter.toList.map(x => index + "," + x).iterator
}
x.mapPartitionsWithIndex(myfunc).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9)
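
A related sketch, if the goal is just to see which elements landed in which partition: glom() turns each partition into an array, so collecting gives one array per partition. Shown here on the same 3-partition example; the same call works for the 2-partition multi2s_RDD from the question.

x.glom().collect()
// expected: Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))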

On Tue, Jan 12, 2016 at 2:06 PM, Gokula Krishnan D 
wrote:

> Hello All -
>
> I'm just trying to understand aggregate() and in the meantime got a
> question.
>
> *Is there any way to view the RDD data based on the partition?*
>
> For instance, the following RDD has 2 partitions
>
> val multi2s = List(2,4,6,8,10,12,14,16,18,20)
> val multi2s_RDD = sc.parallelize(multi2s,2)
>
> is there any way to view the data based on the partitions (0,1)?
>
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>


Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2016-01-12 Thread Alex Nastetsky
Ran into this need myself. Does Spark have an equivalent of
"mapreduce.input.fileinputformat.list-status.num-threads"?

Thanks.

On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park  wrote:

> Hi,
>
> I am wondering if anyone has successfully enabled
> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
> usually set this property to 25 to speed up file listing in MR jobs (Hive
> and Pig). But for some reason, this property does not take effect in Spark
> HadoopRDD resulting in serious delay in file listing.
>
> I verified that the property is indeed set in HadoopRDD by logging the
> value of the property in the getPartitions() function. I also tried to
> attach VisualVM to Spark and Pig clients, which look as follows-
>
> In Pig, I can see 25 threads running in parallel for file listing-
> [image: Inline image 1]
>
> In Spark, I only see 2 threads running in parallel for file listing-
> [image: Inline image 2]
>
> What's strange is that the # of concurrent threads in Spark is throttled
> no matter how high I
> set "mapreduce.input.fileinputformat.list-status.num-threads".
>
> Is anyone using Spark with this property enabled? If so, can you please
> share how you do it?
>
> Thanks!
> Cheolsoo
>


How to view the RDD data based on Partition

2016-01-12 Thread Gokula Krishnan D
Hello All -

I'm just trying to understand aggregate() and in the meantime got a
question.

*Is there any way to view the RDD data based on the partition?*

For instance, the following RDD has 2 partitions

val multi2s = List(2,4,6,8,10,12,14,16,18,20)
val multi2s_RDD = sc.parallelize(multi2s,2)

is there any way to view the data based on the partitions (0,1)?


Thanks & Regards,
Gokula Krishnan* (Gokul)*


Big data job only finishes with Legacy memory management

2016-01-12 Thread Saif.A.Ellafi
Hello,

I am tinkering with Spark 1.6. I have a dataset of about 1.5 billion rows, to which I 
apply several window functions such as lag, first, etc. The job is quite 
expensive; I am running a small cluster with executors running with 70GB of RAM.

Using the new memory management system, the job fails around the middle with a heap 
memory limit exceeded problem. I also tried tinkering with several of the new 
memory settings, with no success. 70GB * 4 nodes is a lot of resources for this 
kind of job.

Legacy-mode memory management runs this job successfully with default memory 
settings.

How could I further analyze this problem to provide better 
diagnostics?
The whole job goes through the DataFrame API, with nothing strange (no UDFs or 
custom operations).

Saif
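
For reference, a sketch of the relevant 1.6 settings (the values below are only placeholders, not recommendations): spark.memory.useLegacyMode switches back to the 1.5-style manager, while spark.memory.fraction and spark.memory.storageFraction tune the new unified one.

val conf = new org.apache.spark.SparkConf()
  // Tune the unified memory manager introduced in 1.6...
  .set("spark.memory.fraction", "0.75")
  .set("spark.memory.storageFraction", "0.5")
  // ...or fall back to the 1.5 behaviour that lets this job finish:
  .set("spark.memory.useLegacyMode", "true")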



Eigenvalue solver

2016-01-12 Thread Lydia Ickler
Hi,

I wanted to know whether there are any implementations yet, within the Machine 
Learning Library or elsewhere, that can efficiently solve eigenvalue problems.
Or, if not, do you have suggestions on how to approach a parallel implementation, maybe 
with BLAS or Breeze?

Thanks in advance!
Lydia
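
There is no general distributed eigensolver in MLlib as of 1.6, but two starting points are sketched below (rowVectors is a hypothetical RDD[Vector]): RowMatrix.computeSVD for tall-and-skinny distributed matrices (the singular values of A are the square roots of the eigenvalues of A^T A), and Breeze's eigSym for symmetric matrices that fit on a single machine.

import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Distributed: top-20 singular values/vectors of an RDD[Vector].
val svd = new RowMatrix(rowVectors).computeSVD(20, computeU = false)
val singularValues = svd.s

// Local: eigen-decomposition of a symmetric matrix with Breeze (bundled with MLlib).
import breeze.linalg.{DenseMatrix, eigSym}
val a = DenseMatrix((2.0, 1.0), (1.0, 2.0))
val es = eigSym(a)   // es.eigenvalues and es.eigenvectors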


Sent from my iPhone
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Michael Armbrust
>
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
>
> This looks like a bug.  What is the error?  It might be fixed in
branch-1.6/master if you can test there.

> Please advice on what I may be missing here?
>
>
> Also for join, may I suggest to have a custom encoder / transformation to
> say how 2 datasets can merge?
> Also, when a join in made using something like 'left outer join' the right
> side object should ideally be Option kind (similar to what's seen in RDD).
> And I think this may make it strongly typed?
>

I think you can actually use as to convert this to an Option if you'd like
typesafety.  The problem is that at compile time we don't know if it's an
inner or outer join.


Re: Lost tasks due to OutOfMemoryError (GC overhead limit exceeded)

2016-01-12 Thread Muthu Jayakumar
> export SPARK_WORKER_MEMORY=4g
Maybe you could increase the max heap size on the worker? In case the
OutOfMemoryError is for the driver, then you may want to set the heap explicitly
for the driver.

Thanks,
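
A sketch of the knobs involved (the values below are placeholders, not recommendations): spark.executor.memory sets each executor's heap and has to fit inside SPARK_WORKER_MEMORY, spark.sql.shuffle.partitions controls how many partitions each DataFrame groupBy/aggregation shuffle produces, and the driver heap is normally passed to spark-submit rather than set in code.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")          // per-executor heap; must fit within SPARK_WORKER_MEMORY
  .set("spark.sql.shuffle.partitions", "300")  // more shuffle partitions => smaller per-task state for groupBy
// Driver heap: pass --driver-memory 4g to spark-submit; setting it from code is too late in client mode.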



On Tue, Jan 12, 2016 at 2:04 AM, Barak Yaish  wrote:

> Hello,
>
> I have a 5-node cluster which hosts both HDFS datanodes and Spark workers.
> Each node has 8 cpu and 16G memory. Spark version is 1.5.2, spark-env.sh is
> as follow:
>
> export SPARK_MASTER_IP=10.52.39.92
>
> export SPARK_WORKER_INSTANCES=4
>
> export SPARK_WORKER_CORES=8
> export SPARK_WORKER_MEMORY=4g
>
> And more settings done in the application code:
>
>
> sparkConf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer");
>
> sparkConf.set("spark.kryo.registrator",InternalKryoRegistrator.class.getName());
> sparkConf.set("spark.kryo.registrationRequired","true");
> sparkConf.set("spark.kryoserializer.buffer.max.mb","512");
> sparkConf.set("spark.default.parallelism","300");
> sparkConf.set("spark.rpc.askTimeout","500");
>
> I'm trying to load data from hdfs and running some sqls on it (mostly
> groupby) using DataFrames. The logs keep saying that tasks are lost due to
> OutOfMemoryError (GC overhead limit exceeded).
>
> Can you advise what the recommended settings (memory, cores,
> partitions, etc.) are for the given hardware?
>
> Thanks!
>


Re: Regarding sliding window example from Databricks for DStream

2016-01-12 Thread Cassa L
Any thoughts on this? I want to know when the window duration is complete,
not the sliding interval. Is there a way I can catch the end of the window
duration, or do I need to keep track of it myself, and if so, how?

LCassa

On Mon, Jan 11, 2016 at 3:09 PM, Cassa L  wrote:

> Hi,
>  I'm trying to work with sliding window example given by databricks.
>
> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html
>
> It works fine as expected.
> My question is how do I determine when the last phase of the slider has been
> reached. I want to perform a final operation and notify another system when the end
> of the slider has reached the window duration. E.g. in the below example
> from databricks,
>
>
> JavaDStream windowDStream =
> accessLogDStream.window(WINDOW_LENGTH, SLIDE_INTERVAL);
> windowDStream.foreachRDD(accessLogs -> {
>   if (accessLogs.count() == 0) {
> System.out.println("No access logs in this time interval");
> return null;
>   }
>
>   // Insert code verbatim from LogAnalyzer.java or LogAnalyzerSQL.java here.
>
>   // Calculate statistics based on the content size.
>   JavaRDD contentSizes =
>   accessLogs.map(ApacheAccessLog::getContentSize).cache();
>   System.out.println(String.format("Content Size Avg: %s, Min: %s, Max: %s",
>   contentSizes.reduce(SUM_REDUCER) / contentSizes.count(),
>   contentSizes.min(Comparator.naturalOrder()),
>   contentSizes.max(Comparator.naturalOrder(;
>
>//.
> }
>
> I want to check whether the overall average at the end of the window falls below a certain 
> value and send an alert. How do I get this?
>
>
> Thanks,
> LCassa
>
>
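
A sketch in Scala of one way to get an "end of window" hook (threshold and alert() are hypothetical, and accessLogDStream stands in for the stream from the quoted example): every foreachRDD invocation on a windowed stream sees one complete window, so if the window length equals the slide interval the windows do not overlap and each invocation marks the end of one window duration.

import org.apache.spark.streaming.Minutes

// Window length == slide interval, so each RDD below covers exactly one full window.
val windowed = accessLogDStream.window(Minutes(30), Minutes(30))
windowed.foreachRDD { logs =>
  if (!logs.isEmpty()) {
    val sizes = logs.map(_.getContentSize).cache()
    val avg = sizes.reduce(_ + _) / sizes.count()
    if (avg < threshold) alert(avg)   // hypothetical alerting hook for the external system
  }
}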


Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread Vijay Kiran
I think it would be this: https://github.com/onetapbeyond/opencpu-spark-executor

> On 12 Jan 2016, at 18:32, Corey Nolet  wrote:
> 
> David,
> 
> Thank you very much for announcing this! It looks like it could be very 
> useful. Would you mind providing a link to the github?
> 
> On Tue, Jan 12, 2016 at 10:03 AM, David  
> wrote:
> Hi all,
> 
> I'd like to share news of the recent release of a new Spark package, ROSE. 
> 
> ROSE is a Scala library offering access to the full scientific computing 
> power of the R programming language to Apache Spark batch and streaming 
> applications on the JVM. Where Apache SparkR lets data scientists use Spark 
> from R, ROSE is designed to let Scala and Java developers use R from Spark. 
> 
> The project is available and documented on GitHub and I would encourage you 
> to take a look. Any feedback, questions etc very welcome.
> 
> David
> 
> "All that is gold does not glitter, Not all those who wander are lost."
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Muthu Jayakumar
I tried to rerun the same code with current snapshot version of 1.6 and 2.0
from
https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/

But I still see an exception around the same line. Here is the exception
below. Filed an issue against the same SPARK-12783


.13:49:07.388 [main] ERROR o.a.s.s.c.e.c.GenerateSafeProjection - failed to
compile: org.codehaus.commons.compiler.CompileException: File
'generated.java', Line 140, Column 47: No applicable constructor/method
found for actual parameters "scala.collection.Map"; candidates are:
"collector.MyMap(scala.collection.immutable.Map)"
/* 001 */
/* 002 */ public java.lang.Object
generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
/* 003 */   return new SpecificSafeProjection(expr);
/* 004 */ }
/* 005 */
/* 006 */ class SpecificSafeProjection extends
org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 007 */
/* 008 */   private org.apache.spark.sql.catalyst.expressions.Expression[]
expressions;
/* 009 */   private org.apache.spark.sql.catalyst.expressions.MutableRow
mutableRow;
/* 010 */
/* 011 */
/* 012 */
/* 013 */   public
SpecificSafeProjection(org.apache.spark.sql.catalyst.expressions.Expression[]
expr) {
/* 014 */ expressions = expr;
/* 015 */ mutableRow = new
org.apache.spark.sql.catalyst.expressions.GenericMutableRow(1);
/* 016 */
/* 017 */   }
/* 018 */
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* newinstance(class collector.MyMap,staticinvoke(class
org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface
scala.collection.Map),toScalaMap,invoke(mapobjects(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),invoke(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),toString,ObjectType(class
java.lang.String)),invoke(input[0,
MapType(StringType,StringType,true)],keyArray,ArrayType(StringType,true))),array,ObjectType(class
[Ljava.lang.Object;)),invoke(mapobjects(lambdavariable(MapObjects_loopValue6,MapObjects_loopIsNull7,StringType),invoke(lambdavariable(MapObjects_loopValue6,MapObjects_loopIsNull7,StringType),toString,ObjectType(class
java.lang.String)),invoke(input[0,
MapType(StringType,StringType,true)],valueArray,ArrayType(StringType,true))),array,ObjectType(class
[Ljava.lang.Object;)),true),false,ObjectType(class collector.MyMap),None) */
/* 022 */ /* staticinvoke(class
org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface
scala.collection.Map),toScalaMap,invoke(mapobjects(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),invoke(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),toString,ObjectType(class
java.lang.String)),invoke(input[0,
MapType(StringType,StringType,true)],keyArray,ArrayType(StringType,true))),array,ObjectType(class
[Ljava.lang.Object;)),invoke(mapobjects(lambdavariable(MapObjects_loopValue6,MapObjects_loopIsNull7,StringType),invoke(lambdavariable(MapObjects_loopValue6,MapObjects_loopIsNull7,StringType),toString,ObjectType(class
java.lang.String)),invoke(input[0,
MapType(StringType,StringType,true)],valueArray,ArrayType(StringType,true))),array,ObjectType(class
[Ljava.lang.Object;)),true) */
/* 023 */ /*
invoke(mapobjects(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),invoke(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),toString,ObjectType(class
java.lang.String)),invoke(input[0,
MapType(StringType,StringType,true)],keyArray,ArrayType(StringType,true))),array,ObjectType(class
[Ljava.lang.Object;)) */
/* 024 */ /*
mapobjects(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),invoke(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),toString,ObjectType(class
java.lang.String)),invoke(input[0,
MapType(StringType,StringType,true)],keyArray,ArrayType(StringType,true)))
*/
/* 025 */ /* invoke(input[0,
MapType(StringType,StringType,true)],keyArray,ArrayType(StringType,true)) */
/* 026 */ /* input[0, MapType(StringType,StringType,true)] */
/* 027 */ boolean isNull10 = i.isNullAt(0);
/* 028 */ MapData primitive11 = isNull10 ? null : (i.getMap(0));
/* 029 */
/* 030 */
/* 031 */ boolean isNull8 = isNull10;
/* 032 */ ArrayData primitive9 =
/* 033 */ isNull8 ?
/* 034 */ null : (ArrayData) primitive11.keyArray();
/* 035 */ isNull8 = primitive9 == null;
/* 036 */
/* 037 */ boolean isNull6 = primitive9 == null;
/* 038 */ ArrayData primitive7 = null;
/* 039 */
/* 040 */ if (!isNull6) {
/* 041 */   java.lang.String[] convertedArray15 = null;
/* 042 */   int dataLength14 = primitive9.numElements();
/* 043 */   convertedArray15 = new java.lang.String[dataLength14];
/* 044 */
/* 045 */   int loopIndex16 = 0;
/* 046 */   while (loopIndex16 < 

rdd join very slow when rdd created from data frame

2016-01-12 Thread Koert Kuipers
we are having a join of 2 rdds that's fast (< 1 min), and suddenly it
wouldn't even finish overnight anymore. the change was that the rdd was now
derived from a dataframe.

so the new code that runs forever is something like this:
dataframe.rdd.map(row => (Row(row(0)), row)).join(...)

any idea why?
i imagined it had something to do with recomputing parts of the data frame,
but even a small change like this makes the issue go away:
dataframe.rdd.map(row => Row.fromSeq(row.toSeq)).map(row => (Row(row(0)),
row)).join(...)


Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Michael Armbrust
Awesome, thanks for opening the JIRA!  We'll take a look.

On Tue, Jan 12, 2016 at 1:53 PM, Muthu Jayakumar  wrote:

> I tried to rerun the same code with current snapshot version of 1.6 and
> 2.0 from
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/
>
> But I still see an exception around the same line. Here is the exception
> below. Filed an issue against the same SPARK-12783
> 
>
> .13:49:07.388 [main] ERROR o.a.s.s.c.e.c.GenerateSafeProjection - failed
> to compile: org.codehaus.commons.compiler.CompileException: File
> 'generated.java', Line 140, Column 47: No applicable constructor/method
> found for actual parameters "scala.collection.Map"; candidates are:
> "collector.MyMap(scala.collection.immutable.Map)"
> /* 001 */
> /* 002 */ public java.lang.Object
> generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
> /* 003 */   return new SpecificSafeProjection(expr);
> /* 004 */ }
> /* 005 */
> /* 006 */ class SpecificSafeProjection extends
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 007 */
> /* 008 */   private org.apache.spark.sql.catalyst.expressions.Expression[]
> expressions;
> /* 009 */   private org.apache.spark.sql.catalyst.expressions.MutableRow
> mutableRow;
> /* 010 */
> /* 011 */
> /* 012 */
> /* 013 */   public
> SpecificSafeProjection(org.apache.spark.sql.catalyst.expressions.Expression[]
> expr) {
> /* 014 */ expressions = expr;
> /* 015 */ mutableRow = new
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow(1);
> /* 016 */
> /* 017 */   }
> /* 018 */
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* newinstance(class collector.MyMap,staticinvoke(class
> org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface
> scala.collection.Map),toScalaMap,invoke(mapobjects(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),invoke(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),toString,ObjectType(class
> java.lang.String)),invoke(input[0,
> MapType(StringType,StringType,true)],keyArray,ArrayType(StringType,true))),array,ObjectType(class
> [Ljava.lang.Object;)),invoke(mapobjects(lambdavariable(MapObjects_loopValue6,MapObjects_loopIsNull7,StringType),invoke(lambdavariable(MapObjects_loopValue6,MapObjects_loopIsNull7,StringType),toString,ObjectType(class
> java.lang.String)),invoke(input[0,
> MapType(StringType,StringType,true)],valueArray,ArrayType(StringType,true))),array,ObjectType(class
> [Ljava.lang.Object;)),true),false,ObjectType(class collector.MyMap),None) */
> /* 022 */ /* staticinvoke(class
> org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface
> scala.collection.Map),toScalaMap,invoke(mapobjects(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),invoke(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),toString,ObjectType(class
> java.lang.String)),invoke(input[0,
> MapType(StringType,StringType,true)],keyArray,ArrayType(StringType,true))),array,ObjectType(class
> [Ljava.lang.Object;)),invoke(mapobjects(lambdavariable(MapObjects_loopValue6,MapObjects_loopIsNull7,StringType),invoke(lambdavariable(MapObjects_loopValue6,MapObjects_loopIsNull7,StringType),toString,ObjectType(class
> java.lang.String)),invoke(input[0,
> MapType(StringType,StringType,true)],valueArray,ArrayType(StringType,true))),array,ObjectType(class
> [Ljava.lang.Object;)),true) */
> /* 023 */ /*
> invoke(mapobjects(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),invoke(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),toString,ObjectType(class
> java.lang.String)),invoke(input[0,
> MapType(StringType,StringType,true)],keyArray,ArrayType(StringType,true))),array,ObjectType(class
> [Ljava.lang.Object;)) */
> /* 024 */ /*
> mapobjects(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),invoke(lambdavariable(MapObjects_loopValue4,MapObjects_loopIsNull5,StringType),toString,ObjectType(class
> java.lang.String)),invoke(input[0,
> MapType(StringType,StringType,true)],keyArray,ArrayType(StringType,true)))
> */
> /* 025 */ /* invoke(input[0,
> MapType(StringType,StringType,true)],keyArray,ArrayType(StringType,true)) */
> /* 026 */ /* input[0, MapType(StringType,StringType,true)] */
> /* 027 */ boolean isNull10 = i.isNullAt(0);
> /* 028 */ MapData primitive11 = isNull10 ? null : (i.getMap(0));
> /* 029 */
> /* 030 */
> /* 031 */ boolean isNull8 = isNull10;
> /* 032 */ ArrayData primitive9 =
> /* 033 */ isNull8 ?
> /* 034 */ null : (ArrayData) primitive11.keyArray();
> /* 035 */ isNull8 = primitive9 == null;
> /* 036 */
> /* 037 */ boolean isNull6 = primitive9 == null;
> /* 038 */ ArrayData primitive7 = null;
> /* 039 */
> /* 

Maintain a state till the end of the application

2016-01-12 Thread turing.us
Hi!

I have some application (skeleton):

val sc = new SparkContext($SOME_CONF)
val input = sc.textFile(inputFile)

val result = input.map(record => {
  val myState = new MyState() // state
})
.filter($SOME_FILTER)
.sortBy($SOME_SORT)
.partitionBy(new HashPartitioner(100))
.reduceByKey((first,current) => current.updateRecord(first))

val report = result.map {x =>  x._2.generateReport(x._1)}.coalesce(1)
report.saveAsTextFile($SOME_OUTPUT)


// classes
case class AccumulatedState() {
...
}

case class MyState()  {
  var data: AccumulatedState = new AccumulatedState()
...
}
// EOC



I map the input by UUID.
Also, I have a custom state (MyState, which holds AccumulatedState).
If the application runs in local mode, or on a real cluster with a single
reducer, the behavior is correct,
but if I try to run it with multiple reducers, the state gets reset
somewhere in the processing.

How is a shared state maintained over the lifecycle of the application?
I know that there are Accumulators & Broadcast variables, but those are meant
for different use-cases (counters & global static data (like lookup tables)).

How can I have such a shared state across the program till the end, to generate
my results?

Thanks!!!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Maintain-a-state-till-the-end-of-the-application-tp25944.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Job History Logs for spark jobs submitted on YARN

2016-01-12 Thread laxmanvemula
I observe that YARN jobs history logs are created in /user/history/done
(*.jhist files) for all the mapreduce jobs like hive, pig etc. But for spark
jobs submitted in yarn-cluster mode, the logs are not being created.

I would like to see resource utilization by spark jobs. Is there any other
place where I can find the resource utilization by spark jobs (CPU, Memory
etc). Or is there any configuration to be set so that the job history logs
are created just like other mapreduce jobs.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Job-History-Logs-for-spark-jobs-submitted-on-YARN-tp25946.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
I'd guess that if the resources are broadcast Spark would put them into 
Tachyon...

> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg  
> wrote:
> 
> Would it make sense to load them into Tachyon and read and broadcast them 
> from there since Tachyon is already a part of the Spark stack?
> 
> If so I wonder if I could do that Tachyon read/write via a Spark API?
> 
> 
>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan 
>>  wrote:
>> 
>> One option could be to store them as blobs in a cache like Redis and then 
>> read + broadcast them from the driver. Or you could store them in HDFS and 
>> read + broadcast from the driver.
>> 
>> Regards
>> Sab
>> 
>>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg 
>>>  wrote:
>>> We have a bunch of Spark jobs deployed and a few large resource files such 
>>> as e.g. a dictionary for lookups or a statistical model.
>>> 
>>> Right now, these are deployed as part of the Spark jobs which will 
>>> eventually make the mongo-jars too bloated for deployments.
>>> 
>>> What are some of the best practices to consider for maintaining and sharing 
>>> large resource files like these?
>>> 
>>> Thanks.
>> 
>> 
>> 
>> -- 
>> 
>> Architect - Big Data
>> Ph: +91 99805 99458
>> 
>> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
>> India ICT)
>> +++


Re: Unshaded google guava classes in spark-network-common jar

2016-01-12 Thread Sean Owen
No, this is on purpose. Have a look at the build POM. A few Guava classes
were used in the public API for Java and have had to stay unshaded. In 2.x
/ master this is already changed such that no unshaded Guava classes should
be included.

On Tue, Jan 12, 2016, 07:28 Jake Yoon  wrote:

> I found unshaded Google Guava classes used internally in
> spark-network-common while working with ElasticSearch.
>
> The following link discusses the duplicate-dependency conflict caused by
> Guava classes and how I solved the build conflict issue.
>
>
> https://discuss.elastic.co/t/exception-when-using-elasticsearch-spark-and-elasticsearch-core-together/38471/4
>
> Is this worth raising an issue?
>
> --
> Dynamicscope
>


Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Jorn, you said Ignite or ... ? What was the second choice you were thinking of? 
It seems that got omitted.

> On Jan 12, 2016, at 2:44 AM, Jörn Franke  wrote:
> 
> You can look at ignite as a HDFS cache or for  storing rdds. 
> 
>> On 11 Jan 2016, at 21:14, Dmitry Goldenberg  wrote:
>> 
>> We have a bunch of Spark jobs deployed and a few large resource files such 
>> as e.g. a dictionary for lookups or a statistical model.
>> 
>> Right now, these are deployed as part of the Spark jobs which will 
>> eventually make the mongo-jars too bloated for deployments.
>> 
>> What are some of the best practices to consider for maintaining and sharing 
>> large resource files like these?
>> 
>> Thanks.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Read from AWS s3 with out having to hard-code sensitive keys

2016-01-12 Thread ayan guha
On EMR, you can add fs.* params in emrfs-site.xml.

On Tue, Jan 12, 2016 at 7:27 AM, Jonathan Kelly 
wrote:

> Yes, IAM roles are actually required now for EMR. If you use Spark on EMR
> (vs. just EC2), you get S3 configuration for free (it goes by the name
> EMRFS), and it will use your IAM role for communicating with S3. Here is
> the corresponding documentation:
> http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-fs.html
>
> On Mon, Jan 11, 2016 at 11:37 AM Matei Zaharia 
> wrote:
>
>> In production, I'd recommend using IAM roles to avoid having keys
>> altogether. Take a look at
>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html
>> .
>>
>> Matei
>>
>> On Jan 11, 2016, at 11:32 AM, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>> If you are on EMR, these can go into your hdfs site config. And will work
>> with Spark on YARN by default.
>>
>> Regards
>> Sab
>> On 11-Jan-2016 5:16 pm, "Krishna Rao"  wrote:
>>
>>> Hi all,
>>>
>>> Is there a method for reading from s3 without having to hard-code keys?
>>> The only 2 ways I've found both require this:
>>>
>>> 1. Set conf in code e.g.:
>>> sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "")
>>> sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey",
>>> "")
>>>
>>> 2. Set keys in URL, e.g.:
>>> sc.textFile("s3n://@/bucket/test/testdata")
>>>
>>>
>>> Both if which I'm reluctant to do within production code!
>>>
>>>
>>> Cheers
>>>
>>
>>


-- 
Best Regards,
Ayan Guha


Re: pre-install 3-party Python package on spark cluster

2016-01-12 Thread ayan guha
2 cents:

1. You should use an environment management tool, such as Ansible, Puppet
or Chef, to handle this kind of use case (and a lot more, e.g. what if you want
to add more nodes or to replace one bad node).
2. There are options such as --py-files to provide a zip file.

On Tue, Jan 12, 2016 at 6:11 AM, Annabel Melongo <
melongo_anna...@yahoo.com.invalid> wrote:

> When you run spark submit in either client or cluster mode, you can either
> use the options --packages or -jars to automatically copy your packages to
> the worker machines.
>
> Thanks
>
>
> On Monday, January 11, 2016 12:52 PM, Andy Davidson
>  wrote:
>
>
> I use https://code.google.com/p/parallel-ssh/ to upgrade all my slaves
>
>
>
> From: "taotao.li" 
> Date: Sunday, January 10, 2016 at 9:50 PM
> To: "user @spark" 
> Subject: pre-install 3-party Python package on spark cluster
>
> I have a spark cluster, from machine-1 to machine 100, and machine-1 acts
> as
> the master.
>
> Then one day my program needs to use a third-party Python package which is not
> installed on every machine of the cluster.
>
> So here comes my problem: to make that third-party Python package usable on
> the master and slaves, should I manually ssh to every machine and use pip to
> install that package?
>
> I believe there should be some deploy scripts or other tools to make this
> graceful, but I can't find anything after googling.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/pre-install-3-party-Python-package-on-spark-cluster-tp25930.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>


-- 
Best Regards,
Ayan Guha


Top K Parallel FPGrowth Implementation

2016-01-12 Thread jcbarton
Hi,

Are there any plans to implement the "Top K Parallel FPGrowth" algorithm in
Spark, that aggregates the frequent items together (similar to the mahout
implementation).


Thanks

John



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Top-K-Parallel-FPGrowth-Implementation-tp25945.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Would it make sense to load them into Tachyon and read and broadcast them from 
there since Tachyon is already a part of the Spark stack?

If so I wonder if I could do that Tachyon read/write via a Spark API?


> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan 
>  wrote:
> 
> One option could be to store them as blobs in a cache like Redis and then 
> read + broadcast them from the driver. Or you could store them in HDFS and 
> read + broadcast from the driver.
> 
> Regards
> Sab
> 
>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg 
>>  wrote:
>> We have a bunch of Spark jobs deployed and a few large resource files such 
>> as e.g. a dictionary for lookups or a statistical model.
>> 
>> Right now, these are deployed as part of the Spark jobs which will 
>> eventually make the mongo-jars too bloated for deployments.
>> 
>> What are some of the best practices to consider for maintaining and sharing 
>> large resource files like these?
>> 
>> Thanks.
> 
> 
> 
> -- 
> 
> Architect - Big Data
> Ph: +91 99805 99458
> 
> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
> India ICT)
> +++


Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Thanks, Gene.

Does Spark use Tachyon under the covers anyway for implementing its
"cluster memory" support?

It seems that the practice I hear the most about is the idea of loading
resources as RDDs and then doing joins against them to achieve the lookup
effect.

The other approach would be to load the resources into broadcast variables
but I've heard concerns about memory.  Could we run out of memory if we
load too much into broadcast vars?  Is there any memory_to_disk/spill to
disk capability for broadcast variables in Spark?
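
A minimal sketch of the "load from HDFS on the driver, then broadcast"
pattern discussed above (the path, the tab-separated format and the
dataRdd input are placeholders, and it assumes the dictionary fits
comfortably in memory):

// Load the resource once on the driver
val dictionary: Map[String, String] =
  sc.textFile("hdfs:///resources/dictionary.tsv")
    .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
    .collect()
    .toMap

// Ship it to every executor once instead of once per task
val dictBc = sc.broadcast(dictionary)

// dataRdd is a placeholder RDD[String]; tasks read dictBc.value locally
val enriched = dataRdd.map(key => (key, dictBc.value.getOrElse(key, "")))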


On Tue, Jan 12, 2016 at 11:19 AM, Gene Pang  wrote:

> Hi Dmitry,
>
> Yes, Tachyon can help with your use case. You can read and write to
> Tachyon via the filesystem api (
> http://tachyon-project.org/documentation/File-System-API.html). There is
> a native Java API as well as a Hadoop-compatible API. Spark is also able to
> interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read
> input files from Tachyon and write output files to Tachyon.
>
> I hope that helps,
> Gene
>
> On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> I'd guess that if the resources are broadcast Spark would put them into
>> Tachyon...
>>
>> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg 
>> wrote:
>>
>> Would it make sense to load them into Tachyon and read and broadcast them
>> from there since Tachyon is already a part of the Spark stack?
>>
>> If so I wonder if I could do that Tachyon read/write via a Spark API?
>>
>>
>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>> One option could be to store them as blobs in a cache like Redis and then
>> read + broadcast them from the driver. Or you could store them in HDFS and
>> read + broadcast from the driver.
>>
>> Regards
>> Sab
>>
>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <
>> dgoldenberg...@gmail.com> wrote:
>>
>>> We have a bunch of Spark jobs deployed and a few large resource files
>>> such as e.g. a dictionary for lookups or a statistical model.
>>>
>>> Right now, these are deployed as part of the Spark jobs which will
>>> eventually make the mongo-jars too bloated for deployments.
>>>
>>> What are some of the best practices to consider for maintaining and
>>> sharing large resource files like these?
>>>
>>> Thanks.
>>>
>>
>>
>>
>> --
>>
>> Architect - Big Data
>> Ph: +91 99805 99458
>>
>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>> Sullivan India ICT)*
>> +++
>>
>>
>


Re: [Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Michael Armbrust
There can be data loss when you are using the DirectOutputCommitter and
speculation is turned on, so we disable it automatically.
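
For reference, a minimal sketch of the two settings involved, spark-shell
style (Spark 1.5.x as in the thread; spark.speculation itself is normally
set at submit time, e.g. --conf spark.speculation=false):

// The direct committer is only kept while speculation stays off
sqlContext.setConf(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")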

On Tue, Jan 12, 2016 at 1:11 PM, Jerry Lam  wrote:

> Hi spark users and developers,
>
> I wonder if the following observed behaviour is expected. I'm writing
> dataframe to parquet into s3. I'm using append mode when I'm writing to it.
> Since I'm using org.apache.spark.sql.
> parquet.DirectParquetOutputCommitter as
> the spark.sql.parquet.output.committer.class, I expected that no _temporary
> files will be generated.
>
> I appended the same dataframe twice to the same directory. The first
> "append" works as expected; no _temporary files are generated because of
> the DirectParquetOutputCommitter but the second "append" does generate
> _temporary files and then it moved the files under the _temporary to the
> output directory.
>
> Is this behavior expected? Or is it a bug?
>
> I'm using Spark 1.5.2.
>
> Best Regards,
>
> Jerry
>


How to change the no of cores assigned for a Submitted Job

2016-01-12 Thread Ashish Soni
Hi,

I see a strange behavior when I create a standalone Spark container using
docker.
Not sure why, but by default it assigns 4 cores to the first job submitted,
and then all the other jobs are in wait state. Please suggest if there is
a setting to change this.

I tried --executor-cores 1 but it has no effect.

[image: Inline image 1]
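
A hedged note, not from the thread: in standalone mode an application
claims every available core unless spark.cores.max is set, which is what
leaves later jobs waiting; --executor-cores only sizes individual
executors. A minimal sketch (the master URL is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://spark-master:7077")   // placeholder standalone master
  .setAppName("first-job")
  .set("spark.cores.max", "2")              // cap the total cores this app takes
val sc = new SparkContext(conf)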


[Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Jerry Lam
Hi spark users and developers,

I wonder if the following observed behaviour is expected. I'm writing
dataframe to parquet into s3. I'm using append mode when I'm writing to it.
Since I'm using org.apache.spark.sql.
parquet.DirectParquetOutputCommitter as
the spark.sql.parquet.output.committer.class, I expected that no _temporary
files will be generated.

I appended the same dataframe twice to the same directory. The first
"append" works as expected; no _temporary files are generated because of
the DirectParquetOutputCommitter but the second "append" does generate
_temporary files and then it moved the files under the _temporary to the
output directory.

Is this behavior expected? Or is it a bug?

I'm using Spark 1.5.2.

Best Regards,

Jerry


Re: How to view the RDD data based on Partition

2016-01-12 Thread Gokula Krishnan D
Hello Prem -

Thanks for sharing. I also found a similar example at the link
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#aggregate

But I am trying to understand the actual functionality or behavior.

Thanks & Regards,
Gokula Krishnan* (Gokul)*

On Tue, Jan 12, 2016 at 2:50 PM, Prem Sure  wrote:

>  try mapPartitionsWithIndex .. below is an example I used earlier. myfunc
> logic can be further modified as per your need.
> val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
> def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
>   iter.toList.map(x => index + "," + x).iterator
> }
> x.mapPartitionsWithIndex(myfunc).collect()
> res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9)
>
> On Tue, Jan 12, 2016 at 2:06 PM, Gokula Krishnan D 
> wrote:
>
>> Hello All -
>>
>> I'm just trying to understand aggregate() and in the meantime got an
>> question.
>>
>> *Is there any way to view the RDD databased on the partition ?.*
>>
>> For the instance, the following RDD has 2 partitions
>>
>> val multi2s = List(2,4,6,8,10,12,14,16,18,20)
>> val multi2s_RDD = sc.parallelize(multi2s,2)
>>
>> is there anyway to view the data based on the partitions (0,1).
>>
>>
>> Thanks & Regards,
>> Gokula Krishnan* (Gokul)*
>>
>
>


Re: How to view the RDD data based on Partition

2016-01-12 Thread Prem Sure
I explored these examples a couple of months back. It is a very good link
for RDD operations. See if the explanation below helps; try to understand
the difference between the 2 examples below. The initial value in both is
"" (the empty string).
Example 1;
val z = sc.parallelize(List("12","23","","345"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x
+ y)
res144: String = 11

Partition 1 ("12", "23")
("","12") => "0" . here  "" is the initial value
("0","23") => "1" -- here above 0 is used again for the next element with
in the same partition

Partition 2 ("","345")
("","") => "0"   -- resulting length is 0
("0","345") => "1"  -- zero is again used and min length becomes 1

Final merge:
("1","1") => "11"

Example 2:
val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x
+ y)
res143: String = 10

Partition 1 ("12", "23")
("","12") => "0" . here  "" is the initial value
("0","23") => "1" -- here above 0 is used again for the next element with
in the same partition

Partition 2 ("345","")
("","345") => "0"   -- resulting length is 0
("0","") => "0"  -- min length becomes zero again.

Final merge:
("1","0") => "10"

Hope this helps
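
On the original question itself, a minimal sketch using the two-partition
RDD from the question: glom() turns each partition into an array, so the
per-partition contents can be inspected directly.

val multi2s = List(2,4,6,8,10,12,14,16,18,20)
val multi2s_RDD = sc.parallelize(multi2s, 2)

// one array per partition
multi2s_RDD.glom().collect()
// res: Array[Array[Int]] = Array(Array(2, 4, 6, 8, 10), Array(12, 14, 16, 18, 20))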


On Tue, Jan 12, 2016 at 2:53 PM, Gokula Krishnan D 
wrote:

> Hello Prem -
>
> Thanks for sharing and I also found the similar example from the link
> http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#aggregate
>
>
> But trying the understand the actual functionality or behavior.
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>
> On Tue, Jan 12, 2016 at 2:50 PM, Prem Sure  wrote:
>
>>  try mapPartitionsWithIndex .. below is an example I used earlier. myfunc
>> logic can be further modified as per your need.
>> val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
>> def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
>>   iter.toList.map(x => index + "," + x).iterator
>> }
>> x.mapPartitionsWithIndex(myfunc).collect()
>> res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9)
>>
>> On Tue, Jan 12, 2016 at 2:06 PM, Gokula Krishnan D 
>> wrote:
>>
>>> Hello All -
>>>
>>> I'm just trying to understand aggregate() and in the meantime got an
>>> question.
>>>
>>> *Is there any way to view the RDD databased on the partition ?.*
>>>
>>> For the instance, the following RDD has 2 partitions
>>>
>>> val multi2s = List(2,4,6,8,10,12,14,16,18,20)
>>> val multi2s_RDD = sc.parallelize(multi2s,2)
>>>
>>> is there anyway to view the data based on the partitions (0,1).
>>>
>>>
>>> Thanks & Regards,
>>> Gokula Krishnan* (Gokul)*
>>>
>>
>>
>


Re: adding jars - hive on spark cdh 5.4.3

2016-01-12 Thread Ophir Etzion
Btw, this issue happens only with classes needed for the inputFormat. If
the input format is org.apache.hadoop.mapred.TextInputFormat and the SerDe
is from an additional jar, it works just fine.

I don't want to upgrade CDH for this. Also, if it should work on CDH 5.5,
why is that? What patch fixes it? (CDH 5.5 has the same Hive version as
CDH 5.4; is it Spark related and not Hive?)

On Sun, Jan 10, 2016 at 9:26 AM, sandeep vura  wrote:

> Upgrade to CDH 5.5 for spark. It should work
>
> On Sat, Jan 9, 2016 at 12:17 AM, Ophir Etzion 
> wrote:
>
>> It didn't work. assuming I did the right thing.
>> in the properties  you could see
>>
>> {"key":"hive.aux.jars.path","value":"file:///data/loko/foursquare.web-hiverc/current/hadoop-hive-serde.jar,file:///data/loko/foursquare.web-hiverc/current/hadoop-hive-udf.jar","isFinal":false,"resource":"programatically"}
>> which includes the jar that has the class I need but I still get
>>
>> org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
>> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
>>
>>
>>
>> On Fri, Jan 8, 2016 at 12:24 PM, Edward Capriolo 
>> wrote:
>>
>>> You can not 'add jar' input formats and serde's. They need to be part of
>>> your auxlib.
>>>
>>> On Fri, Jan 8, 2016 at 12:19 PM, Ophir Etzion 
>>> wrote:
>>>
 I tried now. still getting

 16/01/08 16:37:34 ERROR exec.Utilities: Failed to load plan: 
 hdfs://hadoop-alidoro-nn-vip/tmp/hive/hive/c2af9882-38a9-42b0-8d17-3f56708383e8/hive_2016-01-08_16-36-41_370_3307331506800215903-3/-mr-10004/3c90a796-47fc-4541-bbec-b196c40aefab/map.xml:
  org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
 class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
 Serialization trace:
 inputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
 aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
 org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
 class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat


 HiveThriftSequenceFileInputFormat is in one of the jars I'm trying to add.


 On Thu, Jan 7, 2016 at 9:58 PM, Prem Sure 
 wrote:

> did you try -- jars property in spark submit? if your jar is of huge
> size, you can pre-load the jar on all executors in a common available
> directory to avoid network IO.
>
> On Thu, Jan 7, 2016 at 4:03 PM, Ophir Etzion 
> wrote:
>
>> I' trying to add jars before running a query using hive on spark on
>> cdh 5.4.3.
>> I've tried applying the patch in
>> https://issues.apache.org/jira/browse/HIVE-12045 (manually as the
>> patch is done on a different hive version) but still hasn't succeeded.
>>
>> did anyone manage to do ADD JAR successfully with CDH?
>>
>> Thanks,
>> Ophir
>>
>
>

>>>
>>
>


RE: Spark on Apache Ingnite?

2016-01-12 Thread Boavida, Rodrigo
I also had a quick look and agree it's not very clear. I believe that reading
through the clustering logic and the replication settings gives a good idea
of how it works.
https://apacheignite.readme.io/docs/cluster
I believe it integrates with Hadoop and other file-based systems for persisting
when needed. I'm not sure about the details of how it recovers.
Also, a resource manager such as Mesos can add recoverability, at least for
scenarios where there isn't any state to recover.

Resilience is a feature and not every use case needs it. For example, I'm
currently considering Ignite for caching transient data, where we need to
share RDDs between different Spark contexts: one context produces the data
and the other consumes it.

From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: 11 January 2016 16:08
To: Boavida, Rodrigo 
Cc: user@spark.apache.org
Subject: Re: Spark on Apache Ingnite?

where is ignite's resilience/fault-tolerance design documented?
i can not find it. i would generally stay away from it if fault-tolerance is an 
afterthought.

On Mon, Jan 11, 2016 at 10:31 AM, RodrigoB 
> wrote:
Although I haven't worked explicitly with either, they do seem to differ in
design and consequently in usage scenarios.

Ignite is claimed to be a pure in-memory distributed database.
With Ignite, updating existing keys is self-managed, compared with Tachyon.
In Tachyon, once a value is created for a given key, it becomes immutable,
so you either delete and insert again, or need to manage/update the Tachyon
keys yourself.
Also, Tachyon's resilience design is based on the underlying file system
(typically Hadoop), which means that if a node goes down, the lost data can
only be recovered if it had first been persisted to the corresponding file
partition.
With Ignite there is no master dependency, while in Tachyon my understanding
is that API calls depend on the master's availability. I believe Ignite has
some options for replication, which would be more aligned with an in-memory
datastore.

If you are looking to persist some RDD output into an in-memory store and
query it outside of Spark, on paper Ignite sounds like a better solution.

Since you asked about Ignite's benefits, that was the focus of my response.
Tachyon has its own benefits, like the community support and the Spark
lineage persistence integration. If you are doing batch-based processing
and want to persist Spark RDDs fast, Tachyon is your friend.

Hope this helps.

Tnks,
Rod



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Apache-Ingnite-tp25884p25933.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org

This email (including any attachments) is proprietary to Aspect Software, Inc. 
and may contain information that is confidential. If you have received this 
message in error, please do not read, copy or forward this message. Please 
notify the sender immediately, delete it from your system and destroy any 
copies. You may not further disclose or distribute this email or its 
attachments.


ROSE: Spark + R on the JVM.

2016-01-12 Thread David
Hi all,

I'd like to share news of the recent release of a new Spark package, 
[ROSE](http://spark-packages.org/package/onetapbeyond/opencpu-spark-executor).

ROSE is a Scala library offering access to the full scientific computing power 
of the R programming language to Apache Spark batch and streaming applications 
on the JVM. Where Apache SparkR lets data scientists use Spark from R, ROSE is 
designed to let Scala and Java developers use R from Spark.

The project is available and documented [on 
GitHub](https://github.com/onetapbeyond/opencpu-spark-executor) and I would 
encourage you to [take a 
look](https://github.com/onetapbeyond/opencpu-spark-executor). Any feedback, 
questions etc very welcome.

David

"All that is gold does not glitter, Not all those who wander are lost."

Re: sparkR ORC support.

2016-01-12 Thread Sandeep Khurana
I call stop from the console because RStudio warns and advises it. And yes,
after stop was called, the whole script was run again in full. That means
the init "hivecontext <- sparkRHive.init(sc)" is always called after stop.

On Tue, Jan 12, 2016 at 8:31 PM, Felix Cheung 
wrote:

> As you can see from my reply below from Jan 6, calling sparkR.stop()
> invalidates both sc and hivecontext you have and results in this invalid
> jobj error.
>
> If you start R and run this, it should work:
>
> Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
>
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
>
> sc <- sparkR.init()
> hivecontext <- sparkRHive.init(sc)
> df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")
>
>
> Is there a reason you want to call stop? If you do, you would need to call
> the line hivecontext <- sparkRHive.init(sc) again.
>
>
> _
> From: Sandeep Khurana 
> Sent: Tuesday, January 12, 2016 5:20 AM
> Subject: Re: sparkR ORC support.
> To: Felix Cheung 
> Cc: spark users , Prem Sure ,
> Deepak Sharma , Yanbo Liang 
>
>
> It worked for sometime. Then I did  sparkR.stop() an re-ran again to get
> the same error. Any idea why it ran fine before ( while running fine it
> kept giving warning reusing existing spark-context and that I should
> restart) ? There is one more R code which instantiated spark , I ran that
> too again.
>
>
> On Tue, Jan 12, 2016 at 3:05 PM, Sandeep Khurana 
> wrote:
>
>> Complete stacktrace is. Can it be something wih java versions?
>>
>>
>> stop("invalid jobj ", value$id)
>> 8
>> writeJobj(con, object)
>> 7
>> writeObject(con, a)
>> 6
>> writeArgs(rc, args)
>> 5
>> invokeJava(isStatic = TRUE, className, methodName, ...)
>> 4
>> callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
>> source, options)
>> 3
>> read.df(sqlContext, path, source, schema, ...)
>> 2
>> loadDF(hivecontext, filepath, "orc")
>>
>> On Tue, Jan 12, 2016 at 2:41 PM, Sandeep Khurana 
>> wrote:
>>
>>> Running this gave
>>>
>>> 16/01/12 04:06:54 INFO BlockManagerMaster: Registered BlockManagerError in 
>>> writeJobj(con, object) : invalid jobj 3
>>>
>>>
>>> How does it know which hive schema to connect to?
>>>
>>>
>>>
>>> On Tue, Jan 12, 2016 at 2:34 PM, Felix Cheung >> > wrote:
>>>
 It looks like you have overwritten sc. Could you try this:


 Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")

 .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"),
 .libPaths()))
 library(SparkR)

 sc <- sparkR.init()
 hivecontext <- sparkRHive.init(sc)
 df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")



 --
 Date: Tue, 12 Jan 2016 14:28:58 +0530
 Subject: Re: sparkR ORC support.
 From: sand...@infoworks.io
 To: felixcheun...@hotmail.com
 CC: yblia...@gmail.com; user@spark.apache.org; premsure...@gmail.com;
 deepakmc...@gmail.com


 The code is very simple, pasted below .
 hive-site.xml is in spark conf already. I still see this error

 Error in writeJobj(con, object) : invalid jobj 3

 after running the script  below


 script
 ===
 Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")


 .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"),
 .libPaths()))
 library(SparkR)

 sc <<- sparkR.init()
 sc <<- sparkRHive.init()
 hivecontext <<- sparkRHive.init(sc)
 df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")
 #View(df)


 On Wed, Jan 6, 2016 at 11:08 PM, Felix Cheung <
 felixcheun...@hotmail.com> wrote:

 Yes, as Yanbo suggested, it looks like there is something wrong with
 the sqlContext.

 Could you forward us your code please?





 On Wed, Jan 6, 2016 at 5:52 AM -0800, "Yanbo Liang"  wrote:

 You should ensure your sqlContext is HiveContext.

 sc <- sparkR.init()

 sqlContext <- sparkRHive.init(sc)


 2016-01-06 20:35 GMT+08:00 Sandeep Khurana :

 Felix

 I tried the option suggested by you.  It gave below error.  I am going
 to try the option suggested by Prem .

 Error in writeJobj(con, object) : invalid jobj 1
 8
 stop("invalid jobj ", value$id)
 7
 writeJobj(con, object)
 6
 writeObject(con, a)
 5
 writeArgs(rc, args)
 4
 invokeJava(isStatic = TRUE, className, methodName, ...)
 3
 callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF",
 sqlContext, source, options)
 2
 read.df(sqlContext, filepath, "orc") 

Re: sparkR ORC support.

2016-01-12 Thread Felix Cheung
As you can see from my reply below from Jan 6, calling sparkR.stop()
invalidates both sc and hivecontext you have and results in this invalid
jobj error.

If you start R and run this, it should work:

Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

sc <- sparkR.init()
hivecontext <- sparkRHive.init(sc)
df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")

Is there a reason you want to call stop? If you do, you would need to call
the line hivecontext <- sparkRHive.init(sc) again.


_
From: Sandeep Khurana 
Sent: Tuesday, January 12, 2016 5:20 AM
Subject: Re: sparkR ORC support.
To: Felix Cheung 
Cc: spark users , Prem Sure ,
Deepak Sharma , Yanbo Liang 

It worked for sometime. Then I did sparkR.stop() an re-ran again to get
the same error. Any idea why it ran fine before (while running fine it
kept giving warning reusing existing spark-context and that I should
restart)? There is one more R code which instantiated spark, I ran that
too again.

On Tue, Jan 12, 2016 at 3:05 PM, Sandeep Khurana wrote:

Complete stacktrace is. Can it be something wih java versions?

stop("invalid jobj ", value$id)
8
writeJobj(con, object)
7
writeObject(con, a)
6
writeArgs(rc, args)
5
invokeJava(isStatic = TRUE, className, methodName, ...)
4
callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
source, options)
3
read.df(sqlContext, path, source, schema, ...)
2
loadDF(hivecontext, filepath, "orc")

On Tue, Jan 12, 2016 at 2:41 PM, Sandeep Khurana wrote:

Running this gave

16/01/12 04:06:54 INFO BlockManagerMaster: Registered BlockManager
Error in writeJobj(con, object) : invalid jobj 3

How does it know which hive schema to connect to?

On Tue, Jan 12, 2016 at 2:34 PM, Felix Cheung wrote:

It looks like you have overwritten sc. Could you try this:

Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

sc <- sparkR.init()
hivecontext <- sparkRHive.init(sc)
df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")

Date: Tue, 12 Jan 2016 14:28:58 +0530
Subject: Re: sparkR ORC support.
From: sand...@infoworks.io
To: felixcheun...@hotmail.com
CC: yblia...@gmail.com;
user@spark.apache.org; 

Re: ibsnappyjava.so: failed to map segment from shared object

2016-01-12 Thread Mikey T.
Thanks!  Setting java.io.tmpdir did the trick.  Sadly, I still ran into an
issue with the amount of RAM pyspark was grabbing.  In fact I got a message
from my web provider warning that I was exceeding the memory limit for my
(entry level) account.  So I won't be pursuing it further.  Oh well, it was
still good to get past that first issue.  Pyspark runs fine on my laptop so
for now I'll just have to use it locally.

- Mike

On Mon, Jan 11, 2016 at 7:20 PM, Josh Rosen 
wrote:

> This is due to the snappy-java library; I think that you'll have to
> configure either java.io.tmpdir or org.xerial.snappy.tempdir; see
> https://github.com/xerial/snappy-java/blob/1198363176ad671d933fdaf0938b8b9e609c0d8a/src/main/java/org/xerial/snappy/SnappyLoader.java#L335
>
>
>
> On Mon, Jan 11, 2016 at 7:12 PM, yatinla  wrote:
>
>> I'm trying to get pyspark running on a shared web host.  I can get into
>> the
>> pyspark shell but whenever I run a simple command like
>> sc.parallelize([1,2,3,4]).sum() I get an error that seems to stem from
>> some
>> kind of permission issue with libsnappyjava.so:
>>
>> Caused by: java.lang.UnsatisfiedLinkError:
>> /tmp/snappy-1.1.2-b7abadd6-9b05-4dee-885a-c80434db68e2-libsnappyjava.so:
>> /tmp/snappy-1.1.2-b7abadd6-9b05-4dee-885a-c80434db68e2-libsnappyjava.so:
>> failed to map segment from shared object: Operation not permitted
>>
>> I'm no Linux expert but I suspect it has something to do with noexec maybe
>> on the /tmp folder?  So I tried setting the TMP, TEMP, and TMPDIR
>> environment variables to a tmp folder in my own home directory but I get
>> the
>> same error and it still says /tmp/snappy... not the folder in my my home
>> directory.  So then I also tried, in pyspark using SparkConf, setting the
>> spark.local.dir property to my personal tmp folder, and same for the
>> spark.externalBlockStore.baseDir.  But no matter what, it seems like the
>> error happens and always refers to /tmp not my personal folder.
>>
>> Any help would be greatly appreciated.  It all works great on my laptop,
>> just not on the web host which is a shared linux hosting plan so it
>> doesn't
>> seem surprising that there would be permission issues with /tmp.
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/ibsnappyjava-so-failed-to-map-segment-from-shared-object-tp25937.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Too many tasks killed the scheduler

2016-01-12 Thread Daniel Siegmann
As I understand it, your initial number of partitions will always depend on
the initial data. I'm not aware of any way to change this, other than
changing the configuration of the underlying data store.

Have you tried reading the data in several data frames (e.g. one data frame
per day), coalescing each data frame, and *then* unioning them? You could
try with and without a shuffle. Not sure if it'll work, but might be worth
a shot.
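
A minimal sketch of that suggestion (paths, dates and the per-day
partition count are placeholders, not from the thread):

val days = Seq("2016-01-01", "2016-01-02", "2016-01-03")
val perDay = days.map { day =>
  // one DataFrame per day, shrunk before the union
  // (use .repartition(100) instead of .coalesce(100) to force a shuffle)
  sqlContext.read.parquet(s"hdfs:///events/day=$day").coalesce(100)
}
val all = perDay.reduce(_ unionAll _)   // DataFrame.unionAll in Spark 1.x
println(all.count())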

On Mon, Jan 11, 2016 at 8:39 PM, Gavin Yue  wrote:

> Thank you for the suggestion.
>
> I tried df.coalesce(1000).write.parquet() and yes, the parquet file
> number drops to 1000, but the partition count of the parquet is still like 5000+.
> When I read the parquet and do a count, it still has the 5000+ tasks.
>
> So I guess I need to do a repartition here to drop the task number?  But
> repartition never works for me; it always fails due to out of memory.
>
> And regarding the large-number-of-tasks delay problem, I found a similar
> problem: https://issues.apache.org/jira/browse/SPARK-7447.
>
> I am unionAll-ing like 10 parquet folders, with 70K+ parquet files in total,
> generating 70k+ tasks. It took around 5-8 mins before all tasks start, just
> like the ticket above.
>
> It also happens if I do a partition discovery with a base path. Is there
> any schema inference or checking going on which causes the slowness?
>
> Thanks,
> Gavin
>
>
>
> On Mon, Jan 11, 2016 at 1:21 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Could you use "coalesce" to reduce the number of partitions?
>>
>>
>> Shixiong Zhu
>>
>>
>> On Mon, Jan 11, 2016 at 12:21 AM, Gavin Yue 
>> wrote:
>>
>>> Here is more info.
>>>
>>> The job stuck at:
>>> INFO cluster.YarnScheduler: Adding task set 1.0 with 79212 tasks
>>>
>>> Then got the error:
>>> Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out
>>> after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>>>
>>> So I increased spark.network.timeout from 120s to 600s.  It sometimes
>>> works.
>>>
>>> Each task is a parquet file.  I could not repartition due to out of GC
>>> problem.
>>>
>>> Is there any way I could to improve the performance?
>>>
>>> Thanks,
>>> Gavin
>>>
>>>
>>> On Sun, Jan 10, 2016 at 1:51 AM, Gavin Yue 
>>> wrote:
>>>
 Hey,

 I have 10 days data, each day has a parquet directory with over 7000
 partitions.
 So when I union 10 days and do a count, then it submits over 70K tasks.

 Then the job failed silently with one container exit with code 1.  The
 union with like 5, 6 days data is fine.
 In the spark-shell, it just hang showing: Yarn scheduler submit 7+
 tasks.

 I am running spark 1.6 over hadoop 2.7.  Is there any setting I could
 change to make this work?

 Thanks,
 Gavin



>>>
>>
>


Re: rdd join very slow when rdd created from data frame

2016-01-12 Thread Koert Kuipers
It's Spark 1.5.1.
The dataframe simply has 2 columns, both strings.

A SQL query would probably be more efficient, but doesn't fit our purpose
(we are doing a lot more stuff where we need RDDs).

Also, I am just trying to understand in general what in that RDD coming from
a dataframe could slow things down from 1 min to overnight...

On Tue, Jan 12, 2016 at 5:29 PM, Kevin Mellott 
wrote:

> Can you please provide the high-level schema of the entities that you are
> attempting to join? I think that you may be able to use a more efficient
> technique to join these together; perhaps by registering the Dataframes as
> temp tables and constructing a Spark SQL query.
>
> Also, which version of Spark are you using?
>
> On Tue, Jan 12, 2016 at 4:16 PM, Koert Kuipers  wrote:
>
>> we are having a join of 2 rdds thats fast (< 1 min), and suddenly it
>> wouldn't even finish overnight anymore. the change was that the rdd was now
>> derived from a dataframe.
>>
>> so the new code that runs forever is something like this:
>> dataframe.rdd.map(row => (Row(row(0)), row)).join(...)
>>
>> any idea why?
>> i imagined it had something to do with recomputing parts of the data
>> frame, but even a small change like this makes the issue go away:
>> dataframe.rdd.map(row => Row.fromSeq(row.toSeq)).map(row => (Row(row(0)),
>> row)).join(...)
>>
>
>


Re: rdd join very slow when rdd created from data frame

2016-01-12 Thread Kevin Mellott
Can you please provide the high-level schema of the entities that you are
attempting to join? I think that you may be able to use a more efficient
technique to join these together; perhaps by registering the Dataframes as
temp tables and constructing a Spark SQL query.

Also, which version of Spark are you using?

On Tue, Jan 12, 2016 at 4:16 PM, Koert Kuipers  wrote:

> we are having a join of 2 rdds thats fast (< 1 min), and suddenly it
> wouldn't even finish overnight anymore. the change was that the rdd was now
> derived from a dataframe.
>
> so the new code that runs forever is something like this:
> dataframe.rdd.map(row => (Row(row(0)), row)).join(...)
>
> any idea why?
> i imagined it had something to do with recomputing parts of the data
> frame, but even a small change like this makes the issue go away:
> dataframe.rdd.map(row => Row.fromSeq(row.toSeq)).map(row => (Row(row(0)),
> row)).join(...)
>


Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-12 Thread Cheng Lian

I see. So there are actually 3000 tasks instead of 3000 jobs right?

Would you mind providing the full stack trace of the GC issue? At first
I thought it was identical to the _metadata one in the mail thread you
mentioned.


Cheng

On 1/11/16 5:30 PM, Gavin Yue wrote:
Here is how I set the conf:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

This actually works; I do not see the _metadata file anymore.

I think I made a mistake.  The 3000 jobs are coming from
repartition("id").

I have 7600 json files and want to save them as parquet.

So if I use df.write.parquet(path), it generates 7600 parquet
files with 7600 partitions, which has no problem.

But if I use repartition to change the partition number, say
df.repartition(3000).write.parquet,

this generates 7600 + 3000 tasks.  The 3000 tasks always fail due to the
GC problem.


Best,
Gavin



On Mon, Jan 11, 2016 at 4:31 PM, Cheng Lian > wrote:


Hey Gavin,

Could you please provide a snippet of your code to show how
you disabled "parquet.enable.summary-metadata" and wrote the
files? Especially, you mentioned you saw "3000 jobs" failed. Were
you writing each Parquet file with an individual job? (Usually
people use write.partitionBy(...).parquet(...) to write multiple
Parquet files.)

Cheng


On 1/10/16 10:12 PM, Gavin Yue wrote:

Hey,

I am trying to convert a bunch of json files into parquet,
which would output over 7000 parquet files. But tthere are too
many files, so I want to repartition based on id to 3000.

But I got the error of GC problem like this one:

https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives

So I set  parquet.enable.summary-metadata to false. But when I
write.parquet, I could still see the 3000 jobs run after the
writing parquet and they failed due to GC.

Basically repartition never succeeded for me. Is there any
other settings which could be optimized?

Thanks,
Gavin







1.6.0: Standalone application: Getting ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

2016-01-12 Thread Egor Pahomov
Hi, I'm moving my infrastructure from 1.5.2 to 1.6.0 and experiencing a
serious issue. I successfully updated the Spark thrift server from 1.5.2 to
1.6.0. But I have a standalone application which worked fine with 1.5.2 but
fails on 1.6.0 with:

*NestedThrowables:*
*java.lang.ClassNotFoundException:
org.datanucleus.api.jdo.JDOPersistenceManagerFactory*
* at
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)*
* at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)*
* at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)*

Inside this application I work with a Hive table, which has data in JSON
format.

When I add


<dependency>
  <groupId>org.datanucleus</groupId>
  <artifactId>datanucleus-core</artifactId>
  <version>4.0.0-release</version>
</dependency>

<dependency>
  <groupId>org.datanucleus</groupId>
  <artifactId>datanucleus-api-jdo</artifactId>
  <version>4.0.0-release</version>
</dependency>

<dependency>
  <groupId>org.datanucleus</groupId>
  <artifactId>datanucleus-rdbms</artifactId>
  <version>3.2.9</version>
</dependency>


I'm getting:

*Caused by: org.datanucleus.exceptions.NucleusUserException: Persistence
process has been specified to use a ClassLoaderResolver of name
"datanucleus" yet this has not been found by the DataNucleus plugin
mechanism. Please check your CLASSPATH and plugin specification.*
* at
org.datanucleus.AbstractNucleusContext.(AbstractNucleusContext.java:102)*
* at
org.datanucleus.PersistenceNucleusContextImpl.(PersistenceNucleusContextImpl.java:162)*

I have CDH 5.5. I build spark with

*./make-distribution.sh -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.5.0
-Phive -DskipTests*

Then I publish the fat jar locally:

*mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file
-Dfile=./spark-assembly.jar -DgroupId=org.spark-project
-DartifactId=my-spark-assembly -Dversion=1.6.0-SNAPSHOT -Dpackaging=jar*

Then I include a dependency on this fat jar:

<dependency>
  <groupId>org.spark-project</groupId>
  <artifactId>my-spark-assembly</artifactId>
  <version>1.6.0-SNAPSHOT</version>
</dependency>


Then I build my application with the assembly plugin:


org.apache.maven.plugins
maven-shade-plugin



*:*




*:*

META-INF/*.SF
META-INF/*.DSA
META-INF/*.RSA






package

shade






META-INF/services/org.apache.hadoop.fs.FileSystem


reference.conf


log4j.properties









The configuration of the assembly plugin is copy-pasted from the Spark
assembly pom.

This workflow worked for 1.5.2 and broke for 1.6.0. If my approach to
creating this standalone application is not good, please recommend
another approach, but spark-submit does not work for me - it is hard for me
to connect it to Oozie.

Any suggestion would be appreciated - I'm stuck.

My spark config:

lazy val sparkConf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName(appName)
  .set("spark.yarn.queue", "jenkins")
  .set("spark.executor.memory", "10g")
  .set("spark.yarn.executor.memoryOverhead", "2000")
  .set("spark.executor.cores", "3")
  .set("spark.driver.memory", "4g")
  .set("spark.shuffle.io.numConnectionsPerPeer", "5")
  .set("spark.sql.autoBroadcastJoinThreshold", "200483647")
  .set("spark.network.timeout", "1000s")
  .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=2g")
  .set("spark.driver.maxResultSize", "2g")
  .set("spark.rpc.lookupTimeout", "1000s")
  .set("spark.sql.hive.convertMetastoreParquet", "false")
  .set("spark.kryoserializer.buffer.max", "200m")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.yarn.driver.memoryOverhead", "1000")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.sql.tungsten.enabled", "false")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "100s")
.setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath()))

-- 



*Sincerely yoursEgor Pakhomov*


Re: [Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Jerry Lam
Hi Michael,

Thanks for the hint! So if I turn off speculation, consecutive appends like
above will not produce temporary files right?
Which class is responsible for disabling the use of DirectOutputCommitter?

Thank you,

Jerry


On Tue, Jan 12, 2016 at 4:12 PM, Michael Armbrust 
wrote:

> There can be dataloss when you are using the DirectOutputCommitter and
> speculation is turned on, so we disable it automatically.
>
> On Tue, Jan 12, 2016 at 1:11 PM, Jerry Lam  wrote:
>
>> Hi spark users and developers,
>>
>> I wonder if the following observed behaviour is expected. I'm writing
>> dataframe to parquet into s3. I'm using append mode when I'm writing to it.
>> Since I'm using org.apache.spark.sql.
>> parquet.DirectParquetOutputCommitter as
>> the spark.sql.parquet.output.committer.class, I expected that no _temporary
>> files will be generated.
>>
>> I appended the same dataframe twice to the same directory. The first
>> "append" works as expected; no _temporary files are generated because of
>> the DirectParquetOutputCommitter but the second "append" does generate
>> _temporary files and then it moved the files under the _temporary to the
>> output directory.
>>
>> Is this behavior expected? Or is it a bug?
>>
>> I'm using Spark 1.5.2.
>>
>> Best Regards,
>>
>> Jerry
>>
>
>


Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2016-01-12 Thread Cheolsoo Park
Alex, see this jira-
https://issues.apache.org/jira/browse/SPARK-9926

On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:

> Ran into this need myself. Does Spark have an equivalent of  "mapreduce.
> input.fileinputformat.list-status.num-threads"?
>
> Thanks.
>
> On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park 
> wrote:
>
>> Hi,
>>
>> I am wondering if anyone has successfully enabled
>> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
>> usually set this property to 25 to speed up file listing in MR jobs (Hive
>> and Pig). But for some reason, this property does not take effect in Spark
>> HadoopRDD resulting in serious delay in file listing.
>>
>> I verified that the property is indeed set in HadoopRDD by logging the
>> value of the property in the getPartitions() function. I also tried to
>> attach VisualVM to Spark and Pig clients, which look as follows-
>>
>> In Pig, I can see 25 threads running in parallel for file listing-
>> [image: Inline image 1]
>>
>> In Spark, I only see 2 threads running in parallel for file listing-
>> [image: Inline image 2]
>>
>> What's strange is that the # of concurrent threads in Spark is throttled
>> no matter how high I
>> set "mapreduce.input.fileinputformat.list-status.num-threads".
>>
>> Is anyone using Spark with this property enabled? If so, can you please
>> share how you do it?
>>
>> Thanks!
>> Cheolsoo
>>
>
>


Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread Richard Siebeling
Hi,

this looks great and seems to be very usable.
Would it be possible to access the session API from within ROSE, to get for
example the images that are generated by R / openCPU and the logging to
stdout that is logged by R?

thanks in advance,
Richard

On Tue, Jan 12, 2016 at 10:16 PM, Vijay Kiran  wrote:

> I think it would be this:
> https://github.com/onetapbeyond/opencpu-spark-executor
>
> > On 12 Jan 2016, at 18:32, Corey Nolet  wrote:
> >
> > David,
> >
> > Thank you very much for announcing this! It looks like it could be very
> useful. Would you mind providing a link to the github?
> >
> > On Tue, Jan 12, 2016 at 10:03 AM, David 
> wrote:
> > Hi all,
> >
> > I'd like to share news of the recent release of a new Spark package,
> ROSE.
> >
> > ROSE is a Scala library offering access to the full scientific computing
> power of the R programming language to Apache Spark batch and streaming
> applications on the JVM. Where Apache SparkR lets data scientists use Spark
> from R, ROSE is designed to let Scala and Java developers use R from Spark.
> >
> > The project is available and documented on GitHub and I would encourage
> you to take a look. Any feedback, questions etc very welcome.
> >
> > David
> >
> > "All that is gold does not glitter, Not all those who wander are lost."
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Read HDFS file from an executor(closure)

2016-01-12 Thread Udit Mehta
Hi,

Is there a way to read a text file from inside a Spark executor? I need to
do this for a streaming application where we need to read a file (whose
contents would change) from a closure.

I cannot use the "sc.textFile" method since spark context is not
serializable. I also cannot read a file using the Hadoop Api since the
"FileSystem" class is not serializable as well.

Does anyone have any idea on how I can go about this?

Thanks,
Udit
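
One common pattern, offered here only as a hedged sketch (it is not from
the thread, and the path and input RDD are placeholders): build the Hadoop
FileSystem handle inside the task itself, so nothing non-serializable is
captured by the closure.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// someRdd is a placeholder RDD[String]
val results = someRdd.mapPartitions { iter =>
  val fs = FileSystem.get(new Configuration())        // created on the executor
  val in = fs.open(new Path("/lookup/current.txt"))   // placeholder HDFS path
  val lookup = Source.fromInputStream(in).getLines().toSet
  in.close()
  iter.filter(lookup.contains)
}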


FPGrowth does not handle large result sets

2016-01-12 Thread Ritu Raj Tiwari
Folks:

We are running into a problem where FPGrowth seems to choke on data sets
that we think are not too large. We have about 200,000 transactions. Each
transaction is composed of, on average, 50 items. There are about 17,000
unique items (SKUs) that might show up in any transaction.

When running locally with 12G RAM given to the PySpark process, the FPGrowth
code fails with an out of memory error for a minSupport of 0.001. The failure
occurs when we try to enumerate and save the frequent itemsets. Looking at the
FPGrowth code
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala),
it seems this is because the genFreqItems() method tries to collect() all
items. Is there a way the code could be rewritten so it does not try to collect
and therefore store all frequent item sets in memory?

Thanks for any insights.

-Raj
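
Not a change to FPGrowth itself, but a minimal sketch (Scala API; the input
path, output path and separators are placeholders) of keeping the result
set off the driver: the model's freqItemsets is an RDD, so it can be
transformed and saved without a collect().

import org.apache.spark.mllib.fpm.FPGrowth

// one space-separated SKU list per line
val transactions = sc.textFile("hdfs:///input/transactions").map(_.split(" "))

val model = new FPGrowth()
  .setMinSupport(0.001)
  .setNumPartitions(100)
  .run(transactions)

// written out directly from the executors, never collected on the driver
model.freqItemsets
  .map(is => is.items.mkString("[", ",", "]") + "\t" + is.freq)
  .saveAsTextFile("hdfs:///output/freq-itemsets")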

Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2016-01-12 Thread Alex Nastetsky
Thanks. I was actually able to get mapreduce.input.
fileinputformat.list-status.num-threads working in Spark against a regular
fileset in S3, in Spark 1.5.2 ... looks like the issue is isolated to Hive.
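
For reference, a minimal sketch of setting it on the Hadoop configuration
used by the job (the path is a placeholder; 25 mirrors the value mentioned
earlier in the thread):

// Raise the number of threads used for input file listing
sc.hadoopConfiguration.setInt(
  "mapreduce.input.fileinputformat.list-status.num-threads", 25)
val data = sc.textFile("s3n://my-bucket/some/path/")
println(data.count())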

On Tue, Jan 12, 2016 at 6:48 PM, Cheolsoo Park  wrote:

> Alex, see this jira-
> https://issues.apache.org/jira/browse/SPARK-9926
>
> On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <
> alex.nastet...@vervemobile.com> wrote:
>
>> Ran into this need myself. Does Spark have an equivalent of  "mapreduce.
>> input.fileinputformat.list-status.num-threads"?
>>
>> Thanks.
>>
>> On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park 
>> wrote:
>>
>>> Hi,
>>>
>>> I am wondering if anyone has successfully enabled
>>> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
>>> usually set this property to 25 to speed up file listing in MR jobs (Hive
>>> and Pig). But for some reason, this property does not take effect in Spark
>>> HadoopRDD resulting in serious delay in file listing.
>>>
>>> I verified that the property is indeed set in HadoopRDD by logging the
>>> value of the property in the getPartitions() function. I also tried to
>>> attach VisualVM to Spark and Pig clients, which look as follows-
>>>
>>> In Pig, I can see 25 threads running in parallel for file listing-
>>> [image: Inline image 1]
>>>
>>> In Spark, I only see 2 threads running in parallel for file listing-
>>> [image: Inline image 2]
>>>
>>> What's strange is that the # of concurrent threads in Spark is throttled
>>> no matter how high I
>>> set "mapreduce.input.fileinputformat.list-status.num-threads".
>>>
>>> Is anyone using Spark with this property enabled? If so, can you please
>>> share how you do it?
>>>
>>> Thanks!
>>> Cheolsoo
>>>
>>
>>
>


Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Gene Pang
Hi Dmitry,

Yes, Tachyon can help with your use case. You can read and write to Tachyon
via the filesystem api (
http://tachyon-project.org/documentation/File-System-API.html). There is a
native Java API as well as a Hadoop-compatible API. Spark is also able to
interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read
input files from Tachyon and write output files to Tachyon.

I hope that helps,
Gene

On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg  wrote:

> I'd guess that if the resources are broadcast Spark would put them into
> Tachyon...
>
> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg 
> wrote:
>
> Would it make sense to load them into Tachyon and read and broadcast them
> from there since Tachyon is already a part of the Spark stack?
>
> If so I wonder if I could do that Tachyon read/write via a Spark API?
>
>
> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
> One option could be to store them as blobs in a cache like Redis and then
> read + broadcast them from the driver. Or you could store them in HDFS and
> read + broadcast from the driver.
>
> Regards
> Sab
>
> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> We have a bunch of Spark jobs deployed and a few large resource files
>> such as e.g. a dictionary for lookups or a statistical model.
>>
>> Right now, these are deployed as part of the Spark jobs which will
>> eventually make the mongo-jars too bloated for deployments.
>>
>> What are some of the best practices to consider for maintaining and
>> sharing large resource files like these?
>>
>> Thanks.
>>
>
>
>
> --
>
> Architect - Big Data
> Ph: +91 99805 99458
>
> Manthan Systems | *Company of the year - Analytics (2014 Frost and
> Sullivan India ICT)*
> +++
>
>


ROSE: Spark + R on the JVM, now available.

2016-01-12 Thread David Russell
Hi all,

I'd like to share news of the recent release of a new Spark package, 
[ROSE](http://spark-packages.org/package/onetapbeyond/opencpu-spark-executor).

ROSE is a Scala library offering access to the full scientific computing power 
of the R programming language to Apache Spark batch and streaming applications 
on the JVM. Where Apache SparkR lets data scientists use Spark from R, ROSE is 
designed to let Scala and Java developers use R from Spark.

The project is available and documented [on 
GitHub](https://github.com/onetapbeyond/opencpu-spark-executor) and I would 
encourage you to [take a 
look](https://github.com/onetapbeyond/opencpu-spark-executor). Any feedback, 
questions etc very welcome.

David

"All that is gold does not glitter, Not all those who wander are lost."

How to optimiz and make this code faster using coalesce(1) and mapPartitionIndex

2016-01-12 Thread unk1102
Hi, I have the following code which I run as part of a thread that becomes a
child job of my main Spark job. It takes hours to run for large data (around
1-2GB) because of the coalesce(1); if the data is in MB/KB it finishes faster,
and with larger data set sizes it sometimes does not complete at all. Please
advise on what I am doing wrong. Thanks in advance.

JavaRDD<Row> maksedRDD =
    sourceRdd.coalesce(1, true).mapPartitionsWithIndex(
        new Function2<Integer, Iterator<Row>, Iterator<Row>>() {
            @Override
            public Iterator<Row> call(Integer ind, Iterator<Row> rowIterator) throws Exception {
                List<Row> rowList = new ArrayList<>();

                while (rowIterator.hasNext()) {
                    Row row = rowIterator.next();
                    List<Object> rowAsList =
                        updateRowsMethod(JavaConversions.seqAsJavaList(row.toSeq()));
                    Row updatedRow = RowFactory.create(rowAsList.toArray());
                    rowList.add(updatedRow);
                }
                return rowList.iterator();
            }
        }, true);



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-optimiz-and-make-this-code-faster-using-coalesce-1-and-mapPartitionIndex-tp25947.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread Corey Nolet
David,

Thank you very much for announcing this! It looks like it could be very
useful. Would you mind providing a link to the github?

On Tue, Jan 12, 2016 at 10:03 AM, David 
wrote:

> Hi all,
>
> I'd like to share news of the recent release of a new Spark package, ROSE.
>
>
> ROSE is a Scala library offering access to the full scientific computing
> power of the R programming language to Apache Spark batch and streaming
> applications on the JVM. Where Apache SparkR lets data scientists use Spark
> from R, ROSE is designed to let Scala and Java developers use R from Spark.
>
> The project is available and documented on GitHub and I would encourage
> you to take a look. Any feedback, questions etc very welcome.
>
> David
>
> "All that is gold does not glitter, Not all those who wander are lost."
>


Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
Hi Corey,

> Would you mind providing a link to the github?

Sure, here is the github link you're looking for:

https://github.com/onetapbeyond/opencpu-spark-executor

David

"All that is gold does not glitter, Not all those who wander are lost."



 Original Message 
Subject: Re: ROSE: Spark + R on the JVM.
Local Time: January 12 2016 12:32 pm
UTC Time: January 12 2016 5:32 pm
From: cjno...@gmail.com
To: themarchoffo...@protonmail.com
CC: user@spark.apache.org,d...@spark.apache.org



David,
Thank you very much for announcing this! It looks like it could be very useful. 
Would you mind providing a link to the github?



On Tue, Jan 12, 2016 at 10:03 AM, David  wrote:

Hi all,

I'd like to share news of the recent release of a new Spark package, ROSE.

ROSE is a Scala library offering access to the full scientific computing power 
of the R programming language to Apache Spark batch and streaming applications 
on the JVM. Where Apache SparkR lets data scientists use Spark from R, ROSE is 
designed to let Scala and Java developers use R from Spark.

The project is available and documented on GitHub and I would encourage you to 
take a look. Any feedback, questions etc very welcome.

David

"All that is gold does not glitter, Not all those who wander are lost."

RE: ROSE: Spark + R on the JVM.

2016-01-12 Thread Chandan Verma
Definitely great news for all the R and Spark guys over here.



From: Corey Nolet [mailto:cjno...@gmail.com]
Sent: Tuesday, January 12, 2016 11:02 PM
To: David
Cc: user@spark.apache.org; d...@spark.apache.org
Subject: Re: ROSE: Spark + R on the JVM.



David,

Thank you very much for announcing this! It looks like it could be very useful. 
Would you mind providing a link to the github?



On Tue, Jan 12, 2016 at 10:03 AM, David  wrote:

Hi all,



I'd like to share news of the recent release of a new Spark package, ROSE.



ROSE is a Scala library offering access to the full scientific computing power 
of the R programming language to Apache Spark batch and streaming applications 
on the JVM. Where Apache SparkR lets data scientists use Spark from R, ROSE is 
designed to let Scala and Java developers use R from Spark.



The project is available and documented on GitHub and I would encourage you to 
take a look. Any feedback, questions etc very welcome.



David



"All that is gold does not glitter, Not all those who wander are lost."





