Re: Filter on Grouped Data

2015-07-03 Thread Raghavendra Pandey
Why don't you apply the filter first, then group the data and run the
aggregations?
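
A minimal sketch of that approach (assuming the DataFrame below is called df, its To column is a plain string, and "receiver@xyz.com" is a hypothetical receiver of interest):

import org.apache.spark.sql.functions.desc

val bySender = df
  .filter(df("To").contains("receiver@xyz.com"))   // keep only mails addressed to that receiver
  .groupBy("From")
  .count()                                          // GroupedData.count() adds a "count" column
  .orderBy(desc("count"))

bySender.show(1)                                    // sender with the most matching mails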
On Jul 3, 2015 1:29 PM, Megha Sridhar- Cynepia megha.sridh...@cynepia.com
wrote:

 Hi,


 I have a Spark DataFrame object, which when trimmed, looks like,



 From                  To                                Subject             Message-ID
 karen@xyz.com         ['vance.me...@enron.com',         SEC Inquiry         19952575.1075858
                        'jeannie.mandel...@enron.com',
                        'mary.cl...@enron.com',
                        'sarah.pal...@enron.com']

 elyn.hug...@xyz.com   ['dennis.ve...@enron.com',        Revised documents   33499184.1075858
                        'gina.tay...@enron.com',
                        'kelly.kimbe...@enron.com']
 .
 .
 .


 I have run a groupBy("From") on the above DataFrame and obtained a
 GroupedData object as a result. I need to apply a filter on the grouped
 data (for instance, getting the sender who sent the maximum number of the mails
 that were addressed to a particular receiver in the To list).
 Is there a way to accomplish this by applying a filter on grouped data?


 Thanks,
 Megha


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

2015-07-03 Thread Kostas Kougios
I have this problem with a job. A random executor gets this

ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

Almost always at the same point in the processing of the data. I am
processing 1 million files with sc.wholeTextFiles. At around the 600,000th
file, a container receives this signal. On the driver I get:

15/07/03 14:20:11 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
or disconnected! Shutting down. cruncher03.stratified:44617
15/07/03 14:20:11 ERROR cluster.YarnClusterScheduler: Lost executor 3 on
cruncher03.stratified: remote Rpc client disassociated
15/07/03 14:20:11 WARN remote.ReliableDeliverySupervisor: Association with
remote system [akka.tcp://sparkExecutor@cruncher03.stratified:44617] has
failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/07/03 14:20:11 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
or disconnected! Shutting down. cruncher03.stratified:44617
15/07/03 14:20:11 INFO scheduler.TaskSetManager: Re-queueing tasks for 3
from TaskSet 5.0


There is plenty of memory on the machine and in the container JVM, so I don't
think it is an OS-level OOM kill (after all, that would be a SIGKILL) or a JVM
OutOfMemoryError (there is no out-of-memory exception).

What can be causing this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-executor-CoarseGrainedExecutorBackend-RECEIVED-SIGNAL-15-SIGTERM-tp23613.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Streaming: updating broadcast variables

2015-07-03 Thread Raghavendra Pandey
You cannot update a broadcast variable; the change won't get reflected on the
workers.
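
One common workaround is a hedged sketch along these lines (not an official API; loadFilter(), filterNeedsRefresh() and accepts() are hypothetical helpers): keep a driver-side var holding the broadcast and rebuild it there. The function passed to transform() is evaluated on the driver for every batch, so the workers pick up the fresh broadcast value.

// driver-side reference to the current broadcast
var filterBc = ssc.sparkContext.broadcast(loadFilter())

val filtered = stream.transform { rdd =>
  if (filterNeedsRefresh()) {                            // hypothetical polling check
    filterBc.unpersist(blocking = false)                 // drop stale copies on the executors
    filterBc = rdd.sparkContext.broadcast(loadFilter())  // re-broadcast the fresh filter
  }
  val bc = filterBc                                      // capture a stable reference for the closure
  rdd.filter(record => bc.value.accepts(record))
}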
On Jul 3, 2015 12:18 PM, James Cole ja...@binarism.net wrote:

 Hi all,

 I'm filtering a DStream using a function. I need to be able to change this
 function while the application is running (I'm polling a service to see if
 a user has changed their filtering). The filter function is a
 transformation and runs on the workers, so that's where the updates need to
 go. I'm not sure of the best way to do this.

 Initially broadcasting seemed like the way to go: the filter is actually
 quite large. But I don't think I can update something I've broadcasted.
 I've tried unpersisting and re-creating the broadcast variable but it
 became obvious this wasn't updating the reference on the worker. So am I
 correct in thinking I can't use broadcasted variables for this purpose?

 The next option seems to be: stopping the JavaStreamingContext, creating a
 new one from the SparkContext, updating the filter function, and
 re-creating the DStreams (I'm using direct streams from Kafka).

 If I re-created the JavaStreamingContext would the accumulators (which are
 created from the SparkContext) keep working? (Obviously I'm going to try
 this soon)

 In summary:

 1) Can broadcasted variables be updated?

 2) Is there a better way than re-creating the JavaStreamingContext and
 DStreams?

 Thanks,

 James




Re: Optimizations

2015-07-03 Thread Raghavendra Pandey
This is the basic design of spark that it runs all actions in different
stages...
Not sure you can achieve what you r looking for.
On Jul 3, 2015 12:43 PM, Marius Danciu marius.dan...@gmail.com wrote:

 Hi all,

 If I have something like:

 rdd.join(...).mapPartitionToPair(...)

 It looks like mapPartitionToPair runs in a different stage then join. Is
 there a way to piggyback this computation inside the join stage ? ... such
 that each result partition after join is passed to
 the mapPartitionToPair function, all running in the same state without any
 other costs.

 Best,
 Marius



Re: Kryo fails to serialise output

2015-07-03 Thread Will Briggs
Kryo serialization is used internally by Spark for spilling or shuffling 
intermediate results, not for writing out an RDD as an action. Look at Sandy 
Ryza's examples for some hints on how to do this: 
https://github.com/sryza/simplesparkavroapp

Regards,
Will
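
For reference, a minimal sketch of that approach (writing the RDD through the Avro Hadoop output format, loosely following the simplesparkavroapp example — treat it as an untested outline), assuming Tweet is the generated Avro SpecificRecord class, tweets is the RDD[Tweet] from the original mail, and the output path is a placeholder:

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance()
AvroJob.setOutputKeySchema(job, Tweet.getClassSchema())   // schema of the generated Avro class

tweets
  .map(t => (new AvroKey[Tweet](t), NullWritable.get()))  // Avro output wants (AvroKey, NullWritable)
  .saveAsNewAPIHadoopFile(
    "file:///tmp/spark-avro",                             // placeholder output path
    classOf[AvroKey[Tweet]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[Tweet]],
    job.getConfiguration)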

On July 3, 2015, at 2:45 AM, Dominik Hübner cont...@dhuebner.com wrote:

I have a rather simple avro schema to serialize Tweets (message, username, 
timestamp).
Kryo and twitter chill are used to do so.

For my dev environment the Spark context is configured as below

val conf: SparkConf = new SparkConf()
conf.setAppName("kryo_test")
conf.setMaster("local[4]")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "co.feeb.TweetRegistrator")

Serialization is setup with

override def registerClasses(kryo: Kryo): Unit = {
kryo.register(classOf[Tweet], 
AvroSerializer.SpecificRecordBinarySerializer[Tweet])
}

(This method gets called)


Using this configuration to persist some object fails with 
java.io.NotSerializableException: co.feeb.avro.Tweet 
(which seems to be ok as this class is not Serializable)

I used the following code:

val ctx: SparkContext = new SparkContext(conf)
val tweets: RDD[Tweet] = ctx.parallelize(List(
    new Tweet("a", "b", 1L),
    new Tweet("c", "d", 2L),
    new Tweet("e", "f", 3L)
  )
)

tweets.saveAsObjectFile("file:///tmp/spark")

Using saveAsTextFile works, but persisted files are not binary but JSON

cat /tmp/spark/part-0
{"username": "a", "text": "b", "timestamp": 1}
{"username": "c", "text": "d", "timestamp": 2}
{"username": "e", "text": "f", "timestamp": 3}

Is this intended behaviour, a configuration issue, avro serialisation not 
working in local mode or something else?





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark-csv into labeled points with null values

2015-07-03 Thread Saif.A.Ellafi
Hello all,

I am learning Scala Spark and going through some applications with data I have.
Please allow me to put a couple of questions:

spark-csv: The data I have isn't malformed, but there are empty values in some
rows, properly comma-separated and not caught by DROPMALFORMED mode.
These values are taken into account as null values. My final mission is to
create a LabeledPoint vector for MLlib, so my steps are:
a.  load csv
b.  cast column types to have a proper DataFrame schema
c.  apply map() to create a LabeledPoint with a denseVector, using map(row =>
row.getDouble(col_index))

To this point:
res173: org.apache.spark.mllib.regression.LabeledPoint = 
(-1.530132691E9,[162.89431,13.55811,18.3346818,-1.6653182])

As running the following code:

  val model = new LogisticRegressionWithLBFGS().
  setNumClasses(2).
  setValidateData(true).
  run(data_map)

  java.lang.RuntimeException: Failed to check null bit for primitive double 
value.

Debugging this, I am pretty sure this is because of rows that look like
-2.593849123898,392.293891

Any suggestions to get round this?

Saif
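
One possible way around it is sketched below (hedged — the label/feature column names are hypothetical placeholders): drop the rows containing nulls with the DataFrame NA functions before mapping to LabeledPoint.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val featureCols = Seq("f1", "f2", "f3", "f4")           // hypothetical feature column names
val clean = df.na.drop(Seq("label") ++ featureCols)      // drop any row with a null in these columns

val data_map = clean.map { row =>
  LabeledPoint(
    row.getDouble(0),                                    // label assumed to be the first column
    Vectors.dense((1 to featureCols.size).map(row.getDouble).toArray))
}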





Re: Spark performance issue

2015-07-03 Thread Silvio Fiorito
It’ll help to see the code or at least understand what transformations you’re 
using.

Also, you have 15 nodes but are not using all of them, so that means you may be 
losing data locality. You can see this in the Spark job UI if any tasks do 
not have node or process locality.

From: diplomatic Guru
Date: Friday, July 3, 2015 at 8:58 AM
To: user@spark.apache.org
Subject: Spark performance issue

Hello guys,

I'm after some advice on Spark performance.

I have a MapReduce job that reads inputs, carries out a simple calculation and 
writes the results into HDFS. I've implemented the same logic in a Spark job.

When I tried both jobs on same datasets, I'm getting different execution time, 
which is expected.

BUT
..
In my example, MapReduce job is performing much better than Spark.

The difference is that I'm not changing much with the MR job configuration, 
e.g., memory, cores, etc... But this is not the case with Spark, as it's very 
flexible. So I'm sure my configuration isn't correct, which is why MR is 
outperforming Spark, but I need your advice.

For example:

Test 1:
4.5GB data -  MR job took ~55 seconds to compute, but Spark took ~3 minutes and 
20 seconds.

Test 2:
25GB data -MR took 2 minutes and 15 seconds, whereas Spark job is still 
running, and it's already been 15 minutes.


I have a cluster of 15 nodes. The maximum memory that I could allocate to each 
executor is 6GB. Therefore, for Test 1, this is the config I used:

--executor-memory 6G --num-executors 4 --driver-memory 6G  --executor-cores 2 
(also I set spark.storage.memoryFraction to 0.3)


For Test 2:
--executor-memory 6G --num-executors 10 --driver-memory 6G  --executor-cores 2 
(also I set spark.storage.memoryFraction to 0.3)

I tried all possible combination but couldn't get better performance. Any 
suggestions will be much appreciated.








Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-03 Thread Ted Yu
Alternatively, setting spark.driver.extraClassPath should work.

Cheers
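
For example, an untested sketch that reuses the jar paths from Daniel's command (the executor setting is added on the assumption that the executors need the classes too):

./sbin/start-thriftserver.sh --master yarn \
  --conf spark.driver.extraClassPath=/home/hdiuser/azureclass/azure-storage-1.2.0.jar:/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar \
  --conf spark.executor.extraClassPath=/home/hdiuser/azureclass/azure-storage-1.2.0.jar:/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar \
  --num-executors 4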

On Fri, Jul 3, 2015 at 2:59 AM, Steve Loughran ste...@hortonworks.com
wrote:


 On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv 
 daniel.ha...@veracity-group.com wrote:

 Hi,
 I'm trying to start the thrift-server and passing it azure's blob
 storage jars but I'm failing on :
  Caused by: java.io.IOException: No FileSystem for scheme: wasb
 at
 org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
 at
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
 at
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
 at
 org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
 at
 org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:342)
 ... 16 more

  If I start the spark-shell the same way, everything works fine.

  spark-shell command:
   ./bin/spark-shell --master yarn --jars
 /home/hdiuser/azureclass/azure-storage-1.2.0.jar,/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar
 --num-executors 4

  thrift-server command:
  ./sbin/start-thriftserver.sh --master yarn --jars
 /home/hdiuser/azureclass/azure-storage-1.2.0.jar,/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar
 --num-executors 4

  How can I pass dependency jars to the thrift server?

  Thanks,
 Daniel




  you should be able to add the JARs to the environment variable
 SPARK_SUBMIT_CLASSPATH or SPARK_CLASSPATH and have them picked up when
 bin/compute-classpath.{cmd.sh} builds up the classpath





Spark performance issue

2015-07-03 Thread diplomatic Guru
Hello guys,

I'm after some advice on Spark performance.

I have a MapReduce job that reads inputs, carries out a simple calculation and
writes the results into HDFS. I've implemented the same logic in a Spark job.

When I tried both jobs on same datasets, I'm getting different execution
time, which is expected.

BUT
..
In my example, MapReduce job is performing much better than Spark.

The difference is that I'm not changing much with the MR job configuration,
e.g., memory, cores, etc... But this is not the case with Spark, as it's very
flexible. So I'm sure my configuration isn't correct, which is why MR is
outperforming Spark, but I need your advice.

For example:

Test 1:
4.5GB data -  MR job took ~55 seconds to compute, but Spark took ~3 minutes
and 20 seconds.

Test 2:
25GB data -MR took 2 minutes and 15 seconds, whereas Spark job is still
running, and it's already been 15 minutes.


I have a cluster of 15 nodes. The maximum memory that I could allocate to
each executor is 6GB. Therefore, for Test 1, this is the config I used:

--executor-memory 6G --num-executors 4 --driver-memory 6G  --executor-cores
2 (also I set spark.storage.memoryFraction to 0.3)


For Test 2:
--executor-memory 6G --num-executors 10 --driver-memory 6G
 --executor-cores 2 (also I set spark.storage.memoryFraction to 0.3)

I tried all possible combination but couldn't get better performance. Any
suggestions will be much appreciated.


Re: Spark Streaming broadcast to all keys

2015-07-03 Thread Silvio Fiorito
updateStateByKey will run for all keys, whether they have new data in a batch
or not, so you should still be able to use it.
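
A minimal sketch of that idea, assuming a DStream[(String, Long)] of word counts named wordCounts and that checkpointing is enabled; the update function is also invoked for keys with no new values in a batch, so the global decrement can happen there:

val counts = wordCounts.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
  val updated = state.getOrElse(0L) + newValues.sum - 1L  // decrease every word seen so far by 1
  if (updated > 0L) Some(updated) else None                // returning None drops the key entirely
}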



On 7/3/15, 7:34 AM, micvog mich...@micvog.com wrote:

UpdateStateByKey is useful but what if I want to perform an operation to all
existing keys (not only the ones in this RDD).

Word count for example - is there a way to decrease *all* words seen so far
by 1?

I was thinking of keeping a static class per node with the count information
and issuing a broadcast command to take a certain action, but could not find
a broadcast-to-all-nodes functionality or a better way.

Thanks,
Michael



-
Michael Vogiatzis
@mvogiatzis 
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-broadcast-to-all-keys-tp23609.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Float type coercion on SparkR with hiveContext

2015-07-03 Thread Evgeny Sinelnikov
Hello,

I've run into trouble with float type coercion on SparkR with hiveContext.

> result <- sql(hiveContext, "SELECT offset, percentage from data limit 100")

> show(result)
DataFrame[offset:float, percentage:float]

> head(result)
Error in as.data.frame.default(x[[i]], optional = TRUE) :
  cannot coerce class "jobj" to a data.frame


This trouble looks like it is related to an existing issue (SPARK-2863 - Emulate
Hive type coercion in native reimplementations of Hive functions) with the same
root cause - the native reimplementations of Hive are not complete, and not
for functions only.

So, has anybody met this issue before? And how can I test it more
precisely, if it doesn't look like a bug?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkSQL cache table with multiple replicas

2015-07-03 Thread David Sabater Dinter
Hi all,
Do you know if there is an option to specify how many replicas we want
when caching a table in memory in the SparkSQL Thrift server? I have not seen
any option so far, but I assumed there is one, as you can see in the
Storage section of the UI that there is 1 x replica of your
Dataframe/Table...

I believe there can be a good use case where you want to replicate a
dimension table across your nodes to improve response times when running
typical BI/DWH types of queries (just to avoid having to broadcast the data
time and again).

Do you think that would be a good addition to SparkSQL?



Regards.
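
As a related aside, a hedged sketch ("dim_table" is a placeholder table name): outside the Thrift server, a DataFrame can be cached programmatically with a replicated storage level, where the "_2" levels keep two copies across the executors.

import org.apache.spark.storage.StorageLevel

val dim = sqlContext.table("dim_table")     // placeholder table name
dim.persist(StorageLevel.MEMORY_ONLY_2)     // two in-memory replicas
dim.count()                                 // force materialization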


Re: duplicate names in sql allowed?

2015-07-03 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-8817

On Fri, Jul 3, 2015 at 11:43 AM, Koert Kuipers ko...@tresata.com wrote:

 i see the relaxation to allow duplicate field names was done on purpose,
 since some data sources can have dupes due to case insensitive resolution.

 apparently the issue is now dealt with during query analysis.

 although this might work for sql it does not seem a good thing for
 DataFrame to me. it seems desirable that a DataFrame should have unique
 column names. not having this guarantee will complicate building other DSLs
 on top of DataFrame (this is how i ran into this issue). its also
 counterintuitive... do R dataframes and pandas allow dupes in column names?
  On Jul 3, 2015 3:27 AM, Akhil Das ak...@sigmoidanalytics.com wrote:

 I think you can open up a jira, not sure if this PR
 https://github.com/apache/spark/pull/2209/files (SPARK-2890
 https://issues.apache.org/jira/browse/SPARK-2890) broke the validation
 piece.

 Thanks
 Best Regards

 On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers ko...@tresata.com wrote:

 i am surprised this is allowed...

 scala> sqlContext.sql("select name as boo, score as boo from
 candidates").schema

 res7: org.apache.spark.sql.types.StructType =
 StructType(StructField(boo,StringType,true),
 StructField(boo,IntegerType,true))


 should StructType check for duplicate field names?





Experience with centralised logging for Spark?

2015-07-03 Thread Edward Sargisson
Hi all,
I'm wondering if anybody has any experience with centralised logging for
Spark - or even has felt that there was a need for this, given the WebUI.

At my organization we use Log4j2 and Flume as the front end of our
centralised logging system. I was looking into modifying Spark to use that
system and I'm reconsidering my approach. I thought I'd ask the community
to see what people have tried.

Log4j2 is important because it works nicely with Flume. The problem I've
got is that all of the Spark processes (master, worker, spark-submit) use
the same conf directory and so would get the same log4j2.xml. This then
means that they would try and use the same directory for the file channel
(which will fail because Flume locks its directory). Secondly, if I want to
add an interceptor to stamp every event with the component name then I
cannot tell the difference between the components - everything would get
'apache-spark'.

This could be fixed by modifying the start up scripts to pass the component
name around; but that's more modification than I really want to make.

So are people generally happy with the WebUI approach for getting access
to stderr and stdout, or have other people rolled better solutions?

Yes, I'm aware of https://issues.apache.org/jira/browse/SPARK-6305 and the
associated pull request.

Many thanks, in advance, for your thoughts.

Cheers,
Edward


Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Shivaram Venkataraman
You need to add -Psparkr to build SparkR code

Shivaram
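
For example, combining that profile with the command quoted below (untested sketch):

build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Psparkr -DskipTests clean package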

On Fri, Jul 3, 2015 at 2:14 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Did you try:

 build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package



 Thanks
 Best Regards

 On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com 1106944...@qq.com
 wrote:

 Hi all,
    Has anyone built the Spark 1.4 source code for SparkR with maven/sbt? What's
 the command? Using SparkR requires building from source for the 1.4 version.
 thank you

 --
 1106944...@qq.com





Re: duplicate names in sql allowed?

2015-07-03 Thread Koert Kuipers
i see the relaxation to allow duplicate field names was done on purpose,
since some data sources can have dupes due to case insensitive resolution.

apparently the issue is now dealt with during query analysis.

although this might work for sql it does not seem a good thing for
DataFrame to me. it seems desirable that a DataFrame should have unique
column names. not having this guarantee will complicate building other DSLs
on top of DataFrame (this is how i ran into this issue). its also
counterintuitive... do R dataframes and pandas allow dupes in column names?
 On Jul 3, 2015 3:27 AM, Akhil Das ak...@sigmoidanalytics.com wrote:

 I think you can open up a jira, not sure if this PR
 https://github.com/apache/spark/pull/2209/files (SPARK-2890
 https://issues.apache.org/jira/browse/SPARK-2890) broke the validation
 piece.

 Thanks
 Best Regards

 On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers ko...@tresata.com wrote:

 i am surprised this is allowed...

 scala> sqlContext.sql("select name as boo, score as boo from
 candidates").schema

 res7: org.apache.spark.sql.types.StructType =
 StructType(StructField(boo,StringType,true),
 StructField(boo,IntegerType,true))


 should StructType check for duplicate field names?





Re: Spark SQL groupby timestamp

2015-07-03 Thread sim
@bastien, in those situations, I prefer to use Unix timestamps (millisecond
or second granularity) because you can apply math operations to them easily.
If you don't have a Unix timestamp, you can use unix_timestamp() from Hive
SQL to get one with second granularity. Then grouping by hour becomes
very simple:

select 3600*floor(timestamp/3600) as timestamp, count(error) as errors
from logs
group by 3600*floor(timestamp/3600)

Hope this helps.
/Sim



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-groupby-timestamp-tp23470p23615.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Optimizations

2015-07-03 Thread Marius Danciu
Thanks for your feedback. Yes, I am aware of the stage design, and Silvio,
what you are describing is essentially a map-side join, which is not applicable
when both RDDs are quite large.

It appears that

rdd.join(...).mapToPair(f)
f is piggybacked inside join stage  (right in the reducers I believe)

whereas

rdd.join(...).mapPartitionToPair( f )

f is executed in a different stage. This is surprising because, at least
intuitively, the difference between mapToPair and mapPartitionToPair is that
the former is about the push model whereas the latter is about polling
records out of the iterator (*I suspect there are other technical reasons*).
If anyone knows the depths of the problem it would be of great help.

Best,
Marius

On Fri, Jul 3, 2015 at 6:43 PM Silvio Fiorito silvio.fior...@granturing.com
wrote:

   One thing you could do is a broadcast join. You take your smaller RDD,
 save it as a broadcast variable. Then run a map operation to perform the
 join and whatever else you need to do. This will remove a shuffle stage but
 you will still have to collect the smaller RDD and broadcast it. It all depends
 on the size of your data whether it's worth it or not.
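
 A minimal sketch of that broadcast-join idea, assuming small: RDD[(K, V)] fits in driver memory and large: RDD[(K, W)] is the big side:

 val smallMap = sc.broadcast(small.collectAsMap())   // ship the small side to every executor

 val joined = large.mapPartitions { iter =>
   iter.flatMap { case (k, w) =>
     smallMap.value.get(k).map(v => (k, (w, v)))     // emit only keys present on both sides
   }
 }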

   From: Marius Danciu
 Date: Friday, July 3, 2015 at 3:13 AM
 To: user
 Subject: Optimizations

   Hi all,

  If I have something like:

  rdd.join(...).mapPartitionToPair(...)

  It looks like mapPartitionToPair runs in a different stage than join. Is
 there a way to piggyback this computation inside the join stage? ... such
 that each result partition after the join is passed to
 the mapPartitionToPair function, all running in the same stage without any
 other costs.

  Best,
 Marius



Continuous errors trying to start spark-shell

2015-07-03 Thread Mohamed Lrhazi
Hello,

I am trying to just start spark-shell... it starts, the prompt appears,
and then a never-ending (literally) stream of these log lines proceeds.
What is it trying to do? Why is it failing?

To start it I do:

$ docker run -it ncssm/spark-base /spark/bin/spark-shell --master spark://
devzero.cs.georgetown.edu:7077




The log lines look like:


5/07/04 00:36:36 INFO Worker: Asked to launch executor
app-20150704003631-0004/45 for Spark shell
15/07/04 00:36:36 INFO ExecutorRunner: Launch command:
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp
/spark/sbin/../conf/:/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar:/spark/lib/datanucleus-api-jdo-3.2.6.jar:/spark/lib/datanucleus-rdbms-3.2.9.jar:/spark/lib/datanucleus-core-3.2.10.jar
-Xms512M -Xmx512M -Dspark.driver.port=50266
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
akka.tcp://sparkDriver@172.17.0.45:50266/user/CoarseGrainedScheduler
--executor-id 45 --hostname 10.212.55.41 --cores 16 --app-id
app-20150704003631-0004 --worker-url akka.tcp://
sparkWorker@10.212.55.41:7078/user/Worker
15/07/04 00:36:39 INFO Worker: Executor app-20150704003631-0004/45 finished
with state EXITED message Command exited with code 1 exitStatus 1


On an example worker node, I see a corresponding unending stream of errors:

15/07/04 00:36:31 INFO Worker: Asked to launch executor
app-20150704003631-0004/7 for Spark shell
15/07/04 00:36:31 INFO ExecutorRunner: Launch command:
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp
/spark/sbin/../conf/:/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar:/spark/lib/datanucleus-api-jdo-3.2.6.jar:/spark/lib/datanucleus-rdbms-3.2.9.jar:/spark/lib/datanucleus-core-3.2.10.jar
-Xms512M -Xmx512M -Dspark.driver.port=50266
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
akka.tcp://sparkDriver@172.17.0.45:50266/user/CoarseGrainedScheduler
--executor-id 7 --hostname 10.212.55.41 --cores 16 --app-id
app-20150704003631-0004 --worker-url akka.tcp://
sparkWorker@10.212.55.41:7078/user/Worker
15/07/04 00:36:36 INFO Worker: Executor app-20150704003631-0004/7 finished
with state EXITED message Command exited with code 1 exitStatus 1


Thanks,
Mohamed.


Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread khaledh
I'm writing a Spark Streaming application that uses RabbitMQ to consume
events. One feature of RabbitMQ that I intend to make use of is bulk ack of
messages, i.e. no need to ack one-by-one, but only ack the last event in a
batch and that would ack the entire batch.

Before I commit to doing so, I'd like to know if Spark Streaming always
processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before
RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is
finished?

This is crucial to the ack logic, since if RDD2 can be potentially processed
while RDD1 is still being processed, then if I ack the last event in
RDD2 that would also ack all events in RDD1, even though they may have not
been completely processed yet.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Are-Spark-Streaming-RDDs-always-processed-in-order-tp23616.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkR and Spark Mlib

2015-07-03 Thread ayan guha
No. SparkR is the R language binding for Spark. MLlib is the machine learning
project on top of Spark core.
On 4 Jul 2015 12:23, praveen S mylogi...@gmail.com wrote:

 Hi,
 Are SparkR and Spark MLlib the same?



SparkR and Spark Mlib

2015-07-03 Thread praveen S
Hi,
Are SparkR and Spark MLlib the same?


Re: How to timeout a task?

2015-07-03 Thread William Ferrell
Ted,

Thanks very much for your reply. It took me almost a week, but I have
finally had a chance to implement what you noted and it appears to be
working locally. However, when I launch this onto a cluster on EC2, this
doesn't work reliably.

To expand, I think the issue is that some of the code we have holds the
Python GIL and hence no internal timeout will work. That is why I was
hoping to learn of a task-level timeout -- something at the Spark level --
the management level -- such that it can decide a task has taken too long
and just kill it and move on.

Does this make sense?  Are you familiar with any such options?

Best,

- Bill


On Sat, Jun 27, 2015 at 9:26 AM, Ted Yu yuzhih...@gmail.com wrote:

 Have you looked at:

 http://stackoverflow.com/questions/2281850/timeout-function-if-it-takes-too-long-to-finish

 FYI

 On Sat, Jun 27, 2015 at 8:33 AM, wasauce wferr...@gmail.com wrote:

 Hello!

 We use pyspark to run a set of data extractors (think regex). The
 extractors
 (regexes) generally run quite quickly and find a few matches which are
 returned and stored into a database.

 My question is -- is it possible to make the function that runs the
 extractors have a timeout? I.E. if for a given file the extractor runs for
 more than X seconds it terminates and returns a default value?

 Here is a code snippet of what we are doing with some comments as to which
 function I am looking to timeout.

 code: https://gist.github.com/wasauce/42a956a1371a2b564918

 Thank you

 - Bill



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-timeout-a-task-tp23513.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread Raghavendra Pandey
I don't think you can expect any ordering guarantee except for the records
within one partition.
 On Jul 4, 2015 7:43 AM, khaledh khal...@gmail.com wrote:

 I'm writing a Spark Streaming application that uses RabbitMQ to consume
 events. One feature of RabbitMQ that I intend to make use of is bulk ack of
 messages, i.e. no need to ack one-by-one, but only ack the last event in a
 batch and that would ack the entire batch.

 Before I commit to doing so, I'd like to know if Spark Streaming always
 processes RDDs in the same order they arrive in, i.e. if RDD1 arrives
 before
 RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is
 finished?

 This is crucial to the ack logic, since if RDD2 can be potentially
 processed
 while RDD1 is still being processed, then if I ack the last event in
 RDD2 that would also ack all events in RDD1, even though they may have not
 been completely processed yet.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Are-Spark-Streaming-RDDs-always-processed-in-order-tp23616.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark 1.4 MLLib Bug?: Multiclass Classification requirement failed: sizeInBytes was negative

2015-07-03 Thread Burak Yavuz
How many partitions do you have? It might be that one partition is too
large, and there is Integer overflow. Could you double your number of
partitions?

Burak
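
A minimal sketch of that suggestion, assuming data: RDD[LabeledPoint] is the training set from the original mail:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val repartitioned = data.repartition(data.partitions.length * 2)  // double the partition count

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(390)
  .run(repartitioned)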

On Fri, Jul 3, 2015 at 4:41 AM, Danny kont...@dannylinden.de wrote:

 hi,

 I want to run a multiclass classification with 390 classes on 120k labeled
 points (tf-idf vectors), but I get the following exception. If I reduce the
 number of classes to ~20 everything works fine. How can I fix this?

 I use the LogisticRegressionWithLBFGS class for my classification on an 8
 node cluster with


 total-executor-cores = 30

 executor-memory = 20g

 My Exception:

 15/07/02 15:55:00 INFO DAGScheduler: Job 11 finished: count at
 LBFGS.scala:170, took 0,521823 s
 15/07/02 15:55:02 INFO MemoryStore: ensureFreeSpace(-1069858488) called
 with
 curMem=308280107, maxMem=3699737
 15/07/02 15:55:02 INFO MemoryStore: Block broadcast_22 stored as values in
 memory (estimated size -1069858488.0 B, free 11.1 GB)
 Exception in thread main java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
 at
 org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
 Caused by: java.lang.IllegalArgumentException: requirement failed:
 sizeInBytes was negative: -1069858488
 at scala.Predef$.require(Predef.scala:233)
 at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
 at
 org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:812)
 at
 org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:635)
 at
 org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:993)
 at

 org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
 at

 org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:85)
 at

 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
 at

 org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
 at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
 at

 org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:215)
 at

 org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:204)
 at
 breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
 at

 breeze.optimize.FirstOrderMinimizer.calculateObjective(FirstOrderMinimizer.scala:108)
 at

 breeze.optimize.FirstOrderMinimizer.initialState(FirstOrderMinimizer.scala:101)
 at

 breeze.optimize.FirstOrderMinimizer.iterations(FirstOrderMinimizer.scala:146)
 at
 org.apache.spark.mllib.optimization.LBFGS$.runLBFGS(LBFGS.scala:178)
 at
 org.apache.spark.mllib.optimization.LBFGS.optimize(LBFGS.scala:117)
 at

 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:282)
 at

 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:205)
 at

 com.test.spark.SVMSimpleAppEC2$.createNaiveBayesModel(SVMSimpleAppEC2.scala:150)
 at com.test.spark.SVMSimpleAppEC2$.main(SVMSimpleAppEC2.scala:48)
 at com.test.spark.SVMSimpleAppEC2.main(SVMSimpleAppEC2.scala)
 ... 6 more
 15/07/02 15:55:02 INFO SparkContext: Invoking stop() from shutdown hook



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-MLLib-Bug-Multiclass-Classification-requirement-failed-sizeInBytes-was-negative-tp23610.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Akhil Das
With the binary distribution I think it might not be possible, although if you
download the sources and build them yourself you can remove this function
https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023
which initializes the SQLContext.

Thanks
Best Regards

On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv 
daniel.ha...@veracity-group.com wrote:

 Hi,
 I've downloaded the pre-built binaries for Hadoop 2.6 and whenever I start
 the spark-shell it always start with HiveContext.

 How can I disable the HiveContext from being initialized automatically ?

 Thanks,
 Daniel



Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Daniel Haviv
The main reason is Spark's startup time and the need to configure a component I
don't really need (without configs the HiveContext takes more time to load).

Thanks,
Daniel

 On 3 July 2015, at 11:13, Robin East robin.e...@xense.co.uk wrote:
 
 As Akhil mentioned there isn’t AFAIK any kind of initialisation to stop the 
 SQLContext being created. If you could articulate why you would need to do 
 this (it’s not obvious to me what the benefit would be) then maybe this is 
 something that could be included as a feature in a future release. It may 
 also suggest a way to a workaround.
 
 On 3 Jul 2015, at 08:33, Daniel Haviv daniel.ha...@veracity-group.com 
 wrote:
 
 Thanks
 I was looking for a less hack-ish way :)
 
 Daniel
 
 On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das ak...@sigmoidanalytics.com 
 wrote:
 With binary i think it might not be possible, although if you can download 
 the sources and then build it then you can remove this function which 
 initializes the SQLContext.
 
 Thanks
 Best Regards
 
 On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv 
 daniel.ha...@veracity-group.com wrote:
 Hi,
 I've downloaded the pre-built binaries for Hadoop 2.6 and whenever I start 
 the spark-shell it always start with HiveContext.
 
 How can I disable the HiveContext from being initialized automatically ?
 
 Thanks,
 Daniel
 


Optimizations

2015-07-03 Thread Marius Danciu
Hi all,

If I have something like:

rdd.join(...).mapPartitionToPair(...)

It looks like mapPartitionToPair runs in a different stage than join. Is
there a way to piggyback this computation inside the join stage? ... such
that each result partition after the join is passed to
the mapPartitionToPair function, all running in the same stage without any
other costs.

Best,
Marius


Re: duplicate names in sql allowed?

2015-07-03 Thread Akhil Das
I think you can open up a jira, not sure if this PR
https://github.com/apache/spark/pull/2209/files (SPARK-2890
https://issues.apache.org/jira/browse/SPARK-2890) broke the validation
piece.

Thanks
Best Regards

On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers ko...@tresata.com wrote:

 i am surprised this is allowed...

 scala> sqlContext.sql("select name as boo, score as boo from
 candidates").schema

 res7: org.apache.spark.sql.types.StructType =
 StructType(StructField(boo,StringType,true),
 StructField(boo,IntegerType,true))


 should StructType check for duplicate field names?



Filter on Grouped Data

2015-07-03 Thread Megha Sridhar- Cynepia

Hi,


I have a Spark DataFrame object, which when trimmed, looks like,




From                  To                                Subject             Message-ID
karen@xyz.com         ['vance.me...@enron.com',         SEC Inquiry         19952575.1075858
                       'jeannie.mandel...@enron.com',
                       'mary.cl...@enron.com',
                       'sarah.pal...@enron.com']

elyn.hug...@xyz.com   ['dennis.ve...@enron.com',        Revised documents   33499184.1075858
                       'gina.tay...@enron.com',
                       'kelly.kimbe...@enron.com']
.
.
.


I have run a groupBy("From") on the above DataFrame and obtained a 
GroupedData object as a result. I need to apply a filter on the grouped 
data (for instance, getting the sender who sent the maximum number of the 
mails that were addressed to a particular receiver in the To list).

Is there a way to accomplish this by applying a filter on grouped data?


Thanks,
Megha


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread ayan guha
HiveContext is a superset of SQLContext, so you should be able to
perform all your tasks. Are you facing any problems with HiveContext?
On 3 Jul 2015 17:33, Daniel Haviv daniel.ha...@veracity-group.com wrote:

 Thanks
 I was looking for a less hack-ish way :)

 Daniel

 On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 With binary i think it might not be possible, although if you can
 download the sources and then build it then you can remove this function
 https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023
 which initializes the SQLContext.

 Thanks
 Best Regards

 On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv 
 daniel.ha...@veracity-group.com wrote:

 Hi,
 I've downloaded the pre-built binaries for Hadoop 2.6 and whenever I
 start the spark-shell it always start with HiveContext.

 How can I disable the HiveContext from being initialized automatically ?

 Thanks,
 Daniel






build spark 1.4 source code for sparkR with maven

2015-07-03 Thread 1106944...@qq.com
Hi all,
   Has anyone built the Spark 1.4 source code for SparkR with maven/sbt? What's 
the command? Using SparkR requires building from source for the 1.4 version.
thank you  



1106944...@qq.com


[spark1.4] sparkContext.stop causes exception on Mesos

2015-07-03 Thread Ayoub
Hello Spark developers, 

After upgrading to spark 1.4 on Mesos 0.22.1 existing code started to throw
this exception when calling sparkContext.stop :

(SparkListenerBus) [ERROR -
org.apache.spark.Logging$class.logError(Logging.scala:96)] Listener
EventLoggingListener threw an exception 
java.lang.reflect.InvocationTargetException 
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) 
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:606) 
at
org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:146)
 
at
org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:146)
 
at scala.Option.foreach(Option.scala:236) 
at
org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:146)
 
at
org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:190)
 
at
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54)
 
at
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
 
at
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
 
at
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56) 
at
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
 
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
 
at
org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1215) 
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
 
Caused by: java.io.IOException: Filesystem closed 
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:730) 
at
org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1855) 
at
org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1816) 
at
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130) 
... 16 more 
I0701 15:03:46.101809  1612 sched.cpp:1589] Asked to stop the driver 
I0701 15:03:46.101971  1355 sched.cpp:831] Stopping framework
'20150629-132734-1224736778-5050-6126-0028'


This problem happens only when the spark.eventLog.enabled flag is set to true.
It also happens if sparkContext.stop is omitted in the code, I think because
Spark shuts down the Spark context indirectly.

Does anyone know what could cause this problem ?

Thanks,
Ayoub.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark1-4-sparkContext-stop-causes-exception-on-Mesos-tp23605.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Daniel Haviv
Thanks
I was looking for a less hack-ish way :)

Daniel

On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 With binary i think it might not be possible, although if you can download
 the sources and then build it then you can remove this function
 https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023
 which initializes the SQLContext.

 Thanks
 Best Regards

 On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv 
 daniel.ha...@veracity-group.com wrote:

 Hi,
 I've downloaded the pre-built binaries for Hadoop 2.6 and whenever I
 start the spark-shell it always start with HiveContext.

 How can I disable the HiveContext from being initialized automatically ?

 Thanks,
 Daniel





Spark 1.4 MLLib Bug?: Multiclass Classification requirement failed: sizeInBytes was negative

2015-07-03 Thread Danny Linden
hi, 

I want to run a multiclass classification with 390 classes on 120k labeled 
points (tf-idf vectors), but I get the following exception. If I reduce the 
number of classes to ~20 everything works fine. How can I fix this?

I use the LogisticRegressionWithLBFGS class for my classification on an 8 node 
cluster with 


total-executor-cores = 30

executor-memory = 20g

My Exception:

15/07/02 15:55:00 INFO DAGScheduler: Job 11 finished: count at LBFGS.scala:170, 
took 0,521823 s
15/07/02 15:55:02 INFO MemoryStore: ensureFreeSpace(-1069858488) called with 
curMem=308280107, maxMem=3699737
15/07/02 15:55:02 INFO MemoryStore: Block broadcast_22 stored as values in 
memory (estimated size -1069858488.0 B, free 11.1 GB)
Exception in thread main java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: sizeInBytes 
was negative: -1069858488
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:812)
at 
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:635)
at 
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:993)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
at 
org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:85)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
at 
org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:215)
at 
org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:204)
at 
breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
at 
breeze.optimize.FirstOrderMinimizer.calculateObjective(FirstOrderMinimizer.scala:108)
at 
breeze.optimize.FirstOrderMinimizer.initialState(FirstOrderMinimizer.scala:101)
at 
breeze.optimize.FirstOrderMinimizer.iterations(FirstOrderMinimizer.scala:146)
at org.apache.spark.mllib.optimization.LBFGS$.runLBFGS(LBFGS.scala:178)
at org.apache.spark.mllib.optimization.LBFGS.optimize(LBFGS.scala:117)
at 
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:282)
at 
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:205)
at 
com.test.spark.SVMSimpleAppEC2$.createNaiveBayesModel(SVMSimpleAppEC2.scala:150)
at com.test.spark.SVMSimpleAppEC2$.main(SVMSimpleAppEC2.scala:48)
at com.test.spark.SVMSimpleAppEC2.main(SVMSimpleAppEC2.scala)
... 6 more
15/07/02 15:55:02 INFO SparkContext: Invoking stop() from shutdown hook



Kryo fails to serialise output

2015-07-03 Thread Dominik Hübner
I have a rather simple avro schema to serialize Tweets (message, username, 
timestamp).
Kryo and twitter chill are used to do so.

For my dev environment the Spark context is configured as below

val conf: SparkConf = new SparkConf()
conf.setAppName("kryo_test")
conf.setMaster("local[4]")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "co.feeb.TweetRegistrator")

Serialization is setup with

override def registerClasses(kryo: Kryo): Unit = {
kryo.register(classOf[Tweet], 
AvroSerializer.SpecificRecordBinarySerializer[Tweet])
}

(This method gets called)


Using this configuration to persist some object fails with 
java.io.NotSerializableException: co.feeb.avro.Tweet 
(which seems to be ok as this class is not Serializable)

I used the following code:

val ctx: SparkContext = new SparkContext(conf)
val tweets: RDD[Tweet] = ctx.parallelize(List(
    new Tweet("a", "b", 1L),
    new Tweet("c", "d", 2L),
    new Tweet("e", "f", 3L)
  )
)

tweets.saveAsObjectFile("file:///tmp/spark")

Using saveAsTextFile works, but persisted files are not binary but JSON

cat /tmp/spark/part-0
{"username": "a", "text": "b", "timestamp": 1}
{"username": "c", "text": "d", "timestamp": 2}
{"username": "e", "text": "f", "timestamp": 3}

Is this intended behaviour, a configuration issue, avro serialisation not 
working in local mode or something else?





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-03 Thread Suraj Shetiya
Hi Salih,

Thanks for the links :) This seems very promising to me.

When do you think this would be available in the spark codeline ?

Thanks,
Suraj
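
For what it's worth, a minimal sketch of the UDF canonicalization approach Michael suggested further down the thread (df and its Subject column stand in for the real DataFrame; the regex is only an example):

import org.apache.spark.sql.functions.udf

// Strip any leading "Re:" / "Fwd:" / "Fw:" prefixes so replies group with the original subject.
val canonicalize = udf((subject: String) => subject.replaceAll("(?i)^((re|fwd?):\\s*)+", "").trim)

val byTopic = df.groupBy(canonicalize(df("Subject")).as("topic")).count()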

On Fri, Jul 3, 2015 at 2:02 AM, Salih Oztop soz...@yahoo.com wrote:

 Hi Suraj,
 It seems your requirement is Record Linkage/Entity Resolution.
 https://en.wikipedia.org/wiki/Record_linkage
 http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf

 A presentation from Spark Summit using GraphX

 https://spark-summit.org/east-2015/talk/distributed-graph-based-entity-resolution-using-spark


 Kind Regards
 Salih Oztop
 07856128843
 http://www.linkedin.com/in/salihoztop

   --
  *From:* Suraj Shetiya surajshet...@gmail.com
 *To:* Michael Armbrust mich...@databricks.com
 *Cc:* Salih Oztop soz...@yahoo.com; user@spark.apache.org 
 user@spark.apache.org; megha.sridh...@cynepia.com
 *Sent:* Thursday, July 2, 2015 10:47 AM

 *Subject:* Re: Spark Dataframe 1.4 (GroupBy partial match)

 Hi Michael,

 Thanks for a quick response.. This sounds like something that would work.
 However, Rethinking the problem statement and various other use cases,
 which are growing, there are more such scenarios, where one could have
 columns with structured and unstructured data embedded (json or xml or
 other kind of collections), it may make sense to allow probabilistic
 groupby operations where the user can get the same functionality in one
 step instead of two..

 Your thoughts on if that makes sense..

 -Suraj




 -- Forwarded message --
 From: Michael Armbrust mich...@databricks.com
 Date: Jul 2, 2015 12:49 AM
 Subject: Re: Spark Dataframe 1.4 (GroupBy partial match)
 To: Suraj Shetiya surajshet...@gmail.com
 Cc: Salih Oztop soz...@yahoo.com, user@spark.apache.org 
 user@spark.apache.org

 You should probably write a UDF that uses regular expression or other
 string munging to canonicalize the subject and then group on that derived
 column.

 On Tue, Jun 30, 2015 at 10:30 PM, Suraj Shetiya surajshet...@gmail.com
 wrote:

 Thanks Salih. :)


 The output of the groupby is as below.

 2015-01-14  SEC Inquiry
 2015-01-16   Re: SEC Inquiry
 2015-01-18   Fwd: Re: SEC Inquiry


 And subsequently, we would like to aggregate all messages with a
 particular reference subject.
 For instance the question we are trying to answer could be : Get the count
 of messages with a particular subject.

 Looking forward to any suggestion from you.


 On Tue, Jun 30, 2015 at 8:42 PM, Salih Oztop soz...@yahoo.com wrote:

 Hi Suraj
 What will be your output after group by? Since GroupBy is for aggregations
 like sum, count etc.
 If you want to count the 2015 records than it is possible.

 Kind Regards
 Salih Oztop


   --
  *From:* Suraj Shetiya surajshet...@gmail.com
 *To:* user@spark.apache.org
 *Sent:* Tuesday, June 30, 2015 3:05 PM
 *Subject:* Spark Dataframe 1.4 (GroupBy partial match)

 I have a dataset (trimmed and simplified) with 2 columns as below.

 DateSubject
 2015-01-14  SEC Inquiry
 2014-02-12   Happy birthday
 2014-02-13   Re: Happy birthday
 2015-01-16   Re: SEC Inquiry
 2015-01-18   Fwd: Re: SEC Inquiry

 I have imported the same in a Spark Dataframe. What I am looking at is
 groupBy subject field (however, I need a partial match to identify the
 discussion topic).

 For example in the above case.. I would like to group all messages, which
 have subject containing SEC Inquiry which returns following grouped
 frame:

 2015-01-14  SEC Inquiry
 2015-01-16   Re: SEC Inquiry
 2015-01-18   Fwd: Re: SEC Inquiry

 Another usecase for a similar problem could be group by year (in the above
 example), it would mean partial match of the date field, which would mean
 groupBy Date by matching year as 2014 or 2015.

 Keenly Looking forward to reply/solution to the above.

 - Suraj











-- 
Regards,
Suraj


Streaming: updating broadcast variables

2015-07-03 Thread James Cole
Hi all,

I'm filtering a DStream using a function. I need to be able to change this
function while the application is running (I'm polling a service to see if
a user has changed their filtering). The filter function is a
transformation and runs on the workers, so that's where the updates need to
go. I'm not sure of the best way to do this.

Initially broadcasting seemed like the way to go: the filter is actually
quite large. But I don't think I can update something I've broadcasted.
I've tried unpersisting and re-creating the broadcast variable but it
became obvious this wasn't updating the reference on the worker. So am I
correct in thinking I can't use broadcasted variables for this purpose?

The next option seems to be: stopping the JavaStreamingContext, creating a
new one from the SparkContext, updating the filter function, and
re-creating the DStreams (I'm using direct streams from Kafka).

If I re-created the JavaStreamingContext would the accumulators (which are
created from the SparkContext) keep working? (Obviously I'm going to try
this soon)

In summary:

1) Can broadcasted variables be updated?

2) Is there a better way than re-creating the JavaStreamingContext and
DStreams?

Thanks,

James


Re: Accessing the console from spark

2015-07-03 Thread Jem Tucker
In the driver when running spark-submit with --master yarn-client

On Fri, Jul 3, 2015 at 10:23 AM Akhil Das ak...@sigmoidanalytics.com
wrote:

 Where does it returns null? Within the driver or in the executor? I just
 tried System.console.readPassword in spark-shell and it worked.

 Thanks
 Best Regards

 On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote:

 Hi,

 We have an application that requires a username/password to be entered
 from the command line. To screen a password in java you need to use
 System.console.readPassword however when running with spark System.console
 returns null?? Any ideas on how to get the console from spark?

 Thanks,

 Jem





Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-03 Thread Steve Loughran

On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv 
daniel.ha...@veracity-group.com wrote:
Hi,
I'm trying to start the thrift-server and passing it azure's blob storage jars 
but I'm failing on :
Caused by: java.io.IOException: No FileSystem for scheme: wasb
at 
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:342)
... 16 more

If I start the spark-shell the same way, everything works fine.

spark-shell command:
 ./bin/spark-shell --master yarn --jars 
/home/hdiuser/azureclass/azure-storage-1.2.0.jar,/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar
 --num-executors 4

thrift-server command:
 ./sbin/start-thriftserver.sh --master yarn --jars 
/home/hdiuser/azureclass/azure-storage-1.2.0.jar,/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar
 --num-executors 4

How can I pass dependency jars to the thrift server?

Thanks,
Daniel



you should be able to add the JARs to the environment variable 
SPARK_SUBMIT_CLASSPATH or SPARK_CLASSPATH and have them picked up when 
bin/compute-classpath.{cmd,sh} builds up the classpath




Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Can you paste the code? Something is missing

Thanks
Best Regards

On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker jem.tuc...@gmail.com wrote:

 In the driver when running spark-submit with --master yarn-client

 On Fri, Jul 3, 2015 at 10:23 AM Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Where does it returns null? Within the driver or in the executor? I just
 tried System.console.readPassword in spark-shell and it worked.

 Thanks
 Best Regards

 On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote:

 Hi,

 We have an application that requires a username/password to be entered
 from the command line. To mask a password in Java you need to use
 System.console.readPassword; however, when running with Spark, System.console
 returns null. Any ideas on how to get the console from Spark?

 Thanks,

 Jem





Multiple Join Conditions in dataframe join

2015-07-03 Thread bipin
Hi, I need to join with multiple conditions. Can anyone tell me how to specify
that? For example, this is what I am trying to do:

val Lead_all = Leads.
 | join(Utm_Master,
Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign")
==
Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")

When I do this I get "error: too many arguments for method apply".

Thanks
Bipin
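
For reference, one way to write this against the Spark 1.4 DataFrame API (a
sketch, assuming Leads and Utm_Master are the two DataFrames and both carry
the four columns) is to combine the equality tests into a single Column
expression with &&:

val Lead_all = Leads.join(
  Utm_Master,
  Leads("LeadSource")   === Utm_Master("LeadSource") &&
  Leads("Utm_Source")   === Utm_Master("Utm_Source") &&
  Leads("Utm_Medium")   === Utm_Master("Utm_Medium") &&
  Leads("Utm_Campaign") === Utm_Master("Utm_Campaign"),
  "left_outer")

The "too many arguments for method apply" error comes from columns, which is
an Array[String]; its apply expects a single index, not a list of column names.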




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Join-Conditions-in-dataframe-join-tp23606.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Accessing the console from spark

2015-07-03 Thread Jem Tucker
Hi,

We have an application that requires a username/password to be entered from
the command line. To mask a password in Java you need to use
System.console.readPassword; however, when running with Spark, System.console
returns null. Any ideas on how to get the console from Spark?

Thanks,

Jem


Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Akhil Das
Did you try:

build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package



Thanks
Best Regards

On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com 1106944...@qq.com wrote:

 Hi all,
Has anyone built the Spark 1.4 source code for SparkR with Maven/SBT? What's
 the command? To use SparkR, you need to build it from the 1.4 source.
 Thank you

 --
 1106944...@qq.com



Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Where does it return null? Within the driver or in the executor? I just
tried System.console.readPassword in spark-shell and it worked.

Thanks
Best Regards

On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote:

 Hi,

 We have an application that requires a username/password to be entered
 from the command line. To mask a password in Java you need to use
 System.console.readPassword; however, when running with Spark, System.console
 returns null. Any ideas on how to get the console from Spark?

 Thanks,

 Jem



Re: Accessing the console from spark

2015-07-03 Thread Jem Tucker
I have shown two scenarios below:

// setup spark context

val user = readLine("username: ")
val pass = System.console.readPassword("password: ")  // <- null pointer exception here

and

// setup spark context

val user = readLine("username: ")
val console = System.console  // <- null pointer exception here
val pass = console.readPassword("password: ")


thanks,

Jem
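
A possible workaround, sketched below: System.console is null whenever the
JVM's standard streams are not attached to a real terminal (which can depend
on how the launcher scripts wire up stdin), so one option is to fall back to a
plain readLine, accepting that the password will be echoed in that case:

val pass: Array[Char] = Option(System.console) match {
  case Some(c) => c.readPassword("password: ")        // a real console: input is masked
  case None    => readLine("password: ").toCharArray  // no console: read from stdin (echoes!)
}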


On Fri, Jul 3, 2015 at 11:04 AM Akhil Das ak...@sigmoidanalytics.com
wrote:

 Can you paste the code? Something is missing

 Thanks
 Best Regards

 On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker jem.tuc...@gmail.com wrote:

 In the driver when running spark-submit with --master yarn-client

 On Fri, Jul 3, 2015 at 10:23 AM Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Where does it return null? Within the driver or in the executor? I just
 tried System.console.readPassword in spark-shell and it worked.

 Thanks
 Best Regards

 On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote:

 Hi,

 We have an application that requires a username/password to be entered
 from the command line. To mask a password in Java you need to use
 System.console.readPassword; however, when running with Spark, System.console
 returns null. Any ideas on how to get the console from Spark?

 Thanks,

 Jem






Spark Streaming broadcast to all keys

2015-07-03 Thread micvog
updateStateByKey is useful, but what if I want to perform an operation on all
existing keys (not only the ones in this RDD)?

Word count for example - is there a way to decrease *all* words seen so far
by 1?

I was thinking of keeping a static class per node with the count information
and issuing a broadcast command to take a certain action, but could not find
a broadcast-to-all-nodes functionality or a better way.

Thanks,
Michael
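
One property worth noting, with a sketch below: updateStateByKey already
invokes the update function for every key currently in the state on every
batch, passing an empty Seq for keys with no new values, so a global
"decrease by 1" can live inside that function. Here pairs is a hypothetical
DStream[(String, Int)] of (word, 1) records, and checkpointing must be enabled
for stateful operations:

val counts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  val updated = state.getOrElse(0) + newValues.sum - 1  // add this batch's hits, then decay every word by one
  if (updated > 0) Some(updated) else None              // returning None drops the word from the state
}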



-
Michael Vogiatzis
@mvogiatzis 
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-broadcast-to-all-keys-tp23609.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.4 MLLib Bug?: Multiclass Classification requirement failed: sizeInBytes was negative

2015-07-03 Thread Danny
Hi,

I want to run a multiclass classification with 390 classes on 120k labeled
points (tf-idf vectors), but I get the following exception. If I reduce the
number of classes to ~20, everything works fine. How can I fix this?

I use the LogisticRegressionWithLBFGS class for my classification on an
8-node cluster with


total-executor-cores = 30

executor-memory = 20g

My Exception:

15/07/02 15:55:00 INFO DAGScheduler: Job 11 finished: count at
LBFGS.scala:170, took 0,521823 s
15/07/02 15:55:02 INFO MemoryStore: ensureFreeSpace(-1069858488) called with
curMem=308280107, maxMem=3699737
15/07/02 15:55:02 INFO MemoryStore: Block broadcast_22 stored as values in
memory (estimated size -1069858488.0 B, free 11.1 GB)
Exception in thread main java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed:
sizeInBytes was negative: -1069858488
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:812)
at
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:635)
at 
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:993)
at
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
at
org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:85)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
at
org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:215)
at
org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:204)
at
breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
at
breeze.optimize.FirstOrderMinimizer.calculateObjective(FirstOrderMinimizer.scala:108)
at
breeze.optimize.FirstOrderMinimizer.initialState(FirstOrderMinimizer.scala:101)
at
breeze.optimize.FirstOrderMinimizer.iterations(FirstOrderMinimizer.scala:146)
at org.apache.spark.mllib.optimization.LBFGS$.runLBFGS(LBFGS.scala:178)
at org.apache.spark.mllib.optimization.LBFGS.optimize(LBFGS.scala:117)
at
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:282)
at
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:205)
at
com.test.spark.SVMSimpleAppEC2$.createNaiveBayesModel(SVMSimpleAppEC2.scala:150)
at com.test.spark.SVMSimpleAppEC2$.main(SVMSimpleAppEC2.scala:48)
at com.test.spark.SVMSimpleAppEC2.main(SVMSimpleAppEC2.scala)
... 6 more
15/07/02 15:55:02 INFO SparkContext: Invoking stop() from shutdown hook
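
A hedged back-of-the-envelope reading of the failure, assuming the reported
size wrapped a 32-bit integer: -1069858488 + 2^32 is about 3.2 GB, i.e. the
block being broadcast from LBFGS (the current weight vector) has grown past
the roughly 2 GB limit on a single block. (390 - 1) classes times roughly a
million tf-idf features times 8 bytes lands at a bit over 3 GB, whereas ~20
classes would be around 0.15 GB, which would explain why the smaller run
works. If that guess about the feature dimension is right, reducing the
feature space (for example by hashing to fewer dimensions) is what brings the
model back under the limit.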



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-MLLib-Bug-Multiclass-Classification-requirement-failed-sizeInBytes-was-negative-tp23610.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org