[jira] [Commented] (OPENNLP-857) ParserTool should take use Tokenizer instance. It should not use java.util.StringTokenizer

2016-11-12 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660268#comment-15660268
 ] 

Tristan Nixon commented on OPENNLP-857:
---

I'm not seeing this in the trunk code; where should I look for it? Is it in a
branch or tag?

> ParserTool should take use Tokenizer instance. It should not use 
> java.util.StringTokenizer
> --
>
> Key: OPENNLP-857
> URL: https://issues.apache.org/jira/browse/OPENNLP-857
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Parser
>Affects Versions: 1.6.0
>    Reporter: Tristan Nixon
>Assignee: Joern Kottmann
> Fix For: 1.7.0
>
> Attachments: ParserToolTokenize.patch
>
>
> It would be nice if the ParserTool would make use of a real tokenizer. In 
> addition to being the "right" thing to do, it would obviate issues like 
> OPENNLP-240 when using the parser tool.
> While I realize that java.util.StringTokenizer effectively does the same work 
> as WhitespaceTokenizer, it seems odd to use the former when the latter exists.
> To this end, I'm attaching a patch that adds an additional method
> public static Parse[] parseLine(String line, Parser parser, Tokenizer 
> tokenizer, int numParses)
> I've left the existing method
> public static Parse[] parseLine(String line, Parser parser, int numParses)
> in for convenience and backwards compatibility. It simply calls the new 
> method with WhitespaceTokenizer.INSTANCE
> For good measure, I've added a new command-line argument -tk, which takes the 
> name of a tokenizer model. If none is specified, it will fall back on the 
> current behavior of using the whitespace tokenizer.





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-11-07 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644259#comment-15644259
 ] 

Tristan Nixon commented on OPENNLP-776:
---

I've been swamped with other work, but I should be able to look at this 
tomorrow.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>    Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, 
> serializable-basemodel-joern.patch, serializable-basemodel.patch, 
> serialization_proxy.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.





[jira] [Comment Edited] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545918#comment-15545918
 ] 

Tristan Nixon edited comment on OPENNLP-776 at 10/4/16 4:44 PM:


Sorry, I probably should have removed that older patch and consolidated them 
into a single patch.


was (Author: tnixon):
Sorry, I probably should have removed that older patch.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>    Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel.patch, 
> serialization_proxy.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545918#comment-15545918
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Sorry, I probably should have removed that older patch.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>    Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel.patch, 
> serialization_proxy.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545791#comment-15545791
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Well, it's a bit of a messy type hierarchy: the write(int) method is defined
both on the abstract class OutputStream AND on the interface DataOutput, which
is extended by the interface ObjectOutput. The ObjectOutputStream class inherits
from BOTH OutputStream AND ObjectOutput. However, the Externalizable interface
defines the method writeExternal(ObjectOutput), which implies that there could
be other implementations of that interface that are not necessarily subtypes of
OutputStream. This is in fact what some other frameworks do - they provide an
alternative implementation.
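
For illustration, a minimal sketch of such a bridge - an adapter that presents
an arbitrary ObjectOutput as an OutputStream so an existing serialize(OutputStream)
method could write to it. This is my own assumption about how it might look, not
code from any attached patch:

import java.io.IOException;
import java.io.ObjectOutput;
import java.io.OutputStream;

// Hypothetical adapter: exposes any ObjectOutput as an OutputStream.
class ObjectOutputAdapter extends OutputStream {

  private final ObjectOutput out;

  ObjectOutputAdapter(ObjectOutput out) {
    this.out = out;
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b); // write(int) exists on both types, as noted above
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
  }

  @Override
  public void flush() throws IOException {
    out.flush();
  }
}

With something like this, a writeExternal(ObjectOutput out) implementation could
delegate to the existing serialization via serialize(new ObjectOutputAdapter(out)),
whether or not the ObjectOutput happens to be an ObjectOutputStream.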

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>    Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel.patch, 
> serialization_proxy.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-08-19 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428454#comment-15428454
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Good point. I thought the only way to provide custom serialization was to use 
Externalizable, which does require a no-arg constructor, but now I see one can 
put the readObject and writeObject methods into a Serializable and get the same 
effect (leaving me wondering what the point of Externalizable is...).

One slight complication: because deserialization relies on Object's no-arg
constructor, the implicit initialization of fields like artifactMap and
artifactSerializers does not happen, so I need to do this explicitly in the
readObject method. That means they cannot be final anymore (nor can
isLoadedFromSerialized).

Otherwise, it seems to be working fine! See the attached patch.
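
As a rough illustration of the approach (not the attached patch itself - the
field names follow the discussion above, and the helper methods are placeholders):

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class ModelWithCustomSerialization implements Serializable {

  // no longer final, so they can be re-created during readObject
  private Map<String, Object> artifactMap = new HashMap<String, Object>();
  private boolean isLoadedFromSerialized;

  private void writeObject(ObjectOutputStream out) throws IOException {
    // delegate to the existing model serialization (placeholder)
    serializeModel(out);
  }

  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    // field initializers are not run during deserialization, so re-create state here
    artifactMap = new HashMap<String, Object>();
    isLoadedFromSerialized = true;
    loadModelFrom(in); // placeholder for the existing de-serialization
  }

  private void serializeModel(ObjectOutputStream out) throws IOException { /* ... */ }

  private void loadModelFrom(ObjectInputStream in) throws IOException { /* ... */ }
}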

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>    Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serialization_proxy.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2016-08-18 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: serialization_proxy.patch

Patch containing modifications to model classes to provide serialization 
proxies.
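
For readers unfamiliar with the pattern, here is a generic sketch of a
serialization proxy (an illustration only - the class names and the way the
model bytes are produced are assumptions, not the contents of the attached patch):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InvalidObjectException;
import java.io.ObjectStreamException;
import java.io.OutputStream;
import java.io.Serializable;

public class ModelWithProxy implements Serializable {

  public ModelWithProxy() {}

  public ModelWithProxy(InputStream in) throws IOException { /* ... */ }

  public void serialize(OutputStream out) throws IOException { /* ... */ }

  // replace this object with a compact proxy in the serialization stream
  private Object writeReplace() throws ObjectStreamException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try {
      serialize(bytes);
    } catch (IOException e) {
      throw new InvalidObjectException("could not serialize model: " + e.getMessage());
    }
    return new SerializationProxy(bytes.toByteArray());
  }

  private static class SerializationProxy implements Serializable {

    private final byte[] modelBytes;

    SerializationProxy(byte[] modelBytes) {
      this.modelBytes = modelBytes;
    }

    // rebuild the real model from the proxy when it is deserialized
    private Object readResolve() throws ObjectStreamException {
      try {
        return new ModelWithProxy(new ByteArrayInputStream(modelBytes));
      } catch (IOException e) {
        throw new InvalidObjectException("could not load model: " + e.getMessage());
      }
    }
  }
}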

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>    Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serialization_proxy.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-08-08 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412332#comment-15412332
 ] 

Tristan Nixon commented on OPENNLP-776:
---

This pattern is quite common in frameworks that manage object state for you. 
Classes are instantiated via a no-arg constructor, and then state is set via 
setters and/or some specialized de-serialization method.

Many different serialization frameworks work this way, such as JAXB, Jackson,
etc., as do ORM frameworks (Hibernate, JPA), IoC frameworks (Spring, CDI), and
many others.
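
A tiny, hedged example of that pattern with Jackson (the Account class here is
made up purely for illustration):

import com.fasterxml.jackson.databind.ObjectMapper;

public class NoArgConstructorDemo {

  // framework-friendly bean: public no-arg constructor plus setters
  public static class Account {
    private String name;

    public Account() {}

    public String getName() { return name; }

    public void setName(String name) { this.name = name; }
  }

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    // Jackson instantiates Account via its no-arg constructor,
    // then injects state through the setter
    Account account = mapper.readValue("{\"name\":\"example\"}", Account.class);
    System.out.println(account.getName());
  }
}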

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>    Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2016-07-18 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: (was: externalizable.patch)

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>    Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: externalizable.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2016-07-18 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: externalizable.patch

Also, model classes can't be final if we're going to inherit from them
(TokenizerModel currently is). Patch revised.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>    Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: externalizable.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2016-07-18 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: externalizable.patch

Actually, there is one more thing that must happen for this to be viable:
Externalizable sub-classes must provide a public no-arg constructor, and there
must be some constructor on a parent class that they can call, which should
probably in turn call BaseModel(COMPONENT_NAME, true). It would be most
convenient if each model type provided at least a protected no-arg constructor
(similar to the public ones in my previous patch), as this encapsulates the
functionality nicely.

I'm attaching a revised patch of the necessary changes.
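
A short sketch of the constructor chain being described, with stand-in class
names (the real BaseModel signatures may differ - this just illustrates the
shape of the change):

// Stand-in for BaseModel with the constructor mentioned above.
abstract class ExampleBaseModel {
  protected ExampleBaseModel(String componentName, boolean isLoadedFromSerialized) { /* ... */ }
}

// A model type exposing a protected no-arg constructor, as suggested.
abstract class ExampleComponentModel extends ExampleBaseModel {
  private static final String COMPONENT_NAME = "ExampleComponent";

  protected ExampleComponentModel() {
    super(COMPONENT_NAME, true);
  }
}

// The Externalizable sub-class then only needs a public no-arg constructor.
class ExternalizableExampleModel extends ExampleComponentModel implements java.io.Externalizable {

  public ExternalizableExampleModel() {
    super();
  }

  public void writeExternal(java.io.ObjectOutput out) throws java.io.IOException { /* ... */ }

  public void readExternal(java.io.ObjectInput in)
      throws java.io.IOException, ClassNotFoundException { /* ... */ }
}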

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>    Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: externalizable.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2016-07-18 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: (was: BaseModel-serialization.patch)

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>    Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: externalizable.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-07-18 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383308#comment-15383308
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Finally returning to this after more than a year. I'm not sure I really 
understand the objection to no-arg constructors. Nevertheless, creating 
Externalizable model sub-classes is an acceptable solution for my purposes.

However, in order for this to work, loadModel(InputStream in) must be made
protected (currently it is private) so that it can be called from the
readExternal method in the sub-classes. That change should be sufficient to
resolve my issue. Thanks!
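
To make the intent concrete, a sketch of what such a sub-class's wrappers might
look like once loadModel is protected (the base class here is a stub, and the
casts assume the usual ObjectInputStream/ObjectOutputStream are in play):

import java.io.Externalizable;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.OutputStream;

// Stub standing in for the existing base model class.
abstract class StubBaseModel {
  protected void loadModel(InputStream in) throws IOException { /* existing logic */ }
  public void serialize(OutputStream out) throws IOException { /* existing logic */ }
}

class ExternalizableModel extends StubBaseModel implements Externalizable {

  public ExternalizableModel() {}

  @Override
  public void writeExternal(ObjectOutput out) throws IOException {
    // an ObjectOutputStream is also an OutputStream; other ObjectOutput
    // implementations would need an adapter (see the discussion above)
    serialize((OutputStream) out);
  }

  @Override
  public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
    // possible once loadModel(InputStream) is protected rather than private
    loadModel((InputStream) in);
  }
}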

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>    Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: BaseModel-serialization.patch, model-constructors.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.





[jira] [Updated] (OPENNLP-857) ParserTool should take use Tokenizer instance. It should not use java.util.StringTokenizer

2016-07-09 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-857:
--
Attachment: ParserToolTokenize.patch

My patch

> ParserTool should take use Tokenizer instance. It should not use 
> java.util.StringTokenizer
> --
>
> Key: OPENNLP-857
> URL: https://issues.apache.org/jira/browse/OPENNLP-857
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Parser
>Affects Versions: 1.6.0
>    Reporter: Tristan Nixon
> Attachments: ParserToolTokenize.patch
>
>
> It would be nice if the ParserTool would make use of a real tokenizer. In 
> addition to being the "right" thing to do, it would obviate issues like 
> OPENNLP-240 when using the parser tool.
> While I realize that java.util.StringTokenizer effectively does the same work 
> as WhitespaceTokenizer, it seems odd to use the former when the latter exists.
> To this end, I'm attaching a patch that adds an additional method
> public static Parse[] parseLine(String line, Parser parser, Tokenizer 
> tokenizer, int numParses)
> I've left the existing method
> public static Parse[] parseLine(String line, Parser parser, int numParses)
> in for convenience and backwards compatibility. It simply calls the new 
> method with WhitespaceTokenizer.INSTANCE
> For good measure, I've added a new command-line argument -tk, which takes the 
> name of a tokenizer model. If none is specified, it will fall back on the 
> current behavior of using the whitespace tokenizer.





[jira] [Created] (OPENNLP-857) ParserTool should take use Tokenizer instance. It should not use java.util.StringTokenizer

2016-07-09 Thread Tristan Nixon (JIRA)
Tristan Nixon created OPENNLP-857:
-

 Summary: ParserTool should take use Tokenizer instance. It should 
not use java.util.StringTokenizer
 Key: OPENNLP-857
 URL: https://issues.apache.org/jira/browse/OPENNLP-857
 Project: OpenNLP
  Issue Type: Improvement
  Components: Parser
Affects Versions: 1.6.0
Reporter: Tristan Nixon


It would be nice if the ParserTool would make use of a real tokenizer. In 
addition to being the "right" thing to do, it would obviate issues like 
OPENNLP-240 when using the parser tool.

While I realize that java.util.StringTokenizer effectively does the same work 
as WhitespaceTokenizer, it seems odd to use the former when the latter exists.

To this end, I'm attaching a patch that adds an additional method
public static Parse[] parseLine(String line, Parser parser, Tokenizer 
tokenizer, int numParses)

I've left the existing method
public static Parse[] parseLine(String line, Parser parser, int numParses)
in for convenience and backwards compatibility. It simply calls the new method 
with WhitespaceTokenizer.INSTANCE

For good measure, I've added a new command-line argument -tk, which takes the 
name of a tokenizer model. If none is specified, it will fall back on the 
current behavior of using the whitespace tokenizer.
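
To make the proposal concrete, a sketch of what the two methods could look like
(an illustration based on the description above, not the attached patch itself;
the Parse-building details are assumptions):

import opennlp.tools.parser.AbstractBottomUpParser;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

public class ParserToolSketch {

  // new overload: the caller supplies the Tokenizer
  public static Parse[] parseLine(String line, Parser parser, Tokenizer tokenizer, int numParses) {
    String[] tokens = tokenizer.tokenize(line);

    // rebuild the text from the tokens, separated by single spaces
    StringBuilder sb = new StringBuilder();
    for (String token : tokens) {
      sb.append(token).append(' ');
    }
    String text = sb.length() > 0 ? sb.substring(0, sb.length() - 1) : "";

    Parse p = new Parse(text, new Span(0, text.length()), AbstractBottomUpParser.INC_NODE, 0, 0);
    int start = 0;
    for (String token : tokens) {
      p.insert(new Parse(text, new Span(start, start + token.length()),
          AbstractBottomUpParser.TOK_NODE, 0, 0));
      start += token.length() + 1;
    }
    return numParses == 1 ? new Parse[] { parser.parse(p) } : parser.parse(p, numParses);
  }

  // existing signature, kept for backwards compatibility:
  // it simply delegates with WhitespaceTokenizer.INSTANCE
  public static Parse[] parseLine(String line, Parser parser, int numParses) {
    return parseLine(line, parser, WhitespaceTokenizer.INSTANCE, numParses);
  }
}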





Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-14 Thread Tristan Nixon
I see - so you want the dependencies pre-installed on the cluster nodes so they 
do not need to be submitted along with the job jar?

Where are you planning on deploying/running spark? Do you have your own cluster 
or are you using AWS/other IaaS/PaaS provider?

Somehow you’ll need to get the dependencies onto each node and add them to 
Spark’s classpaths. You could modify an existing VM image or use Chef to 
distribute the jars and update the classpaths.

> On Mar 14, 2016, at 5:26 PM, prateek arora <prateek.arora...@gmail.com> wrote:
> 
> Hi
> 
> I do not want create single jar that contains all the other dependencies .  
> because it will increase the size of my spark job jar . 
> so i want to copy all libraries in cluster using some automation process . 
> just like currently i am using chef .
> but i am not sure is it a right method or not ?
> 
> 
> Regards
> Prateek
> 
> 
> On Mon, Mar 14, 2016 at 2:31 PM, Jakob Odersky <ja...@odersky.com 
> <mailto:ja...@odersky.com>> wrote:
> Have you tried setting the configuration
> `spark.executor.extraLibraryPath` to point to a location where your
> .so's are available? (Not sure if non-local files, such as HDFS, are
> supported)
> 
> On Mon, Mar 14, 2016 at 2:12 PM, Tristan Nixon <st...@memeticlabs.org 
> <mailto:st...@memeticlabs.org>> wrote:
> > What build system are you using to compile your code?
> > If you use a dependency management system like maven or sbt, then you 
> > should be able to instruct it to build a single jar that contains all the 
> > other dependencies, including third-party jars and .so’s. I am a maven user 
> > myself, and I use the shade plugin for this:
> > https://maven.apache.org/plugins/maven-shade-plugin/ 
> > <https://maven.apache.org/plugins/maven-shade-plugin/>
> >
> > However, if you are using SBT or another dependency manager, someone else 
> > on this list may be able to give you help on that.
> >
> > If you’re not using a dependency manager - well, you should be. Trying to 
> > manage this manually is a pain that you do not want to get in the way of 
> > your project. There are perfectly good tools to do this for you; use them.
> >
> >> On Mar 14, 2016, at 3:56 PM, prateek arora <prateek.arora...@gmail.com 
> >> <mailto:prateek.arora...@gmail.com>> wrote:
> >>
> >> Hi
> >>
> >> Thanks for the information .
> >>
> >> but my problem is that if i want to write spark application which depend on
> >> third party libraries like opencv then whats is the best approach to
> >> distribute all .so and jar file of opencv in all cluster ?
> >>
> >> Regards
> >> Prateek
> >>
> >>
> >>
> >> --
> >> View this message in context: 
> >> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-distribute-dependent-files-so-jar-across-spark-worker-nodes-tp26464p26489.html
> >>  
> >> <http://apache-spark-user-list.1001560.n3.nabble.com/How-to-distribute-dependent-files-so-jar-across-spark-worker-nodes-tp26464p26489.html>
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> >> <mailto:user-unsubscr...@spark.apache.org>
> >> For additional commands, e-mail: user-h...@spark.apache.org 
> >> <mailto:user-h...@spark.apache.org>
> >>
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > <mailto:user-unsubscr...@spark.apache.org>
> > For additional commands, e-mail: user-h...@spark.apache.org 
> > <mailto:user-h...@spark.apache.org>
> >
> 



Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-14 Thread Tristan Nixon
What build system are you using to compile your code?
If you use a dependency management system like maven or sbt, then you should be 
able to instruct it to build a single jar that contains all the other 
dependencies, including third-party jars and .so’s. I am a maven user myself, 
and I use the shade plugin for this:
https://maven.apache.org/plugins/maven-shade-plugin/

However, if you are using SBT or another dependency manager, someone else on 
this list may be able to give you help on that.

If you’re not using a dependency manager - well, you should be. Trying to 
manage this manually is a pain that you do not want to get in the way of your 
project. There are perfectly good tools to do this for you; use them.

> On Mar 14, 2016, at 3:56 PM, prateek arora  wrote:
> 
> Hi 
> 
> Thanks for the information .
> 
> but my problem is that if i want to write spark application which depend on
> third party libraries like opencv then whats is the best approach to
> distribute all .so and jar file of opencv in all cluster ?
> 
> Regards
> Prateek  
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-distribute-dependent-files-so-jar-across-spark-worker-nodes-tp26464p26489.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 





Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Tristan Nixon
Right, well I don’t think the issue is with how you’re compiling the scala. I 
think it’s a conflict between different versions of several libs.
I had similar issues with my spark modules. You need to make sure you’re not 
loading a different version of the same lib that is clobbering another 
dependency. It’s very frustrating, but with patience you can weed them out. 
You’ll want to find the offending libs and put them into an <exclusions> block 
under the associated dependency. I am still working with Spark 1.5, Scala 2.10, 
and for me the presence of scalap was the problem, and this resolved it:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.5.1</version>
  <exclusions>
    <exclusion>
      <groupId>org.json4s</groupId>
      <artifactId>json4s-core_2.10</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-core_2.10</artifactId>
  <version>3.2.10</version>
  <exclusions>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scalap</artifactId>
    </exclusion>
  </exclusions>
</dependency>


Unfortunately scalap is a dependency of json4s, which I want to keep. So what I 
do is exclude json4s from spark-core, then add it back in, but with its 
troublesome scalap dependency removed.


> On Mar 11, 2016, at 6:34 PM, Vasu Parameswaran <vas...@gmail.com> wrote:
> 
> Added these to the pom and still the same error :-(. I will look into sbt as 
> well.
> 
> 
> 
> On Fri, Mar 11, 2016 at 2:31 PM, Tristan Nixon <st...@memeticlabs.org 
> <mailto:st...@memeticlabs.org>> wrote:
> You must be relying on IntelliJ to compile your scala, because you haven’t 
> set up any scala plugin to compile it from maven.
> You should have something like this in your plugins:
> 
> 
>  
> <plugin>
>   <groupId>net.alchim31.maven</groupId>
>   <artifactId>scala-maven-plugin</artifactId>
>   <executions>
>     <execution>
>       <id>scala-compile-first</id>
>       <phase>process-resources</phase>
>       <goals>
>         <goal>compile</goal>
>       </goals>
>     </execution>
>     <execution>
>       <id>scala-test-compile</id>
>       <phase>process-test-resources</phase>
>       <goals>
>         <goal>testCompile</goal>
>       </goals>
>     </execution>
>   </executions>
> </plugin>
> 
> 
> PS - I use maven to compile all my scala and haven’t had a problem with it. I 
> know that sbt has some wonderful things, but I’m just set in my ways ;)
> 
>> On Mar 11, 2016, at 2:02 PM, Jacek Laskowski <ja...@japila.pl 
>> <mailto:ja...@japila.pl>> wrote:
>> 
>> Hi,
>> 
>> Doh! My eyes are bleeding to go through XMLs... 
>> 
>> Where did you specify Scala version? Dunno how it's in maven.
>> 
>> p.s. I *strongly* recommend sbt.
>> 
>> Jacek
>> 
>> 11.03.2016 8:04 PM "Vasu Parameswaran" <vas...@gmail.com 
>> <mailto:vas...@gmail.com>> wrote:
>> Thanks Jacek. Pom is below (currently set to 1.6.1 Spark but I started out 
>> with 1.6.0 with the same problem).
>> 
>> 
>> 
>> <project xmlns="http://maven.apache.org/POM/4.0.0"
>>  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>>  http://maven.apache.org/xsd/maven-4.0.0.xsd">
>> 
>>   <parent>
>>     <artifactId>spark</artifactId>
>>     <groupId>com.test</groupId>
>>     <version>1.0-SNAPSHOT</version>
>>   </parent>
>>   <modelVersion>4.0.0</modelVersion>
>> 
>>   <artifactId>sparktest</artifactId>
>> 
>>   <properties>
>>     <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
>>   </properties>
>> 
>>   <dependencies>
>>     <dependency>
>>       <groupId>junit</groupId>
>>       <artifactId>junit</artifactId>
>>     </dependency>
>>     <dependency>
>>       <groupId>commons-cli</groupId>
>>       <artifactId>commons-cli</artifactId>
>>     </dependency>
>>     <dependency>
>>       <groupId>com.google.code.gson</groupId>
>>       <artifactId>gson</artifactId>
>>       <version>2.3.1</version>
>>       <scope>compile</scope>
>>     </dependency>
>>     <dependency>
>>       <groupId>org.apache.spark</groupId>
>>       <artifactId>spark-core_2.11</artifactId>
>>       <version>1.6.1</version>
>>     </dependency>
>>   </dependencies>
>> 
>>   <build>
>>     <plugins>
>>       <plugin>
>>         <groupId>org.apache.maven.plugins</groupId>
>>         <artifactId>maven-shade-plugin</artifactId>
>>         <version>2.4.2</version>
>>         <executions>
>>           <execution>
>>             <phase>package</phase>
>>             <goals>
>>               <goal>shade</goal>
>>             </goals>
>>             <configuration>
>>               <finalName>${project.artifactId}-${project.version}-with-dependencies</finalName>
>>             </configuration>
>>           </execution>
>>         </executions>
>>       </plugin>
>>     </plugins>
>>   </build>
>> </project>
>> 
>> On Fri, Mar 11, 2016 at 10:46 AM, Jacek Laskowski <ja...@japila.pl 
>> <mailto:ja...@japila.pl>> wrote:
>> Hi,
>> 
>> Why do you use maven not sbt for Scala?
>> 
>> Can you show the entire pom.xml and the command to execute the app?
>> 
>> Jacek
>> 
>> 

Re: udf StructField to JSON String

2016-03-11 Thread Tristan Nixon
So I think in your case you’d do something more like:

val jsontrans = new JsonSerializationTransformer[StructType]
  .setInputCol("event")
  .setOutputCol("eventJSON")


> On Mar 11, 2016, at 3:51 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
> 
> val jsontrans = new 
> JsonSerializationTransformer[Document].setInputCol("myEntityColumn")
>  .setOutputCol("myOutputColumn")



Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-11 Thread Tristan Nixon
I recommend you package all your dependencies (jars, .so’s, etc.) into a single 
uber-jar and then submit that. It’s much more convenient than trying to manage 
including everything in the --jars arg of spark-submit. If you build with maven 
than the shade plugin will do this for you nicely:
https://maven.apache.org/plugins/maven-shade-plugin/

> On Mar 11, 2016, at 2:05 PM, Jacek Laskowski  wrote:
> 
> Hi,
> 
> For jars use spark-submit --jars. Dunno about so's. Could that work through 
> jars?
> 
> Jacek
> 
> 11.03.2016 8:07 PM "prateek arora" wrote:
> Hi
> 
> I have multiple node cluster and my spark jobs depend on a native
> library (.so files) and some jar files.
> 
> Can some one please explain what are the best ways to distribute dependent
> files across nodes?
> 
> right now i copied  dependent files in all nodes using chef tool .
> 
> Regards
> Prateek
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-distribute-dependent-files-so-jar-across-spark-worker-nodes-tp26464.html
>  
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 



Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Tristan Nixon
You must be relying on IntelliJ to compile your scala, because you haven’t set 
up any scala plugin to compile it from maven.
You should have something like this in your plugins:


 
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>scala-compile-first</id>
      <phase>process-resources</phase>
      <goals>
        <goal>compile</goal>
      </goals>
    </execution>
    <execution>
      <id>scala-test-compile</id>
      <phase>process-test-resources</phase>
      <goals>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
</plugin>


PS - I use maven to compile all my scala and haven’t had a problem with it. I 
know that sbt has some wonderful things, but I’m just set in my ways ;)

> On Mar 11, 2016, at 2:02 PM, Jacek Laskowski  wrote:
> 
> Hi,
> 
> Doh! My eyes are bleeding to go through XMLs... 
> 
> Where did you specify Scala version? Dunno how it's in maven.
> 
> p.s. I *strongly* recommend sbt.
> 
> Jacek
> 
> 11.03.2016 8:04 PM "Vasu Parameswaran" wrote:
> Thanks Jacek. Pom is below (currently set to 1.6.1 Spark but I started out 
> with 1.6.0 with the same problem).
> 
> 
> 
> <project xmlns="http://maven.apache.org/POM/4.0.0"
>  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>  http://maven.apache.org/xsd/maven-4.0.0.xsd">
> 
>   <parent>
>     <artifactId>spark</artifactId>
>     <groupId>com.test</groupId>
>     <version>1.0-SNAPSHOT</version>
>   </parent>
>   <modelVersion>4.0.0</modelVersion>
> 
>   <artifactId>sparktest</artifactId>
> 
>   <properties>
>     <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
>   </properties>
> 
>   <dependencies>
>     <dependency>
>       <groupId>junit</groupId>
>       <artifactId>junit</artifactId>
>     </dependency>
>     <dependency>
>       <groupId>commons-cli</groupId>
>       <artifactId>commons-cli</artifactId>
>     </dependency>
>     <dependency>
>       <groupId>com.google.code.gson</groupId>
>       <artifactId>gson</artifactId>
>       <version>2.3.1</version>
>       <scope>compile</scope>
>     </dependency>
>     <dependency>
>       <groupId>org.apache.spark</groupId>
>       <artifactId>spark-core_2.11</artifactId>
>       <version>1.6.1</version>
>     </dependency>
>   </dependencies>
> 
>   <build>
>     <plugins>
>       <plugin>
>         <groupId>org.apache.maven.plugins</groupId>
>         <artifactId>maven-shade-plugin</artifactId>
>         <version>2.4.2</version>
>         <executions>
>           <execution>
>             <phase>package</phase>
>             <goals>
>               <goal>shade</goal>
>             </goals>
>             <configuration>
>               <finalName>${project.artifactId}-${project.version}-with-dependencies</finalName>
>             </configuration>
>           </execution>
>         </executions>
>       </plugin>
>     </plugins>
>   </build>
> </project>
> 
> On Fri, Mar 11, 2016 at 10:46 AM, Jacek Laskowski  > wrote:
> Hi,
> 
> Why do you use maven not sbt for Scala?
> 
> Can you show the entire pom.xml and the command to execute the app?
> 
> Jacek
> 
> 11.03.2016 7:33 PM "vasu20" wrote:
> Hi
> 
> Any help appreciated on this.  I am trying to write a Spark program using
> IntelliJ.  I get a run time error as soon as new SparkConf() is called from
> main.  Top few lines of the exception are pasted below.
> 
> These are the following versions:
> 
> Spark jar:  spark-assembly-1.6.0-hadoop2.6.0.jar
> pom:  spark-core_2.11
>  1.6.0
> 
> I have installed the Scala plugin in IntelliJ and added a dependency.
> 
> I have also added a library dependency in the project structure.
> 
> Thanks for any help!
> 
> Vasu
> 
> 
> Exception in thread "main" java.lang.NoSuchMethodError:
> scala.Predef$.augmentString(Ljava/lang/String;)Ljava/lang/String;
> at org.apache.spark.util.Utils$.<init>(Utils.scala:1682)
> at org.apache.spark.util.Utils$.<clinit>(Utils.scala)
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:59)
> 
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Newbie-question-Help-with-runtime-error-on-augmentString-tp26462.html
>  
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



Re: udf StructField to JSON String

2016-03-11 Thread Tristan Nixon
It’s pretty simple, really:

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

/**
 * A SparkML Transformer that will transform an
 * entity of type T into a JSON-formatted string.
 * Created by Tristan Nixon <tris...@memeticlabs.org> on 3/11/16.
 */
class JsonSerializationTransformer[T](override val uid: String)
  extends UnaryTransformer[T, String, JsonSerializationTransformer[T]] {

  def this() = this(Identifiable.randomUID("JsonSerializationTransformer"))

  val mapper = new ObjectMapper
  // add additional mapper configuration code here, like this:
  // mapper.setAnnotationIntrospector(new JaxbAnnotationIntrospector)
  // or this:
  // mapper.getSerializationConfig.withFeatures(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS)

  override protected def createTransformFunc: T => String =
    mapper.writeValueAsString

  override protected def outputDataType: DataType = StringType
}
and you would use it like any other transformer:

val jsontrans = new 
JsonSerializationTransformer[Document].setInputCol("myEntityColumn")
 .setOutputCol("myOutputColumn")

val dfWithJson = jsontrans.transform( entityDF )

Note that this implementation is for Jackson 2.x. If you want to use Jackson 
1.x, it’s a bit trickier because the ObjectMapper class is not Serializable, 
and so you need to initialize it per-partition rather than having it just be a 
standard property.
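
(For the Jackson 1.x case, one common workaround - sketched here in Java as my
own assumption, not code from the app above - is to hold the mapper in a
transient field and create it lazily wherever the closure is deserialized:)

import org.codehaus.jackson.map.ObjectMapper;  // Jackson 1.x
import java.io.Serializable;

// Hypothetical serializable helper used inside a Spark job.
class JsonWriter implements Serializable {

  // Jackson 1.x ObjectMapper is not Serializable, so mark it transient
  // and re-create it lazily on each executor.
  private transient ObjectMapper mapper;

  private ObjectMapper mapper() {
    if (mapper == null) {
      mapper = new ObjectMapper();
    }
    return mapper;
  }

  public String toJson(Object value) throws java.io.IOException {
    return mapper().writeValueAsString(value);
  }
}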

> On Mar 11, 2016, at 12:49 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> 
> Hi Tristan,
> 
> Mind sharing the relevant code? I'd like to learn the way you use Transformer 
> to do so. Thanks!
> 
> Jacek
> 
> 11.03.2016 7:07 PM "Tristan Nixon" <st...@memeticlabs.org 
>> <mailto:st...@memeticlabs.org>> wrote:
> I have a similar situation in an app of mine. I implemented a custom ML 
> Transformer that wraps the Jackson ObjectMapper - this gives you full control 
> over how your custom entities / structs are serialized.
> 
>> On Mar 11, 2016, at 11:53 AM, Caires Vinicius <caire...@gmail.com 
>> <mailto:caire...@gmail.com>> wrote:
>> 
>> Hmm. I think my problem is a little more complex. I'm using 
>> https://github.com/databricks/spark-redshift 
>> <https://github.com/databricks/spark-redshift> and when I read from JSON 
>> file I got this schema.
>> 
>> root
>> |-- app: string (nullable = true)
>> 
>>  |-- ct: long (nullable = true)
>> 
>>  |-- event: struct (nullable = true)
>> 
>> ||-- attributes: struct (nullable = true)
>> 
>>  |||-- account: string (nullable = true)
>> 
>>  |||-- accountEmail: string (nullable = true)
>> 
>> 
>>  |||-- accountId: string (nullable = true)
>> 
>> 
>> 
>> I want to transform the Column event into String (formatted as JSON). 
>> 
>> I was trying to use udf but without success.
>> 
>> 
>> On Fri, Mar 11, 2016 at 1:53 PM Tristan Nixon <st...@memeticlabs.org 
>> <mailto:st...@memeticlabs.org>> wrote:
>> Have you looked at DataFrame.write.json( path )?
>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>>  
>> <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter>
>> 
>> > On Mar 11, 2016, at 7:15 AM, Caires Vinicius <caire...@gmail.com 
>> > <mailto:caire...@gmail.com>> wrote:
>> >
>> > I have one DataFrame with nested StructField and I want to convert to JSON 
>> > String. There is anyway to accomplish this?
>> 
> 



Re: udf StructField to JSON String

2016-03-11 Thread Tristan Nixon
I have a similar situation in an app of mine. I implemented a custom ML 
Transformer that wraps the Jackson ObjectMapper - this gives you full control 
over how your custom entities / structs are serialized.

> On Mar 11, 2016, at 11:53 AM, Caires Vinicius <caire...@gmail.com> wrote:
> 
> Hmm. I think my problem is a little more complex. I'm using 
> https://github.com/databricks/spark-redshift 
> <https://github.com/databricks/spark-redshift> and when I read from JSON file 
> I got this schema.
> 
> root
> |-- app: string (nullable = true)
> 
>  |-- ct: long (nullable = true)
> 
>  |-- event: struct (nullable = true)
> 
> ||-- attributes: struct (nullable = true)
> 
>  |||-- account: string (nullable = true)
> 
>  |||-- accountEmail: string (nullable = true)
> 
> 
>  |||-- accountId: string (nullable = true)
> 
> 
> 
> I want to transform the Column event into String (formatted as JSON). 
> 
> I was trying to use udf but without success.
> 
> 
> On Fri, Mar 11, 2016 at 1:53 PM Tristan Nixon <st...@memeticlabs.org 
> <mailto:st...@memeticlabs.org>> wrote:
> Have you looked at DataFrame.write.json( path )?
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>  
> <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter>
> 
> > On Mar 11, 2016, at 7:15 AM, Caires Vinicius <caire...@gmail.com 
> > <mailto:caire...@gmail.com>> wrote:
> >
> > I have one DataFrame with nested StructField and I want to convert to JSON 
> > String. There is anyway to accomplish this?
> 



Re: udf StructField to JSON String

2016-03-11 Thread Tristan Nixon
Have you looked at DataFrame.write.json( path )?
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter

> On Mar 11, 2016, at 7:15 AM, Caires Vinicius  wrote:
> 
> I have one DataFrame with nested StructField and I want to convert to JSON 
> String. There is anyway to accomplish this?





Re: Can we use spark inside a web service?

2016-03-10 Thread Tristan Nixon
Hear, hear. That’s why I’m here :)

> On Mar 10, 2016, at 7:32 PM, Chris Fregly  wrote:
> 
> Anyway, thanks for the good discussion, everyone!  This is why we have these 
> lists, right!  :)



Re: Can we use spark inside a web service?

2016-03-10 Thread Tristan Nixon
Very interested, Evan, thanks for the link. It has given me some food for 
thought.

I’m also in the process of building a web application which leverage Spark on 
the back-end for some heavy lifting. I would be curious about your thoughts on 
my proposed architecture:
I was planning on running a spark-streaming app which listens for incoming 
messages on a dedicated queue, and then returns them on a separate one. The 
RESTful web service would handle incoming requests by putting an appropriate 
message on the input queue, and then listen for a response on the output queue, 
transforming the output message into an appropriate HTTP response. How do you 
think this will fair vs. interacting with the spark job service? I was hoping 
that I could minimize the time to launch spark jobs by keeping a streaming app 
running in the background.

> On Mar 10, 2016, at 12:40 PM, velvia.github  wrote:
> 
> Hi,
> 
> I just wrote a blog post which might be really useful to you -- I have just
> benchmarked being able to achieve 700 queries per second in Spark.  So, yes,
> web speed SQL queries are definitely possible.   Read my new blog post:
> 
> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
> 
> and feel free to email me (at vel...@gmail.com) if you would like to follow
> up.
> 
> -Evan
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 





Re: Installing Spark on Mac

2016-03-10 Thread Tristan Nixon
If you type ‘whoami’ in the terminal and it responds with ‘root’, then you’re 
the superuser.
However, as mentioned below, I don’t think it’s a relevant factor.

> On Mar 10, 2016, at 12:02 PM, Aida Tefera <aida1.tef...@gmail.com> wrote:
> 
> Hi Tristan, 
> 
> I'm afraid I wouldn't know whether I'm running it as super user. 
> 
> I have java version 1.8.0_73 and SCALA version 2.11.7
> 
> Sent from my iPhone
> 
>> On 9 Mar 2016, at 21:58, Tristan Nixon <st...@memeticlabs.org> wrote:
>> 
>> That’s very strange. I just un-set my SPARK_HOME env param, downloaded a 
>> fresh 1.6.0 tarball, 
>> unzipped it to local dir (~/Downloads), and it ran just fine - the driver 
>> port is some randomly generated large number.
>> So SPARK_HOME is definitely not needed to run this.
>> 
>> Aida, you are not running this as the super-user, are you?  What versions of 
>> Java & Scala do you have installed?
>> 
>>> On Mar 9, 2016, at 3:53 PM, Aida Tefera <aida1.tef...@gmail.com> wrote:
>>> 
>>> Hi Jakob,
>>> 
>>> Tried running the command env|grep SPARK; nothing comes back 
>>> 
>>> Tried env|grep Spark; which is the directory I created for Spark once I 
>>> downloaded the tgz file; comes back with PWD=/Users/aidatefera/Spark
>>> 
>>> Tried running ./bin/spark-shell ; comes back with same error as below; i.e 
>>> could not bind to port 0 etc.
>>> 
>>> Sent from my iPhone
>>> 
>>>> On 9 Mar 2016, at 21:42, Jakob Odersky <ja...@odersky.com> wrote:
>>>> 
>>>> As Tristan mentioned, it looks as though Spark is trying to bind on
>>>> port 0 and then 1 (which is not allowed). Could it be that some
>>>> environment variables from you previous installation attempts are
>>>> polluting your configuration?
>>>> What does running "env | grep SPARK" show you?
>>>> 
>>>> Also, try running just "/bin/spark-shell" (without the --master
>>>> argument), maybe your shell is doing some funky stuff with the
>>>> brackets.
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 





Re: log4j pains

2016-03-10 Thread Tristan Nixon
Hmmm… that should be right.

> On Mar 10, 2016, at 11:26 AM, Ashic Mahtab  wrote:
> 
> src/main/resources/log4j.properties
> 
> Subject: Re: log4j pains
> From: st...@memeticlabs.org
> Date: Thu, 10 Mar 2016 11:08:46 -0600
> CC: user@spark.apache.org
> To: as...@live.com
> 
> Where in the jar is the log4j.properties file?
> 
> On Mar 10, 2016, at 9:40 AM, Ashic Mahtab  > wrote:
> 
> 1. Fat jar with logging dependencies included. log4j.properties in fat jar. 
> Spark doesn't pick up the properties file, so uses its defaults.



Re: log4j pains

2016-03-10 Thread Tristan Nixon
Where in the jar is the log4j.properties file?

> On Mar 10, 2016, at 9:40 AM, Ashic Mahtab  wrote:
> 
> 1. Fat jar with logging dependencies included. log4j.properties in fat jar. 
> Spark doesn't pick up the properties file, so uses its defaults.



Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
It really shouldn’t; if anything, running as superuser should ALLOW you to bind 
to ports 0, 1, etc.
It seems very strange that it should even be trying to bind to these ports - 
maybe a JVM issue?
I wonder if the old Apple JVM implementations could have used some different 
native libraries for core networking like this...

> On Mar 10, 2016, at 12:40 AM, Gaini Rajeshwar  
> wrote:
> 
> It works just fine as super-user as well.



Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
That’s very strange. I just un-set my SPARK_HOME env param, downloaded a fresh 
1.6.0 tarball, 
unzipped it to local dir (~/Downloads), and it ran just fine - the driver port 
is some randomly generated large number.
So SPARK_HOME is definitely not needed to run this.

Aida, you are not running this as the super-user, are you?  What versions of 
Java & Scala do you have installed?

> On Mar 9, 2016, at 3:53 PM, Aida Tefera  wrote:
> 
> Hi Jakob,
> 
> Tried running the command env|grep SPARK; nothing comes back 
> 
> Tried env|grep Spark; which is the directory I created for Spark once I 
> downloaded the tgz file; comes back with PWD=/Users/aidatefera/Spark
> 
> Tried running ./bin/spark-shell ; comes back with same error as below; i.e 
> could not bind to port 0 etc.
> 
> Sent from my iPhone
> 
>> On 9 Mar 2016, at 21:42, Jakob Odersky  wrote:
>> 
>> As Tristan mentioned, it looks as though Spark is trying to bind on
>> port 0 and then 1 (which is not allowed). Could it be that some
>> environment variables from you previous installation attempts are
>> polluting your configuration?
>> What does running "env | grep SPARK" show you?
>> 
>> Also, try running just "/bin/spark-shell" (without the --master
>> argument), maybe your shell is doing some funky stuff with the
>> brackets.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 





Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
SPARK_HOME and SCALA_HOME are different. I was just wondering whether spark is 
looking in a different dir for the config files than where you’re running it. 
If you have not set SPARK_HOME, it should look in the current directory for the 
/conf dir.

The defaults should be relatively safe, I’ve been using them with local mode on 
my Mac for a long while without any need to change them.

> On Mar 9, 2016, at 2:20 PM, Aida Tefera <aida1.tef...@gmail.com> wrote:
> 
> I don't think I set the SCALA_HOME environment variable
> 
> Also, I'm unsure whether or not I should launch the scripts; it defaults to a 
> single machine (localhost)
> 
> Sent from my iPhone
> 
>> On 9 Mar 2016, at 19:59, Tristan Nixon <st...@memeticlabs.org> wrote:
>> 
>> Also, do you have the SPARK_HOME environment variable set in your shell, and 
>> if so what is it set to?
>> 
>>> On Mar 9, 2016, at 1:53 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
>>> 
>>> There should be a /conf sub-directory wherever you installed spark, which 
>>> contains several configuration files.
>>> I believe that the two that you should look at are
>>> spark-defaults.conf
>>> spark-env.sh
>>> 
>>> 
>>>> On Mar 9, 2016, at 1:45 PM, Aida Tefera <aida1.tef...@gmail.com> wrote:
>>>> 
>>>> Hi Tristan, thanks for your message
>>>> 
>>>> When I look at the spark-defaults.conf.template it shows a spark 
>>>> example(spark://master:7077) where the port is 7077
>>>> 
>>>> When you say look to the conf scripts, how do you mean?
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On 9 Mar 2016, at 19:32, Tristan Nixon <st...@memeticlabs.org> wrote:
>>>>> 
>>>>> Yeah, according to the standalone documentation
>>>>> http://spark.apache.org/docs/latest/spark-standalone.html
>>>>> 
>>>>> the default port should be 7077, which means that something must be 
>>>>> overriding this on your installation - look to the conf scripts!
>>>>> 
>>>>>> On Mar 9, 2016, at 1:26 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
>>>>>> 
>>>>>> Looks like it’s trying to bind on port 0, then 1.
>>>>>> Often the low-numbered ports are restricted to system processes and 
>>>>>> “established” servers (web, ssh, etc.) and
>>>>>> so user programs are prevented from binding on them. The default should 
>>>>>> be to run on a high-numbered port like 8080 or such.
>>>>>> 
>>>>>> What do you have in your spark-env.sh?
>>>>>> 
>>>>>>> On Mar 9, 2016, at 12:35 PM, Aida <aida1.tef...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi everyone, thanks for all your support
>>>>>>> 
>>>>>>> I went with your suggestion Cody/Jakob and downloaded a pre-built 
>>>>>>> version
>>>>>>> with Hadoop this time and I think I am finally making some progress :)
>>>>>>> 
>>>>>>> 
>>>>>>> ukdrfs01:spark-1.6.0-bin-hadoop2.6 aidatefera$ ./bin/spark-shell 
>>>>>>> --master
>>>>>>> local[2]
>>>>>>> log4j:WARN No appenders could be found for logger
>>>>>>> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>>>>>>> log4j:WARN Please initialize the log4j system properly.
>>>>>>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
>>>>>>> more info.
>>>>>>> Using Spark's repl log4j profile:
>>>>>>> org/apache/spark/log4j-defaults-repl.properties
>>>>>>> To adjust logging level use sc.setLogLevel("INFO")
>>>>>>> Welcome to
>>>>>>>   __
>>>>>>> / __/__  ___ _/ /__
>>>>>>> _\ \/ _ \/ _ `/ __/  '_/
>>>>>>> /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>>>>>>> /_/
>>>>>>> 
>>>>>>> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
>>>>>>> 1.8.0_73)
>>>>>>> Type in expressions to have them evaluated.
>>>>>>> Type :help for more information.
>>>>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on 

Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
Also, do you have the SPARK_HOME environment variable set in your shell, and if 
so what is it set to?

> On Mar 9, 2016, at 1:53 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
> 
> There should be a /conf sub-directory wherever you installed spark, which 
> contains several configuration files.
> I believe that the two that you should look at are
> spark-defaults.conf
> spark-env.sh
> 
> 
>> On Mar 9, 2016, at 1:45 PM, Aida Tefera <aida1.tef...@gmail.com> wrote:
>> 
>> Hi Tristan, thanks for your message
>> 
>> When I look at the spark-defaults.conf.template it shows a spark 
>> example(spark://master:7077) where the port is 7077
>> 
>> When you say look to the conf scripts, how do you mean?
>> 
>> Sent from my iPhone
>> 
>>> On 9 Mar 2016, at 19:32, Tristan Nixon <st...@memeticlabs.org> wrote:
>>> 
>>> Yeah, according to the standalone documentation
>>> http://spark.apache.org/docs/latest/spark-standalone.html
>>> 
>>> the default port should be 7077, which means that something must be 
>>> overriding this on your installation - look to the conf scripts!
>>> 
>>>> On Mar 9, 2016, at 1:26 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
>>>> 
>>>> Looks like it’s trying to bind on port 0, then 1.
>>>> Often the low-numbered ports are restricted to system processes and 
>>>> “established” servers (web, ssh, etc.) and
>>>> so user programs are prevented from binding on them. The default should be 
>>>> to run on a high-numbered port like 8080 or such.
>>>> 
>>>> What do you have in your spark-env.sh?
>>>> 
>>>>> On Mar 9, 2016, at 12:35 PM, Aida <aida1.tef...@gmail.com> wrote:
>>>>> 
>>>>> Hi everyone, thanks for all your support
>>>>> 
>>>>> I went with your suggestion Cody/Jakob and downloaded a pre-built version
>>>>> with Hadoop this time and I think I am finally making some progress :)
>>>>> 
>>>>> 
>>>>> ukdrfs01:spark-1.6.0-bin-hadoop2.6 aidatefera$ ./bin/spark-shell --master
>>>>> local[2]
>>>>> log4j:WARN No appenders could be found for logger
>>>>> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>>>>> log4j:WARN Please initialize the log4j system properly.
>>>>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
>>>>> more info.
>>>>> Using Spark's repl log4j profile:
>>>>> org/apache/spark/log4j-defaults-repl.properties
>>>>> To adjust logging level use sc.setLogLevel("INFO")
>>>>> Welcome to
>>>>>    __
>>>>> / __/__  ___ _/ /__
>>>>> _\ \/ _ \/ _ `/ __/  '_/
>>>>> /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>>>>>  /_/
>>>>> 
>>>>> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
>>>>> 1.8.0_73)
>>>>> Type in expressions to have them evaluated.
>>>>> Type :help for more information.
>>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>>> 0. Attempting port 1.
>>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>>> 0. Attempting port 1.
>>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>>> 0. Attempting port 1.
>>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>>> 0. Attempting port 1.
>>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>>> 0. Attempting port 1.
>>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>>> 0. Attempting port 1.
>>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>>> 0. Attempting port 1.
>>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>>> 0. Attempting port 1.
>>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>>> 0. Attempting port 1.
>>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>>> 0. Attempting port 1.
>>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.

Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
There should be a /conf sub-directory wherever you installed spark, which 
contains several configuration files.
I believe that the two that you should look at are
spark-defaults.conf
spark-env.sh


> On Mar 9, 2016, at 1:45 PM, Aida Tefera <aida1.tef...@gmail.com> wrote:
> 
> Hi Tristan, thanks for your message
> 
> When I look at the spark-defaults.conf.template it shows a spark 
> example(spark://master:7077) where the port is 7077
> 
> When you say look to the conf scripts, how do you mean?
> 
> Sent from my iPhone
> 
>> On 9 Mar 2016, at 19:32, Tristan Nixon <st...@memeticlabs.org> wrote:
>> 
>> Yeah, according to the standalone documentation
>> http://spark.apache.org/docs/latest/spark-standalone.html
>> 
>> the default port should be 7077, which means that something must be 
>> overriding this on your installation - look to the conf scripts!
>> 
>>> On Mar 9, 2016, at 1:26 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
>>> 
>>> Looks like it’s trying to bind on port 0, then 1.
>>> Often the low-numbered ports are restricted to system processes and 
>>> “established” servers (web, ssh, etc.) and
>>> so user programs are prevented from binding on them. The default should be 
>>> to run on a high-numbered port like 8080 or such.
>>> 
>>> What do you have in your spark-env.sh?
>>> 
>>>> On Mar 9, 2016, at 12:35 PM, Aida <aida1.tef...@gmail.com> wrote:
>>>> 
>>>> Hi everyone, thanks for all your support
>>>> 
>>>> I went with your suggestion Cody/Jakob and downloaded a pre-built version
>>>> with Hadoop this time and I think I am finally making some progress :)
>>>> 
>>>> 
>>>> ukdrfs01:spark-1.6.0-bin-hadoop2.6 aidatefera$ ./bin/spark-shell --master
>>>> local[2]
>>>> log4j:WARN No appenders could be found for logger
>>>> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>>>> log4j:WARN Please initialize the log4j system properly.
>>>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
>>>> more info.
>>>> Using Spark's repl log4j profile:
>>>> org/apache/spark/log4j-defaults-repl.properties
>>>> To adjust logging level use sc.setLogLevel("INFO")
>>>> Welcome to
>>>>     __
>>>>  / __/__  ___ _/ /__
>>>> _\ \/ _ \/ _ `/ __/  '_/
>>>> /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>>>>   /_/
>>>> 
>>>> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
>>>> 1.8.0_73)
>>>> Type in expressions to have them evaluated.
>>>> Type :help for more information.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.
>>>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>>>> 0. Attempting port 1.

Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
Yeah, according to the standalone documentation
http://spark.apache.org/docs/latest/spark-standalone.html

the default port should be 7077, which means that something must be overriding 
this on your installation - look to the conf scripts!
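If you do want to pin the standalone ports explicitly rather than rely on the
defaults, the relevant settings look roughly like this (the values shown are just
the documented defaults, and "master" is a placeholder hostname):

    # conf/spark-env.sh
    SPARK_MASTER_PORT=7077          # master RPC port for standalone mode
    SPARK_MASTER_WEBUI_PORT=8080    # master web UI port

    # conf/spark-defaults.conf
    spark.master    spark://master:7077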

> On Mar 9, 2016, at 1:26 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
> 
> Looks like it’s trying to bind on port 0, then 1.
> Often the low-numbered ports are restricted to system processes and 
> “established” servers (web, ssh, etc.) and
> so user programs are prevented from binding on them. The default should be to 
> run on a high-numbered port like 8080 or such.
> 
> What do you have in your spark-env.sh?
> 
>> On Mar 9, 2016, at 12:35 PM, Aida <aida1.tef...@gmail.com> wrote:
>> 
>> Hi everyone, thanks for all your support
>> 
>> I went with your suggestion Cody/Jakob and downloaded a pre-built version
>> with Hadoop this time and I think I am finally making some progress :)
>> 
>> 
>> ukdrfs01:spark-1.6.0-bin-hadoop2.6 aidatefera$ ./bin/spark-shell --master
>> local[2]
>> log4j:WARN No appenders could be found for logger
>> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>> log4j:WARN Please initialize the log4j system properly.
>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
>> more info.
>> Using Spark's repl log4j profile:
>> org/apache/spark/log4j-defaults-repl.properties
>> To adjust logging level use sc.setLogLevel("INFO")
>> Welcome to
>>   __
>>/ __/__  ___ _/ /__
>>   _\ \/ _ \/ _ `/ __/  '_/
>>  /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>> /_/
>> 
>> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
>> 1.8.0_73)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
>> 0. Attempting port 1.
>> 16/03/09 18:26:57 ERROR SparkContext: Error initializing SparkContext.
>> java.net.BindException: Can't assign requested address: Service
>> 'sparkDriver' failed after 16 retries!
>>  at sun.nio.ch.Net.bind0(Native Method)
>>  at sun.nio.ch.Net.bind(Net.java:433)
>>  at sun.nio.ch.Net.bind(Net.java:425)
>>  at
>> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>>  at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>>  at
>> io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:125)
>>  at
>> io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:485)
>>  at
>> io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1089)
>>  at
>> io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:430)
>>  at
>> io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:415)
>>  at
>> io.netty.channel.D

Re: Installing Spark on Mac

2016-03-09 Thread Tristan Nixon
Looks like it’s trying to bind on port 0, then 1.
Often the low-numbered ports are restricted to system processes and 
“established” servers (web, ssh, etc.) and
so user programs are prevented from binding on them. The default should be to 
run on a high-numbered port like 8080 or such.

What do you have in your spark-env.sh?
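On OS X this particular failure often turns out to be a bind-address/hostname
problem rather than a permissions one; one thing worth trying (an assumption, not
a confirmed diagnosis for this case) is forcing the driver onto the loopback
interface in conf/spark-env.sh:

    # conf/spark-env.sh
    SPARK_LOCAL_IP=127.0.0.1    # bind the driver/executors to loopback when running locally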

> On Mar 9, 2016, at 12:35 PM, Aida  wrote:
> 
> Hi everyone, thanks for all your support
> 
> I went with your suggestion Cody/Jakob and downloaded a pre-built version
> with Hadoop this time and I think I am finally making some progress :)
> 
> 
> ukdrfs01:spark-1.6.0-bin-hadoop2.6 aidatefera$ ./bin/spark-shell --master
> local[2]
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> Using Spark's repl log4j profile:
> org/apache/spark/log4j-defaults-repl.properties
> To adjust logging level use sc.setLogLevel("INFO")
> Welcome to
>    __
> / __/__  ___ _/ /__
>_\ \/ _ \/ _ `/ __/  '_/
>   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>  /_/
> 
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.8.0_73)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 WARN Utils: Service 'sparkDriver' could not bind on port
> 0. Attempting port 1.
> 16/03/09 18:26:57 ERROR SparkContext: Error initializing SparkContext.
> java.net.BindException: Can't assign requested address: Service
> 'sparkDriver' failed after 16 retries!
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at
> io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:125)
>   at
> io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:485)
>   at
> io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1089)
>   at
> io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:430)
>   at
> io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:415)
>   at
> io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:903)
>   at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:198)
>   at 
> io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:348)
>   at
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>   at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> java.net.BindException: Can't assign requested address: Service
> 'sparkDriver' failed after 16 retries!
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at 

Re: Specify log4j properties file

2016-03-09 Thread Tristan Nixon
You can also package an alternative log4j config in your jar files
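As a rough sketch of that approach: place the file at src/main/resources/log4j.properties
so it lands at the root of the assembled jar. A minimal log4j 1.2 config along these
lines is enough to override the defaults (the levels, layout and package name are
only examples):

    # src/main/resources/log4j.properties
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    # Example of raising the level for your own packages (placeholder name):
    log4j.logger.com.example=DEBUG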

> On Mar 9, 2016, at 12:20 PM, Ashic Mahtab  wrote:
> 
> Found it. 
> 
> You can pass in the jvm parameter log4j.configuration. The following works:
> 
> -Dlog4j.configuration=file:path/to/log4j.properties
> 
> It doesn't work without the file: prefix though. Tested in 1.6.0.
> 
> Cheers,
> Ashic.
> 
> From: as...@live.com
> To: user@spark.apache.org
> Subject: Specify log4j properties file
> Date: Wed, 9 Mar 2016 17:57:00 +
> 
> Hello,
> Is it possible to provide a log4j properties file when submitting jobs to a 
> cluster? I know that by default spark looks for a log4j.properties file in 
> the conf directory. I'm looking for a way to specify a different 
> log4j.properties file (external to the application) without pointing to a 
> completely different conf directory. Is there a way to achieve this?
> 
> Thanks,
> Ashic.



Re: SparkFiles.get() returns with driver path Instead of Worker Path

2016-03-08 Thread Tristan Nixon
Based on your code:

sparkContext.addFile("/home/files/data.txt");
List<String> file = sparkContext.textFile(SparkFiles.get("data.txt")).collect();

I’m assuming the file in “/home/files/data.txt” exists and is readable in the 
driver’s filesystem.
Did you try just doing this:

List file =sparkContext.textFile("/home/files/data.txt").collect();

> On Mar 8, 2016, at 1:20 PM, Ashik Vetrivelu <vcas...@gmail.com> wrote:
> 
> Hey, yeah I also tried by setting sc.textFile() with a local path and it 
> still throws the exception when trying to use collect().
> 
> Sorry I am new to spark and I am just messing around with it.
> 
> On Mar 8, 2016 10:23 PM, "Tristan Nixon" <st...@memeticlabs.org> wrote:
> My understanding of the model is that you’re supposed to execute 
> SparkFiles.get(…) on each worker node, not on the driver.
> 
> Since you already know where the files are on the driver, if you want to load 
> these into an RDD with SparkContext.textFile, then this will distribute it 
> out to the workers, there’s no need to use SparkContext.addFile to do this.
> 
> If you have some functions that run on workers that expects local file 
> resources, then you can use SparkContext.addFile to distribute the files into 
> worker local storage, then you can execute SparkFiles.get separately on each 
> worker to retrieve these local files (it will give different paths on each 
> worker).
> 
> > On Mar 8, 2016, at 5:31 AM, ashikvc <vcas...@gmail.com> wrote:
> >
> > I am trying to play a little bit with apache-spark cluster mode.
> > So my cluster consists of a driver in my machine and a worker and manager in
> > host machine(separate machine).
> >
> > I send a textfile using `sparkContext.addFile(filepath)` where the filepath
> > is the path of my text file in local machine for which I get the following
> > output:
> >
> >INFO Utils: Copying /home/files/data.txt to
> > /tmp/spark-b2e2bb22-487b-412b-831d-19d7aa96f275/userFiles-147c9552-1a77-427e-9b17-cb0845807860/data.txt
> >
> >INFO SparkContext: Added file /home/files/data.txt at
> > http://192.XX.XX.164:58143/files/data.txt with timestamp 1457432207649
> >
> > But when I try to access the same file using `SparkFiles.get("data.txt")`, I
> > get the path to file in my driver instead of worker.
> > I am setting my file like this
> >
> >SparkConf conf = new
> > SparkConf().setAppName("spark-play").setMaster("spark://192.XX.XX.172:7077");
> >conf.setJars(new String[]{"jars/SparkWorker.jar"});
> >JavaSparkContext sparkContext = new JavaSparkContext(conf);
> >sparkContext.addFile("/home/files/data.txt");
> >    List<String> file = sparkContext.textFile(SparkFiles.get("data.txt")).collect();
> > I am getting FileNotFoundException here.
> >
> >
> >
> >
> >
> > --
> > View this message in context: 
> > http://apache-spark-user-list.1001560.n3.nabble.com/SparkFiles-get-returns-with-driver-path-Instead-of-Worker-Path-tp26428.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > For additional commands, e-mail: user-h...@spark.apache.org 
> >
> 



Re: Analyzing json Data streams using sparkSQL in spark streaming returns java.lang.ClassNotFoundException

2016-03-08 Thread Tristan Nixon
this is a bit strange, because you’re trying to create an RDD inside of a 
foreach function (the jsonElements). This executes on the workers, and so will 
actually produce a different instance in each JVM on each worker, not one 
single RDD referenced by the driver, which is what I think you’re trying to get.

Why don’t you try something like:

JavaDStream<String> jsonElements = lines.flatMap( … )

and just skip the lines.foreach?
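A minimal sketch of that restructuring, reusing the names from the quoted snippet
(lines, JavaSQLContextSingleton and executeSQLOperations are the original poster's,
imports as in the original; Java 8 lambdas stand in for the anonymous classes, and
the Spark 1.x FlatMapFunction contract of returning an Iterable is assumed):

    // Build the transformation pipeline on the DStream itself.
    JavaDStream<String> jsonElements = lines
        .flatMap(line -> Arrays.asList(line.split("\n")))
        .filter(s -> s.length() > 0);

    // Create the DataFrame inside foreachRDD from the already-transformed RDD.
    jsonElements.foreachRDD(rdd -> {
        SQLContext sqlContext = JavaSQLContextSingleton.getInstance(rdd.context());
        DataFrame dfJsonElement = sqlContext.read().json(rdd);
        executeSQLOperations(sqlContext, dfJsonElement);
    });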

> On Mar 8, 2016, at 11:59 AM, Nesrine BEN MUSTAPHA wrote:
> 
> Hello,
> 
> I tried to use sparkSQL to analyse json data streams within a standalone 
> application. 
> 
> here is the code snippet that receives the streaming data: 
> final JavaReceiverInputDStream<String> lines = 
> streamCtx.socketTextStream("localhost", Integer.parseInt(args[0]), 
> StorageLevel.MEMORY_AND_DISK_SER_2());
> 
> lines.foreachRDD((rdd) -> {
> 
> final JavaRDD<String> jsonElements = rdd.flatMap(new FlatMapFunction<String, String>() {
> 
> @Override
> public Iterable<String> call(final String line) throws Exception {
> return Arrays.asList(line.split("\n"));
> }
> 
> }).filter(new Function<String, Boolean>() {
> 
> @Override
> public Boolean call(final String v1) throws Exception {
> return v1.length() > 0;
> }
> 
> });
> 
> //System.out.println("Data Received = " + jsonElements.collect().size());
> 
> final SQLContext sqlContext = 
> JavaSQLContextSingleton.getInstance(rdd.context());
> 
> final DataFrame dfJsonElement = sqlContext.read().json(jsonElements); 
> 
> executeSQLOperations(sqlContext, dfJsonElement);
> 
> });
> 
> streamCtx.start();
> 
> streamCtx.awaitTermination();
> 
> }
> 
> 
> 
> 
> 
> 
> 
> 
> 
> I got the following error when the line that builds the DataFrame (presumably the 
> sqlContext.read().json(jsonElements) call) is executed:
> 
> java.lang.ClassNotFoundException: 
> com.intrinsec.common.spark.SQLStreamingJsonAnalyzer$2
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> 
> 
> 
> 
> 
> 



Re: SparkFiles.get() returns with driver path Instead of Worker Path

2016-03-08 Thread Tristan Nixon
My understanding of the model is that you’re supposed to execute 
SparkFiles.get(…) on each worker node, not on the driver.

Since you already know where the files are on the driver, if you want to load 
these into an RDD with SparkContext.textFile, then this will distribute it out 
to the workers, there’s no need to use SparkContext.addFile to do this.

If you have some functions that run on workers that expects local file 
resources, then you can use SparkContext.addFile to distribute the files into 
worker local storage, then you can execute SparkFiles.get separately on each 
worker to retrieve these local files (it will give different paths on each 
worker).
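To make the split concrete, here is a minimal sketch of that pattern in Java (the
path, app name and numbers are placeholders; assumes a Java 8 / Spark 1.x setup):

    import java.io.File;
    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkFiles;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkFilesSketch {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("sparkfiles-sketch"));

        // Driver side: ship the side file to every executor's local working directory.
        sc.addFile("/home/files/data.txt");

        JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4));

        // Worker side: SparkFiles.get resolves to the executor-local copy of the file,
        // which is a different path on each node than on the driver.
        JavaRDD<Long> sizes = nums.map(n -> new File(SparkFiles.get("data.txt")).length());

        System.out.println(sizes.collect());
        sc.stop();
      }
    }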

> On Mar 8, 2016, at 5:31 AM, ashikvc  wrote:
> 
> I am trying to play a little bit with apache-spark cluster mode.
> So my cluster consists of a driver in my machine and a worker and manager in
> host machine(separate machine).
> 
> I send a textfile using `sparkContext.addFile(filepath)` where the filepath
> is the path of my text file in local machine for which I get the following
> output:
> 
>INFO Utils: Copying /home/files/data.txt to
> /tmp/spark-b2e2bb22-487b-412b-831d-19d7aa96f275/userFiles-147c9552-1a77-427e-9b17-cb0845807860/data.txt
> 
>INFO SparkContext: Added file /home/files/data.txt at
> http://192.XX.XX.164:58143/files/data.txt with timestamp 1457432207649
> 
> But when I try to access the same file using `SparkFiles.get("data.txt")`, I
> get the path to file in my driver instead of worker.
> I am setting my file like this
> 
>SparkConf conf = new
> SparkConf().setAppName("spark-play").setMaster("spark://192.XX.XX.172:7077");
>conf.setJars(new String[]{"jars/SparkWorker.jar"});
>JavaSparkContext sparkContext = new JavaSparkContext(conf);
>sparkContext.addFile("/home/files/data.txt");
>    List<String> file = sparkContext.textFile(SparkFiles.get("data.txt")).collect();
> I am getting FileNotFoundException here.
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkFiles-get-returns-with-driver-path-Instead-of-Worker-Path-tp26428.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: OOM exception during Broadcast

2016-03-07 Thread Tristan Nixon
Yeah, the spark engine is pretty clever and it’s best not to prematurely 
optimize. It would be interesting to profile your join vs. the collect on the 
smaller dataset. I suspect that the join is faster (even before you broadcast 
it back out).

I’m also curious about the broadcast OOM - did you try expanding the driver 
memory?

> On Mar 7, 2016, at 8:28 PM, Arash <aras...@gmail.com> wrote:
> 
> So I just implemented the logic through a standard join (without collect and 
> broadcast) and it's working great.
> 
> The idea behind trying the broadcast was that since the other side of join is 
> a much larger dataset, the process might be faster through collect and 
> broadcast, since it avoids the shuffle of the bigger dataset. 
> 
> I think the join is working much better in this case so I'll probably just 
> use that, still a bit curious as why the error is happening.
> 
> On Mon, Mar 7, 2016 at 5:55 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
> I’m not sure I understand - if it was already distributed over the cluster in 
> an RDD, why would you want to collect and then re-send it as a broadcast 
> variable? Why not simply use the RDD that is already distributed on the 
> worker nodes?
> 
>> On Mar 7, 2016, at 7:44 PM, Arash <aras...@gmail.com> wrote:
>> 
>> Hi Tristan, 
>> 
>> This is not static, I actually collect it from an RDD to the driver. 
>> 
>> On Mon, Mar 7, 2016 at 5:42 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
>> Hi Arash,
>> 
>> is this static data?  Have you considered including it in your jars and 
>> de-serializing it from jar on each worker node?
>> It’s not pretty, but it’s a workaround for serialization troubles.
>> 
>>> On Mar 7, 2016, at 5:29 PM, Arash <aras...@gmail.com> wrote:
>>> 
>>> Hello all,
>>> 
>>> I'm trying to broadcast a variable of size ~1G to a cluster of 20 nodes but 
>>> haven't been able to make it work so far.
>>> 
>>> It looks like the executors start to run out of memory during 
>>> deserialization. This behavior only shows itself when the number of 
>>> partitions is above a few 10s, the broadcast does work for 10 or 20 
>>> partitions. 
>>> 
>>> I'm using the following setup to observe the problem:
>>> 
>>> val tuples: Array[((String, String), (String, String))]  // ~ 10M tuples
>>> val tuplesBc = sc.broadcast(tuples)
>>> val numsRdd = sc.parallelize(1 to 5000, 100)
>>> numsRdd.map(n => tuplesBc.value.head).count()
>>> 
>>> If I set the number of partitions for numsRDD to 20, the count goes through 
>>> successfully, but at 100, I'll start to get errors such as:
>>> 
>>> 16/03/07 19:35:32 WARN scheduler.TaskSetManager: Lost task 77.0 in stage 
>>> 1.0 (TID 1677, xxx.ec2.internal): java.lang.OutOfMemoryError: Java heap 
>>> space
>>> at 
>>> java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:3472)
>>> at 
>>> java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3278)
>>> at 
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1789)
>>> at 
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>> at 
>>> scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at 
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> at 
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> at 
>>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>>> at 
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897)
>>> at 
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>> at 
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>> at 
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>>> at 
>>> java.io

Re: OOM exception during Broadcast

2016-03-07 Thread Tristan Nixon
I’m not sure I understand - if it was already distributed over the cluster in 
an RDD, why would you want to collect and then re-send it as a broadcast 
variable? Why not simply use the RDD that is already distributed on the worker 
nodes?

> On Mar 7, 2016, at 7:44 PM, Arash <aras...@gmail.com> wrote:
> 
> Hi Tristan, 
> 
> This is not static, I actually collect it from an RDD to the driver. 
> 
> On Mon, Mar 7, 2016 at 5:42 PM, Tristan Nixon <st...@memeticlabs.org> wrote:
> Hi Arash,
> 
> is this static data?  Have you considered including it in your jars and 
> de-serializing it from jar on each worker node?
> It’s not pretty, but it’s a workaround for serialization troubles.
> 
>> On Mar 7, 2016, at 5:29 PM, Arash <aras...@gmail.com> wrote:
>> 
>> Hello all,
>> 
>> I'm trying to broadcast a variable of size ~1G to a cluster of 20 nodes but 
>> haven't been able to make it work so far.
>> 
>> It looks like the executors start to run out of memory during 
>> deserialization. This behavior only shows itself when the number of 
>> partitions is above a few 10s, the broadcast does work for 10 or 20 
>> partitions. 
>> 
>> I'm using the following setup to observe the problem:
>> 
>> val tuples: Array[((String, String), (String, String))]  // ~ 10M tuples
>> val tuplesBc = sc.broadcast(tuples)
>> val numsRdd = sc.parallelize(1 to 5000, 100)
>> numsRdd.map(n => tuplesBc.value.head).count()
>> 
>> If I set the number of partitions for numsRDD to 20, the count goes through 
>> successfully, but at 100, I'll start to get errors such as:
>> 
>> 16/03/07 19:35:32 WARN scheduler.TaskSetManager: Lost task 77.0 in stage 1.0 
>> (TID 1677, xxx.ec2.internal): java.lang.OutOfMemoryError: Java heap space
>> at 
>> java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:3472)
>> at 
>> java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3278)
>> at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1789)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>> at 
>> scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at 
>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>> at 
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897)
>> at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>> at 
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>> at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>> at 
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>> at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>> at 
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>> at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>> at 
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>> at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>> at 
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>> 
>> 
>> I'm using spark 1.5.2. Cluster nodes are amazon r3.2xlarge. The spark 
>> property maximizeResourceAllocation is set to true (executor.memory = 48G 
>> according to spark ui environment). We're also using kryo serialization and 
>> Yarn is the resource manager.
>> 
>> Any ideas as what might be going wrong and how to debug this?
>> 
>> Thanks,
>> Arash
>> 
> 
> 



[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API

2015-07-10 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622982#comment-14622982
 ] 

Tristan Nixon commented on TIKA-1362:
-

Storing the API key in the properties file is very cumbersome and difficult to 
use. Especially as this is not documented well (I had to dig into the source to 
figure out the package path and property key name). Why not just provide a 
constructor argument or setAPIKey( String key ) method? This is how the 
MicrosoftTranslator works. At the least some consistency across implementations 
and some improved documentation would be very much appreciated. Thanks!

 Add GoogleTranslate implementation of Translation API
 -

 Key: TIKA-1362
 URL: https://issues.apache.org/jira/browse/TIKA-1362
 Project: Tika
  Issue Type: Bug
  Components: translation
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.6


 Add an implementation of the Translation API that uses the Google Translate 
 v2 API and Apache CXF: 
 https://www.googleapis.com/language/translate/v2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API

2015-07-10 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622999#comment-14622999
 ] 

Tristan Nixon commented on TIKA-1362:
-

Great to hear, and thanks for the invite. I'm new to using Tika, but finding it 
immensely useful. I'd be happy to contribute in whatever way I can.

 Add GoogleTranslate implementation of Translation API
 -

 Key: TIKA-1362
 URL: https://issues.apache.org/jira/browse/TIKA-1362
 Project: Tika
  Issue Type: Bug
  Components: translation
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.6


 Add an implementation of the Translation API that uses the Google Translate 
 v2 API and Apache CXF: 
 https://www.googleapis.com/language/translate/v2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2015-05-19 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: model-constructors.patch

I realized that for automatic de-serialization, all models need No-Op 
constructors. See attached.
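For readers following along, the shape of the idea looks roughly like this. It is an
illustration of the approach, not the attached patch: it wraps a model in an
Externalizable holder rather than patching BaseModel itself, and delegates to
OpenNLP's existing serialize(OutputStream) / stream-constructor mechanism:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.Externalizable;
    import java.io.IOException;
    import java.io.ObjectInput;
    import java.io.ObjectOutput;
    import opennlp.tools.sentdetect.SentenceModel;

    public class SerializableSentenceModel implements Externalizable {

      private SentenceModel model;

      // No-op constructor: required so the deserializer can instantiate the object
      // before readExternal() is invoked.
      public SerializableSentenceModel() { }

      public SerializableSentenceModel(SentenceModel model) { this.model = model; }

      public SentenceModel getModel() { return model; }

      @Override
      public void writeExternal(ObjectOutput out) throws IOException {
        // Delegate to OpenNLP's existing stream serialization.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        model.serialize(buf);
        byte[] bytes = buf.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
      }

      @Override
      public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        // Delegate to OpenNLP's existing stream constructor.
        model = new SentenceModel(new ByteArrayInputStream(bytes));
      }
    }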

 Model Objects should be Serializable
 

 Key: OPENNLP-776
 URL: https://issues.apache.org/jira/browse/OPENNLP-776
 Project: OpenNLP
  Issue Type: Improvement
  Components: Formats
Affects Versions: tools-1.5.3
Reporter: Tristan Nixon
Priority: Minor
  Labels: features, patch
 Attachments: BaseModel-serialization.patch, model-constructors.patch


 Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
 enable a number of features offered by other Java frameworks (my own use case 
 is described below). You've already got a good mechanism for 
 (de-)serialization, but it cannot be leveraged by other frameworks without 
 implementing the Serializable interface. I'm attaching a patch to BaseModel 
 that implements the methods in the java.io.Externalizable interface as 
 wrappers to the existing (de-)serialization methods. This simple change can 
 open up a number of useful opportunities for integrating OpenNLP with other 
 frameworks.
 My use case is that I am incorporating OpenNLP into a Spark application. This 
 requires that components of the system be distributed between the driver and 
 worker nodes within the cluster. In order to do this, Spark uses Java 
 serialization API to transmit objects between nodes. This is far more 
 efficient than instantiating models on each node independently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2015-05-19 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550561#comment-14550561
 ] 

Tristan Nixon commented on OPENNLP-776:
---

You're totally welcome! Let me know when this gets merged into a release, so I 
can update my project and get rid of my custom build.

 Model Objects should be Serializable
 

 Key: OPENNLP-776
 URL: https://issues.apache.org/jira/browse/OPENNLP-776
 Project: OpenNLP
  Issue Type: Improvement
  Components: Formats
Affects Versions: tools-1.5.3
Reporter: Tristan Nixon
Priority: Minor
  Labels: features, patch
 Attachments: BaseModel-serialization.patch


 Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
 enable a number of features offered by other Java frameworks (my own use case 
 is described below). You've already got a good mechanism for 
 (de-)serialization, but it cannot be leveraged by other frameworks without 
 implementing the Serializable interface. I'm attaching a patch to BaseModel 
 that implements the methods in the java.io.Externalizable interface as 
 wrappers to the existing (de-)serialization methods. This simple change can 
 open up a number of useful opportunities for integrating OpenNLP with other 
 frameworks.
 My use case is that I am incorporating OpenNLP into a Spark application. This 
 requires that components of the system be distributed between the driver and 
 worker nodes within the cluster. In order to do this, Spark uses Java 
 serialization API to transmit objects between nodes. This is far more 
 efficient than instantiating models on each node independently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2015-05-19 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550604#comment-14550604
 ] 

Tristan Nixon commented on OPENNLP-776:
---

It does not make the (de-)serialization process more efficient. It allows me to 
use a model as a broadcast variable which means it is de-serialized once on 
each worker node, and can then be re-used for all work on that node. Otherwise, 
it may need to be de-serialized multiple times, adding quite a bit of overhead 
to the application.
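As a rough sketch of that usage pattern on the Spark side (sc, docs and sentenceModel
are placeholders, and it assumes the model - or a Serializable wrapper around it, as
in the patch above - can be broadcast):

    // Driver side: ship the model to each worker exactly once.
    Broadcast<SentenceModel> bcModel = sc.broadcast(sentenceModel);

    // Worker side: every task on a node reuses the same deserialized copy.
    JavaRDD<String[]> sentences = docs.map(text ->
        new SentenceDetectorME(bcModel.value()).sentDetect(text));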

 Model Objects should be Serializable
 

 Key: OPENNLP-776
 URL: https://issues.apache.org/jira/browse/OPENNLP-776
 Project: OpenNLP
  Issue Type: Improvement
  Components: Formats
Affects Versions: tools-1.5.3
Reporter: Tristan Nixon
Priority: Minor
  Labels: features, patch
 Attachments: BaseModel-serialization.patch, model-constructors.patch


 Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
 enable a number of features offered by other Java frameworks (my own use case 
 is described below). You've already got a good mechanism for 
 (de-)serialization, but it cannot be leveraged by other frameworks without 
 implementing the Serializable interface. I'm attaching a patch to BaseModel 
 that implements the methods in the java.io.Externalizable interface as 
 wrappers to the existing (de-)serialization methods. This simple change can 
 open up a number of useful opportunities for integrating OpenNLP with other 
 frameworks.
 My use case is that I am incorporating OpenNLP into a Spark application. This 
 requires that components of the system be distributed between the driver and 
 worker nodes within the cluster. In order to do this, Spark uses Java 
 serialization API to transmit objects between nodes. This is far more 
 efficient than instantiating models on each node independently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2015-05-14 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: BaseModel-serialization.patch

My patch

 Model Objects should be Serializable
 

 Key: OPENNLP-776
 URL: https://issues.apache.org/jira/browse/OPENNLP-776
 Project: OpenNLP
  Issue Type: Improvement
  Components: Formats
Affects Versions: tools-1.5.3
Reporter: Tristan Nixon
Priority: Minor
  Labels: features, patch
 Attachments: BaseModel-serialization.patch


 Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
 enable a number of features offered by other Java frameworks (my own use case 
 is described below). You've already got a good mechanism for 
 (de-)serialization, but it cannot be leveraged by other frameworks without 
 implementing the Serializable interface. I'm attaching a patch to BaseModel 
 that implements the methods in the java.io.Externalizable interface as 
 wrappers to the existing (de-)serialization methods. This simple change can 
 open up a number of useful opportunities for integrating OpenNLP with other 
 frameworks.
 My use case is that I am incorporating OpenNLP into a Spark application. This 
 requires that components of the system be distributed between the driver and 
 worker nodes within the cluster. In order to do this, Spark uses Java 
 serialization API to transmit objects between nodes. This is far more 
 efficient than instantiating models on each node independently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (OPENNLP-776) Model Objects should be Serializable

2015-05-14 Thread Tristan Nixon (JIRA)
Tristan Nixon created OPENNLP-776:
-

 Summary: Model Objects should be Serializable
 Key: OPENNLP-776
 URL: https://issues.apache.org/jira/browse/OPENNLP-776
 Project: OpenNLP
  Issue Type: Improvement
  Components: Formats
Affects Versions: tools-1.5.3
Reporter: Tristan Nixon
Priority: Minor


Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
enable a number of features offered by other Java frameworks (my own use case 
is described below). You've already got a good mechanism for 
(de-)serialization, but it cannot be leveraged by other frameworks without 
implementing the Serializable interface. I'm attaching a patch to BaseModel 
that implements the methods in the java.io.Externalizable interface as wrappers 
to the existing (de-)serialization methods. This simple change can open up a 
number of useful opportunities for integrating OpenNLP with other frameworks.

My use case is that I am incorporating OpenNLP into a Spark application. This 
requires that components of the system be distributed between the driver and 
worker nodes within the cluster. In order to do this, Spark uses Java 
serialization API to transmit objects between nodes. This is far more efficient 
than instantiating models on each node independently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets

2015-04-28 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517886#comment-14517886
 ] 

Tristan Nixon commented on SPARK-4414:
--

Thanks, [~petedmarsh], I was having this same issue. It worked fine on my OS X 
laptop but not on an EC2 Linux instance I set up with the spark-ec2 script. My 
local version was built with Hadoop 2.4, but the default for systems configured 
by the script is Hadoop 1. It seems that this problem comes down to the S3 
drivers in the different versions of Hadoop.

I destroyed and then re-launched my ec2 cluster using the 
--hadoop-major-version=2 option, and the resulting version works!

Perhaps support for Hadoop 1 should be deprecated? At least, it probably should 
no longer be the default version used in the spark-ec2 scripts.
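For anyone hitting the same thing, a relaunch along these lines picks up Hadoop 2
(key pair, identity file and cluster name are placeholders):

    ./ec2/spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem \
        --hadoop-major-version=2 launch my-spark-cluster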

 SparkContext.wholeTextFiles Doesn't work with S3 Buckets
 

 Key: SPARK-4414
 URL: https://issues.apache.org/jira/browse/SPARK-4414
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Pedro Rodriguez
Priority: Critical

 SparkContext.wholeTextFiles does not read files which SparkContext.textFile 
 can read. Below are general steps to reproduce, my specific case is following 
 that on a git repo.
 Steps to reproduce.
 1. Create Amazon S3 bucket, make public with multiple files
 2. Attempt to read bucket with
 sc.wholeTextFiles(s3n://mybucket/myfile.txt)
 3. Spark returns the following error, even if the file exists.
 Exception in thread main java.io.FileNotFoundException: File does not 
 exist: /myfile.txt
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
   at 
 org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.init(CombineFileInputFormat.java:489)
 4. Change the call to
 sc.textFile(s3n://mybucket/myfile.txt)
 and there is no error message, the application should run fine.
 There is a question on StackOverflow as well on this:
 http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist
 This is link to repo/lines of code. The uncommented call doesn't work, the 
 commented call works as expected:
 https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19
 It would be easy to use textFile with a multifile argument, but this should 
 work correctly for s3 bucket files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[SAtalk] order of preferences in white / black listing

2003-11-11 Thread Tristan Nixon
Hello all,

I have a question regarding the way in which SA deals
with whitelisting and blacklisting.  If I want to whitelist
all but a few select entries from a domain, how would I do it?
Should the following work?

whitelist_from [EMAIL PROTECTED]
unwhitelist_from [EMAIL PROTECTED]
unwhitelist_from [EMAIL PROTECTED]

I can't seem to find a way to do this.  Here is my predicament, 
I have been getting a whole pile of viagara spams a day, all of which
have my email address as both the To: and From: address.  I have
whitelisted my own domain, but would love to be able to unwhitelist my
own email address ( I don't have much call for sending messages to
myself ).  Anyone know how to do this?

Another thought - it would be really great if you could write a rule
which removed/added an email to the white/blacklist just temporarily
while evaluating that mail.  So, for example, you could check if the To:
and From: address is the same, and then remove that particular address
from the whitelist just for that particular email.

-- 
Cheers,
Tristan
[EMAIL PROTECTED]



---
This SF.Net email sponsored by: ApacheCon 2003,
16-19 November in Las Vegas. Learn firsthand the latest
developments in Apache, PHP, Perl, XML, Java, MySQL,
WebDAV, and more! http://www.apachecon.com/
___
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk